I was wondering if there is an optimization in gcc that can make some single-threaded code like the example below execute in parallel. If no, why? If yes, what kind of optimizations are possible?
#include <iostream>
int main(int argc, char *argv[])
{
    int array[10];
    for (int i = 0; i < 10; ++i) {
        array[i] = 0;
    }
    for (int i = 0; i < 10; ++i) {
        array[i] += 2;
    }
    return 0;
}
Added:
Thanks for the OpenMP links; as useful as OpenMP is, my question is about compiling the same code without having to rewrite anything.
So basically I want to know if:
Is it possible to make code parallel (at least in some cases) without rewriting it?
If yes, what cases can be handled? If not, why?
The compiler can try to automatically parallelise your code, but it won't do it by creating threads. It may use vectorised instructions (Intel intrinsics for an Intel CPU, for example) to operate on multiple elements at a time, where it can detect that using those instructions is possible (for example, when you perform the same operation multiple times on consecutive elements of a correctly aligned data structure). You can help the compiler by telling it which instruction set extensions your CPU supports (-mavx, -msse4.2, ... for example).
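For illustration (a sketch, not code from the question; the function name and flags are made up), this is the kind of loop shape GCC can usually auto-vectorise when built with something like g++ -O3 -mavx2:

void add_scaled(float* y, const float* x, float a, int n)
{
    // Same operation on consecutive elements, no dependency between
    // iterations: a good candidate for the compiler's SIMD instructions.
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}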
You can also use these instructions directly, but that requires a non-trivial amount of work for the programmer. There are also libraries which do this already (see the vector class library on Agner Fog's site).
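For comparison, doing it by hand with SSE2 intrinsics for the question's second loop might look roughly like this (a sketch only; it assumes the element count is a multiple of 4, and the name add_two is made up):

#include <immintrin.h>   // Intel intrinsics

void add_two(int* a, int n)
{
    const __m128i two = _mm_set1_epi32(2);   // four copies of the constant 2
    for (int i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<__m128i*>(a + i));
        v = _mm_add_epi32(v, two);           // add 2 to each of the 4 lanes
        _mm_storeu_si128(reinterpret_cast<__m128i*>(a + i), v);
    }
}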
You can also get the compiler to auto-parallelise using multiple threads via OpenMP (OpenMP introduction), though that is more a case of you instructing the compiler to parallelise than the compiler auto-parallelising by itself.
Yes, gcc with -ftree-parallelize-loops=4 will attempt to auto-parallelize with 4 threads, for example.
I don't know how well gcc does at auto-parallelization, but it is something that compiler developers have been working on for years. As other answers point out, giving the compiler some guidance with OpenMP pragmas can give better results. (e.g. by letting the compiler know that it doesn't matter what order something happens in, even when that may slightly change the result, which is common for floating point. Floating point math is not associative.)
And also, only doing auto-parallelization for #pragma omp loops means only the really important loops get this treatment. -ftree-parallelize-loops probably benefits from PGO (profile-guided optimization) to know which loops are actually hot and worth parallelizing and/or vectorizing.
It's somewhat related to finding the kind of parallelism that SIMD can take advantage of, for auto-vectorizing loops. (Which is enabled by default at -O3 in gcc, and at -O2 in clang).
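As a rough sketch of the kind of loop -ftree-parallelize-loops can split across threads (the function and sizes here are made up; the worker threads come from libgomp, the same runtime OpenMP uses):

#include <cstddef>

// Built with e.g. "g++ -O2 -ftree-parallelize-loops=4 scale.cpp", GCC may
// divide this loop's iteration space among 4 threads, since the iterations
// are independent; it typically inserts a runtime check that n is large
// enough for the threading overhead to pay off.
void scale(double* x, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        x[i] *= 2.0;
}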
Compilers are allowed to do whatever they want as long as the observable behavior (see 1.9 [intro.execution] paragraph 8) is identical to that specified by the [correct(*)] program. Observable behavior is specified in terms of I/O operations (using standard C++ library I/O) and access to volatile objects (although the compiler actually isn't required to treat volatile objects specially if it can prove that they aren't in observable memory). To this end the C++ execution system may employ parallel techniques.
Your example program actually has no observable outcome, and compilers are good at constant-folding programs to find out that the program actually does nothing. At best, the heat radiated from the CPU could be an indication of work, but the amount of energy consumed isn't one of the observable effects, i.e., the C++ execution system isn't required to produce it. If you compile the code above with clang with optimization turned on (-O2 or higher) it will actually remove the loops entirely (use the -S option to have the compiler emit assembly code so you can inspect the result reasonably easily).
Assuming you actually have loops which are forced to be executed, most contemporary compilers (at least gcc, clang, and icc) will try to vectorize the code to take advantage of SIMD instructions. To do so, the compiler needs to comprehend the operations in the code to prove that parallel execution doesn't change the results or introduce data races (as far as I can tell, the exact results are actually not necessarily retained when floating-point operations are involved, as some compilers happily parallelize, e.g., loops adding floats although floating-point addition isn't associative).
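A concrete instance of that floating-point caveat (a sketch, not taken from the answer): a plain float reduction like the one below normally stays scalar at -O3 in GCC, and only gets vectorized once you permit reassociation, e.g. with -ffast-math/-fassociative-math or an explicit #pragma omp simd reduction(+:s) under -fopenmp-simd.

// Vectorizing this sum reorders the additions, which changes rounding,
// so the compiler won't do it unless told that the reordering is acceptable.
float sum(const float* x, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += x[i];
    return s;
}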
I'm not aware of a contemporary compiler which will utilize different threads of execution to improve the speed of execution without some form of hint, like OpenMP's pragmas. However, discussions at the committee meetings imply that compiler vendors are at least considering doing so.
(*) The C++ standard imposes no restriction on the C++ execution system in case the program execution results in undefined behavior. Correct programs wouldn't invoke any form of undefined behavior.
tl;dr: compilers are allowed but not required to execute code in parallel and most contemporary compilers do so in some situations.
If you want to parallelize your C++ code, you can use OpenMP. Official documentation can be found here: openmp doc
OpenMP provides pragmas so that you can indicate to the compiler that a portion of code should be run using a certain number of threads. Sometimes you specify this manually, and other pragmas can automatically choose the number of cores to use.
The code below is an example from the official documentation:
#include <cmath>
int main() {
    const int size = 256;
    double sinTable[size];

    #pragma omp parallel for
    for (int n = 0; n < size; ++n) {
        sinTable[n] = std::sin(2 * M_PI * n / size);
    }
}
This code will automatically parallelize the for loop, which answers your question. OpenMP offers a lot of other possibilities; you can read the documentation to learn more.
If you need to understand how to compile with OpenMP support, see this Stack Overflow thread: openmp compilation thread.
Be careful: if you don't use the OpenMP-specific compiler options, the pragmas will simply be ignored and your code will run on a single thread.
I hope this helps.
Related
The compiler can change the order of unrelated instructions as an optimization.
Can it also silently arrange for them to be executed on different cores?
For example:
...
for (...)
{
    //...
    int a = a1 + a2;
    int b = b1 + b2;
    int c = c1 + c2;
    int d = d1 + d2;
    //...
}
...
Can it happen that, as an optimization, not just the order of execution is changed, but also the number of cores used? Does the standard place any restrictions on the compiler here?
UPD: I'm not asking how to parallelize the code; I'm asking whether, if it was not parallelized explicitly, it can still be parallelized by the compiler.
There is more here than meets the eye. Most likely the instructions (in your example) will end up being run in parallel, but not in the way you think.
There are many levels of hardware parallelism in a CPU, multiple cores being just the highest one 1). Inside a CPU core you have other levels of hardware parallelization that are mostly transparent 2) (you don't control them via software and you don't actually see them, only perhaps their side effects sometimes). Pipelines, extra bus lanes, and multiple ALUs (Arithmetic Logic Units) and FPUs (Floating Point Units) per core are some of them.
Different stages of your instructions will be run in parallel in the pipeline (modern x86 processors have over a dozen pipeline stages) and possibly different instructions will run in parallel in different ALUs (modern x86 CPUs have around 5 ALUs per core).
All this happens without the compiler doing anything 2). And it's free (given the hardware; it was not free to add these capabilities to the hardware). Executing the instructions on different cores is not free. Creating different threads is costly. Moving the data so it is available to other cores is costly. Synchronization to wait for the execution from other cores is costly. There is a lot of overhead associated with creating and synchronizing threads. It is just not worth it for small instructions like this. And the cases that would see a real benefit from multi-threading would require an analysis that is far too complicated today, so practically not feasible. Someday in the future we will have compilers that can identify that your serial algorithm is actually a sort and efficiently and correctly parallelize it. Until then we have to rely on language support, library support and/or developer support for parallelizing algorithms.
1) well, actually hyper-threading is.
2) As pointed out by MSalters:
modern compilers are very much aware of the various ALUs and will do work to benefit from them. In particular, register assignments are optimized so you don't have ALUs compete for the same register, something which may not be apparent from the abstract sequential model.
All of this only indirectly influences the execution to benefit the hardware architecture; there are no explicit instructions or declarations.
Yes, the compiler can do things in any order (including not doing them at all), so long as the observable behaviour generated matches what the observable behaviour of the code should be. Assembly instructions, runtime, thread count, etc. are not observable behaviour.
I should add that it's unlikely a compiler would decide to do this without explicit instruction from the programmer to do so; even though the standard allows it, the compiler exists to help the programmer, and randomly starting extra threads would be unexpected in many cases.
I parallelized some C++ code with OpenMP.
But what if my program will run on a single-core machine?
Do I need to disable threading at runtime, i.e.:
Check the number of cores
If cores > 1, use OpenMP
Else ignore the OpenMP directives
If yes, does OpenMP have a special directive for it?
No, you don't need to disable OpenMP or threading for running on one core; and for situations where you might want to, you're probably better off explicitly re-compiling without OpenMP, although for complex parallelizations there are other measures, mentioned in the comments, that you can take as well.
When running on a single core or even hardware thread, even if you change nothing - not even the number of threads your code launches - correct, deadlock-free threading code should still run correctly, as the operating system schedules the various threads on the core.
Now, context switching between threads is costly overhead. Typical OpenMP code, which is compute-bound and relies on work-sharing constructs to assign work between threads, treats the number of threads as a parameter and launches as many threads as you have cores or hardware threads available. For such code, where you are just using constructs like
#pragma omp parallel for
for (i = 0; i < N; i++)
    data[i] = expensive_function(i);
then running on one core will likely only use one thread, or you can explicitly set the number of threads to one using the OMP_NUM_THREADS environment variable. If OpenMP uses only one thread and the computation is time-consuming enough, the overhead from the threading library in the above loop is negligible. In that case, there's really no disabling of OpenMP necessary; you're just running on one thread. You can also set the number of threads within the program using omp_set_num_threads(), but best practice is normally to leave this to the environment at runtime.
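A minimal sketch of doing the same thing from inside the program (equivalent in effect to OMP_NUM_THREADS=1; compile with -fopenmp):

#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_num_threads(1);   // force a single thread for subsequent parallel regions

    #pragma omp parallel
    {
        // With one thread this prints a single line: "thread 0 of 1".
        std::printf("thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}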
However, there's a downside. Compiling with OpenMP disables certain optimizations. For instance, because the work decomposition is done at runtime, even loops with compiled-in trip count limits may not be able to, say, be unrolled or vectorized as effectively because it isn't known how many trips through the loop each thread will take. In that case, if you know that your code will be run on a single core, it may be worth doing the compilation without OpenMP enabled as well, and using that binary for single-core runs. You can also use this approach to test to see if the difference in optimizations matters, running the OpenMP-enabled version with OMP_NUM_THREADS=1 and comparing the timing to that of serial binary.
Of course, if your threading is more complex than using simple work sharing constructs, then it starts being harder to make generalizations. If you have assumptions built into your code about how many threads are present - maybe you have an explicit producer/consumer model for processing work, or a data decomposition hardcoded in, either of which are doable in OpenMP - then it's harder to see how things work. You may also have parallel regions which are much less work than a big computational loop; in those cases, where overhead even with one thread might be significant, it might be best to use if clauses to provide explicit serial paths, e.g.:
nThreadMax = omp_get_max_threads();

#pragma omp parallel if (nThreadMax > 1)
if (omp_in_parallel()) {
    // Parallel code path
} else {
    // Serial code path
}
But now doing compilation without OpenMP becomes more complicated.
Summarizing:
For big heavy computation work, which is what OpenMP is typically used for, it probably doesn't matter; use OMP_NUM_THREADS=1
You can test whether the overhead and disabled optimizations matter by compiling without OpenMP and comparing the serial runtime to the one-thread OpenMP runtime
For more complicated threading cases, it's hard to say much in general; it depends.
I believe there is a function called:
omp_get_num_procs()
that will let you know how many processors are available for OpenMP to work with. Then there are several ways to disable OpenMP. From your code you can call:
omp_set_num_threads(1)
Just remember that even on a single core you can get some boost with OpenMP. It only depends on the specifics of your case.
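Putting those two calls together, a minimal sketch (the function name configure_threads is just illustrative):

#include <omp.h>

// If the OpenMP runtime only sees one processor, pin the thread count to 1;
// otherwise leave the runtime's default (usually one thread per core).
void configure_threads()
{
    if (omp_get_num_procs() <= 1)
        omp_set_num_threads(1);
}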
I'm working on a codebase that has a lot of SIMD intrinsic code. Now that we have AVX2, we still need to have SIMD code that runs on non-AVX2-capable processors, which will be significantly more work. Plus, the 128-bit lane-crossing limitations of AVX2 shuffles also complicate things. For these reasons, it's a good time to rely more on auto-vectorization. The main things that scare me are the prospect of a single innocent change killing the parallelism and the prospect of debugging auto-vectorized code in case there is a problem.
I've compiled the following with g++ -O1 -g -ftree-vectorize and attempted to step through with GDB (does anyone know why -ftree-vectorize doesn't work with -O0 ?)
float a[1000], b[1000], c[1000];

int main(int argc, char **argv)
{
    for (int i = 0; i < argc; ++i)
        c[i] = a[i] + b[i];
    return 0;
}
but I don't get any meaningful results. For example, sometimes the value of i says <optimized out>, while other times it jumps by 20.
It seems the main problem is that it's difficult to map the SIMD state to the original C state for debugging. But realistically, can it be done?
Using a debugger on auto-vectorized code is tricky, esp. when you want to inspect variables that need to behave differently (e.g. the loop counter).
You can either use a debug build (-O0 or -Og), or you can understand how the compiler vectorized the code and examine the asm and the registers. Depending on what kind of bug you need to track down, you might or might not have a problem with an auto-vectorized build.
It sounds from the comments like you're more interested in checking the efficiency of the auto-vectorization, rather than actually debugging to fix logic bugs in your code. Looking at the asm, and benchmarks, is probably your best bet. (even a simple rdtsc before/after a call, or in a unit-test that tests performance as well as correctness.)
Sometimes the compiler will generate multiple versions of a loop, e.g. for the case where the input arrays overlap, and for the case where they don't. Single-stepping (by instruction, with stepi, with layout asm in gdb) can help, until you find the loop that actually does most of the work. Then you can focus on just how it's vectorized. If you want to eliminate the checks and alternate versions, restrict pointers can be helpful. There's also p = __builtin_assume_aligned(p, 16).
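For example (a sketch using GCC extensions, not code from the question): marking the pointers as non-aliasing and asserting their alignment lets GCC drop the overlap checks and the misalignment prologue when it vectorizes the loop.

// __restrict__ promises the arrays don't overlap; __builtin_assume_aligned
// tells GCC each pointer is 16-byte aligned, so the vectorized loop needs
// no scalar peel for a misaligned start.
void vec_add(float* __restrict__ c, const float* __restrict__ a,
             const float* __restrict__ b, int n)
{
    a = static_cast<const float*>(__builtin_assume_aligned(a, 16));
    b = static_cast<const float*>(__builtin_assume_aligned(b, 16));
    c = static_cast<float*>(__builtin_assume_aligned(c, 16));
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}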
You could also use Intel's free code analyzer to attempt to statically analyze how many cycles an iteration takes. Put IACA marks at the top of your loop body and after the closing paren of your loop, and hope GCC puts them in appropriate places in the auto-vectorized loop, and that the inline asm doesn't break auto-vectorizing.
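Roughly what that placement looks like (a sketch; IACA_START and IACA_END are the marker macros from the iacaMarks.h header shipped with the IACA tool):

#include "iacaMarks.h"   // provides the IACA_START / IACA_END marker macros

void marked_add(float* c, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; ++i) {
        IACA_START           // marks the top of the loop body
        c[i] = a[i] + b[i];
    }
    IACA_END                 // marks the point just after the loop
}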
No optimization answer would be complete without a link to http://agner.org/optimize/, so here you go.
I'm developing a scientific code, so of course speed is of the essence. Now because of that portability is not really an issue and so I know how many openmp threads I will have available already when compiling the program. Can I use this information to perform any additional optimization? If yes, how do I do so?
Since it was pointed out that this question is very broad, I want to narrow it a bit to automatic (i.e. compiler) optimization, such as setting compiler flags or similar things.
Cheers
-A
Well, you can modify the code such that it can be divided into n independent regions (n = number of threads).
You should prefer using sections; they give better speedups compared to parallel for loops due to reduced inter-processor communication. A minimal sketch follows below.
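A minimal sections sketch (compile with -fopenmp; work_a and work_b stand in for your independent regions):

#include <cstdio>

static void work_a() { std::puts("region A"); }   // placeholder for region 1
static void work_b() { std::puts("region B"); }   // placeholder for region 2

int main()
{
    // Each section is handed to a different thread of the team.
    #pragma omp parallel sections
    {
        #pragma omp section
        work_a();

        #pragma omp section
        work_b();
    }
    return 0;
}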
Does MSVC automatically optimize computation on dual core architecture?
void Func()
{
    Computation1();
    Computation2();
}
If given two computations with no relation to each other in a function, does the Visual Studio compiler automatically optimize the computation and allocate them to different cores?
Don't quote me on it but I doubt it. The OpenMP pragmas are the closest thing to what you're trying to do here, but even then you have to tell the compiler to use OpenMP and delineate the tasks.
Barring linking to libraries which are inherently multi-threaded, if you want to use both cores you have to set up threads and divide the work you want done intelligently.
No. It is up to you to create threads (or fibers) and specify what code runs on each one. The function as defined will run sequentially. It may switch to another core (thanks Drew) during execution, but it will still be sequential. In order for two functions to run concurrently on two different cores, they must first be running in two separate threads.
As greyfade points out, the compiler is unable to detect whether it is possible. In fact, I suspect that this is in the class of NP-Complete problems. If I am wrong, I am sure one of the compiler gurus will let me know.
There's no reliable way for the compiler to detect that the two functions are completely independent and that they have no state. Therefore, there's no way for the compiler to know that it's safe to break them out into separate threads of execution. In fact, threads aren't even part of the C++ standard (until C++1x), and even when they will be, they won't be an intrinsic feature - you must use the feature explicitly to benefit from it.
If you want your two functions to run in independent threads, then create independent threads for them to execute in. Check out boost::thread (which is also available in the std::tr1 namespace if your compiler has it). It's easy to use and works perfectly for your use case.
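A sketch of what that looks like with boost::thread, using the question's two functions:

#include <boost/thread.hpp>

void Computation1();   // defined elsewhere, as in the question
void Computation2();

void Func()
{
    // Explicitly run the two independent computations on separate threads;
    // the compiler will not do this for you.
    boost::thread t1(Computation1);
    boost::thread t2(Computation2);
    t1.join();
    t2.join();
}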
No. Madness would ensue if compilers did such a thing behind your back; what if Computation2 depended on side effects of Computation1?
If you're using VC10, look into the Concurrency Runtime (ConcRT or "concert") and its partner, the Parallel Patterns Library (PPL).
Similar solutions include OpenMP (kind of old and busted IMO, but widely supported) and Intel's Threading Building Blocks (TBB).
The compiler can't tell if it's a good idea.
First, of course, the compiler must be able to prove that it would be a safe optimization: that the functions can safely be executed in parallel. In general, that's an NP-complete problem, but in many simple cases the compiler can figure it out (it already does a lot of dependency analysis).
Some bigger problems are:
it might turn out to be slower. Creating threads is a fairly expensive operation. The cost of that may just outweigh the gain from parallelizing the code.
it has to work well regardless of the number of CPU cores. The compiler doesn't know how many cores will be available when you run the program. So it'd have to insert some kind of optional forking code. If a core is available, follow this code path and branch out into a separate thread, otherwise follow this other code path. And again, more code and more conditionals also have an effect on performance. Will the result still be worth it? Perhaps, but how is the compiler supposed to know that?
it might not be what the programmer expects. What if I already create precisely two CPU-heavy threads on a dual-core system? I expect them both to be running 99% of the time. Suddenly the compiler decides to create more threads under the hood, and suddenly I have three CPU-heavy threads, meaning that mine get less execution time than I'd expected.
How many times should it do this? If you run the code in a loop, should it spawn a new thread in every iteration? Sooner or later the added memory usage starts to hurt.
Overall, it's just not worth it. There are too many cases where it might backfire. Added to the fact that the compiler could only safely apply the optimization in fairly simple cases in the first place, it's just not worth the bother.