OMP CRITICAL in OpenMP - Fortran

What does this mean in FORTRAN?
C$OMP CRITICAL (UNNAMED)
And
C$OMP END CRITICAL (UNNAMED)
Doesn't a line starting with C mean that it is a comment? But apparently when I remove this line, it doesn't work. What is going on here?

This is an OpenMP directive in fixed-form Fortran. In your case, the critical section is named UNNAMED. To quote the documentation, a critical section
specifies a region of code that must be executed by only one thread at a time.
For fixed-form Fortran, OpenMP directives are prefixed with C$OMP; for free-form it is !$OMP.
As with all compiler directives in Fortran, OpenMP directives are written as comments so that a compiler that does not understand them can simply ignore them.
Critical sections are typically used to avoid race conditions on shared data. That is probably the case here, since your code breaks when you remove the directives.
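To make this concrete, here is a minimal sketch of the same idea in C/C++ syntax (the loop bound and variable names are made up for illustration); the critical region ensures the shared variable is updated by only one thread at a time, and if OpenMP support is not enabled the pragmas are ignored and the code simply runs serially:
#include <cstdio>

int main() {
    long total = 0;

    // The iterations are distributed across threads, but the update of the
    // shared variable 'total' is serialized by the critical section, so
    // only one thread at a time executes it.
    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i) {
        #pragma omp critical
        {
            total += i;
        }
    }

    std::printf("total = %ld\n", total);
    return 0;
}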

Related

OpenMP consumes all the CPU power before even compiling/running the code? (parallel for)

I was looking for a way to parallelize for loops without implementing pthread routines and so on myself. I stumbled over OpenMP and the #pragma omp parallel for default(none) directive. Since my for loop has several variables that are shared (like some integer values, and also some arrays where I want to store things I calculate in the loop, at the respective index position), I added shared(variable1, variable2, ...) and so on. However, after doing so I noticed that the warnings in CLion which highlight the shared variables don't go away. Furthermore, I noticed that when I put the shared clause in my code, all 6 of my CPU cores get busy, most of the time at 100 percent usage.
This seems very odd to me, since I haven't even compiled the code yet. The cores start working as soon as I add the shared() clause with some variables to the code.
I have never worked with OpenMP, so I don't know whether I may be using it wrong. It would be great if someone could help me out with this, or give a hint as to why this happens.
Edit:
For clarification: by warnings I mean that the IDE underlines in red all the variables which appear to be shared. The CPU consumption is by the IDE itself, when adding the shared() clause to the code. However, I have no clue why adding this clause would consume this much CPU.

Image stitching is taking more time with SeamFinder and ExposureCompensator

I am trying to stitch images, and the code I am working on uses SeamFinder and ExposureCompensator along with other functions. While running the code, these two steps take a lot of time. Is there an alternative, or is there a way to improve their performance?
Ptr<ExposureCompensator> compensator = ExposureCompensator::createDefault(expos_comp_type);
compensator->feed(corners, images_warped, masks_warped);
seam_finder = makePtr<GraphCutSeamFinder>(GraphCutSeamFinderBase::COST_COLOR);
seam_finder->find(images_warped_f, corners, masks_warped);
The above are the two functions which are taking time.
Please help me in solving the problem.
Thanks in advance.
Image stitching via OpenCV is known to be slow in many cases. Maybe you can give OpenMP a shot here and counter the delay you are facing by using parallelization.
OpenMP consists of a set of compiler #pragmas that control how the program works. The pragmas are designed so that even if the compiler does not support them, the program will still yield correct behavior, just without any parallelism.
Different iterations of a loop often have nothing to do with each other, which makes such loops a prime target for parallelization. OpenMP exploits this common program characteristic, so it is extremely easy to let an OpenMP program use multiple processors simply by adding a few lines of compiler directives to your source code.
If you are running a loop in which a set of images is being stitched, you can make sure that the stitching for each set of images runs in parallel.
#pragma omp parallel for
for( ... )
{
    // Image-stitching algorithms go here.
}
The compiler directive #pragma omp parallel for tells the compiler to auto-parallelize the for loop with OpenMP.
For non-loop code, or for independent sections of code, you can do something of this sort:
#pragma omp parallel sections
{
    #pragma omp section
    {
        DoSomething();
    }
    #pragma omp section
    {
        DoSomethingElseParallely();
    }
}
I know that the answer might not directly help you out, but it might give you some avenues to dig into.
You can read more about the usage of OpenMP loop parallelism and OpenMP: Sections before using them.

Can gcc make my code parallel?

I was wondering if there is an optimization in gcc that can make some single-threaded code, like the example below, execute in parallel. If not, why? If so, what kinds of optimizations are possible?
#include <iostream>

int main(int argc, char *argv[])
{
    int array[10];
    for(int i = 0; i < 10; ++i){
        array[i] = 0;
    }
    for(int i = 0; i < 10; ++i){
        array[i] += 2;
    }
    return 0;
}
Added:
Thanks for the OpenMP links, and as much as I think they are useful, my question is about compiling the same code without the need to rewrite anything.
So basically I want to know:
Is it possible to make code parallel (at least in some cases) without rewriting it?
If yes, what cases can be handled? If not, why?
The compiler can try to automatically parallelise your code, but it won't do it by creating threads. It may use vectorised instructions (for example, SSE/AVX instructions on an Intel CPU) to operate on multiple elements at a time, where it can detect that using those instructions is possible (for example, when you perform the same operation multiple times on consecutive elements of a correctly aligned data structure). You can help the compiler by telling it which instruction set extensions your CPU supports (-mavx, -msse4.2, ..., for example).
You can also use these instructions directly, but that requires a non-trivial amount of work from the programmer. There are also libraries which do this already (see, for example, the vector class library on Agner Fog's site).
You can get the compiler to auto-parallelise using multiple threads by using OpenMP (OpenMP introduction), although that is more a case of you instructing the compiler to parallelise than the compiler auto-parallelising by itself.
Yes, gcc with -ftree-parallelize-loops=4 will attempt to auto-parallelize with 4 threads, for example.
I don't know how well gcc does at auto-parallelization, but it is something that compiler developers have been working on for years. As other answers point out, giving the compiler some guidance with OpenMP pragmas can give better results. (e.g. by letting the compiler know that it doesn't matter what order something happens in, even when that may slightly change the result, which is common for floating point. Floating point math is not associative.)
Also, doing auto-parallelization only for loops marked with #pragma omp means only the really important loops get this treatment. -ftree-parallelize-loops probably benefits from PGO (profile-guided optimization) to know which loops are actually hot and worth parallelizing and/or vectorizing.
It's somewhat related to finding the kind of parallelism that SIMD can take advantage of, for auto-vectorizing loops. (Which is enabled by default at -O3 in gcc, and at -O2 in clang).
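As a rough sketch of what that can look like in practice (the array, sizes, and file name here are made up, and whether gcc actually parallelizes the loop depends on its cost model):
#include <cstdio>

// Compile with, for example:
//   g++ -O3 -ftree-parallelize-loops=4 example.cpp
// gcc may then split the loop below across 4 threads, provided its cost
// model decides the loop is hot enough to outweigh the threading overhead.

const int N = 10'000'000;
static double data[N];

int main() {
    for (int i = 0; i < N; ++i) {
        data[i] = i * 2.0 + 1.0;      // independent iterations, no loop-carried dependency
    }
    std::printf("%f\n", data[N - 1]); // use the result so the loop is not optimized away
    return 0;
}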
Compilers are allowed to do whatever they want as long as the observable behavior (see 1.9 [intro.execution] paragraph 8) is identical to that specified by the [correct(*)] program. Observable behavior is specified in terms of I/O operations (using standard C++ library I/O) and access to volatile objects (although the compiler actually isn't required to treat volatile objects specially if it can prove that these aren't in observable memory). To this end the C++ execution system may employ parallel techniques.
Your example program actually has no observable outcome, and compilers are good at constant-folding programs to find out that the program actually does nothing. At best, the heat radiated from the CPU could be an indication of work, but the amount of energy consumed isn't one of the observable effects, i.e., the C++ execution system isn't required to produce it. If you compile the code above with clang with optimization turned on (-O2 or higher), it will actually remove the loops entirely (use the -S option to have the compiler emit assembly code so you can inspect the results reasonably easily).
Assuming you actually have loops which are forced to be executed, most contemporary compilers (at least gcc, clang, and icc) will try to vectorize the code, taking advantage of SIMD instructions. To do so, the compiler needs to comprehend the operations in the code to prove that parallel execution doesn't change the results or introduce data races (as far as I can tell, the exact results are actually not necessarily retained when floating point operations are involved, as some of the compilers happily parallelize, e.g., loops adding floats, although floating point addition isn't associative).
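As a small illustration (this function is made up, not from the question), a loop over a plain array argument is exactly the kind of code these compilers will typically vectorize, and you can check the result by inspecting the generated assembly:
// Compile with, e.g., g++ -O3 -S vec.cpp and look for packed SIMD
// instructions (such as addps/vaddps on x86) in the generated assembly.
void add2(float* a, int n) {
    for (int i = 0; i < n; ++i) {
        a[i] += 2.0f;   // independent element-wise update, safe to vectorize
    }
}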
I'm not aware of a contemporary compiler which will utilize different threads of execution to improve the speed of execution without some form of hint like OpenMP's pragmas. However, discussions at the committee meetings imply that compiler vendors are at least considering doing so.
(*) The C++ standard imposes no restriction on the C++ execution system in case the program execution results in undefined behavior. Correct programs wouldn't invoke any form of undefined behavior.
tl;dr: compilers are allowed but not required to execute code in parallel and most contemporary compilers do so in some situations.
If you want to parallelize your C++ code, you can use OpenMP. The official documentation can be found here: openmp doc
OpenMP provides pragmas so that you can indicate to the compiler that a portion of code should be executed by a certain number of threads. Sometimes you can set this manually, and other pragmas can automatically choose the number of cores used.
The code below is an example of the official documentation :
#include <cmath>

int main() {
    const int size = 256;
    double sinTable[size];

    #pragma omp parallel for
    for(int n = 0; n < size; ++n) {
        sinTable[n] = std::sin(2 * M_PI * n / size);
    }
}
This code will automatically parallelize the for loop, which answers your question. OpenMP offers a lot of other possibilities; you can read the documentation to learn more.
If you need to understand compiling for OpenMP support, see this Stack Overflow thread: openmp compilation thread.
Be careful: if you don't use the OpenMP-specific compiler options, the pragmas will simply be ignored and your code will run on a single thread.
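If you want to verify that OpenMP support was actually enabled, a small check like this sketch can help (the _OPENMP macro is defined by the compiler whenever OpenMP support is switched on):
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

int main() {
#ifdef _OPENMP
    // Compiled with OpenMP enabled (e.g. g++ -fopenmp, clang++ -fopenmp, or MSVC /openmp).
    std::printf("OpenMP enabled, max threads = %d\n", omp_get_max_threads());
#else
    // Compiled without OpenMP options: the pragmas were silently ignored.
    std::printf("OpenMP not enabled, running on a single thread\n");
#endif
    return 0;
}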
I hope this helps.

Performance difference using dynamic number of threads

I'm using OpenMP to parallelize some heavy loops, and it works as expected.
Testing showed that this directive gave the best performance:
#pragma omp parallel for num_threads(7)
However, that may differ from machine to machine. Also, I wanted to be able to switch threading on/off using a runtime switch.
Therefore, I figured I could use something like this:
if(shouldThread)
    omp_set_num_threads(optimalNumberOfThreadsForThisMachine);
else
    omp_set_num_threads(1);
On my computer, the optimal number of threads is 7 in this example. Then I use this directive instead:
#pragma omp parallel for
It works well - except that the code compiled with the second directive is about 50% slower. Is this to be expected? I figure the runtime has to do dynamic dispatching and work scheduling, while the compile-time directive can add some sort of optimization, I guess.
The code is compiled with MSVC 2013, on a Core i7-3740.
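For reference, a self-contained sketch of the setup described above might look like this (the function name, flag, thread count, and loop body are placeholders, not taken from the actual code):
#include <omp.h>
#include <vector>

void heavyLoop(std::vector<double>& data, bool shouldThread, int optimalNumberOfThreadsForThisMachine) {
    // Runtime switch: pick the thread count before entering the parallel region.
    if (shouldThread)
        omp_set_num_threads(optimalNumberOfThreadsForThisMachine);
    else
        omp_set_num_threads(1);

    const long n = static_cast<long>(data.size());

    // Same directive as above, but without a hard-coded num_threads clause.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        data[i] = data[i] * 1.5 + 0.5;   // placeholder for the heavy work
    }
}
Note that the num_threads clause itself also accepts a runtime expression, e.g. num_threads(shouldThread ? optimalNumberOfThreadsForThisMachine : 1), so the switch could alternatively be kept on the directive.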

Using -O2 optimization and OpenMP

Is it possible that the -O2 optimization flag re-arranges code, thereby possibly making a multi-threaded application behave in an unintended way?
As an example of what I mean by unintended behavior when code is re-arranged: a variable declared (by the programmer) to be created for each thread is moved outside the #pragma omp parallel, such that only one single copy is created and shared by all threads.
No, this cannot happen. OpenMP would not be very useful if the program broke whenever the compiler unrolled or reordered loops. The OpenMP directives specify the dependencies and side effects of the variables and parallel scopes, and the compiler takes them into account when applying its optimization passes.
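As a small, made-up illustration of that point: a variable declared inside the parallel region (or listed in a private clause) stays per-thread regardless of the optimization level, so compiling with -O2 does not merge it into a single shared copy:
#include <cstdio>
#include <omp.h>

// Build with OpenMP enabled, e.g. g++ -O2 -fopenmp example.cpp
int main() {
    #pragma omp parallel
    {
        // 'local' is declared inside the parallel region, so every thread gets
        // its own copy; optimization passes must preserve this data-sharing
        // behaviour and cannot hoist it into one copy shared by all threads.
        int local = omp_get_thread_num() * 10;
        std::printf("thread %d: local = %d\n", omp_get_thread_num(), local);
    }
    return 0;
}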