OpenMP: sequential processing within a subroutine - Fortran

I would like to carry out calculations in parallel by calling
!$OMP PARALLEL
call sub
!$OMP END PARALLEL
This fires off n (= number of threads) calls to sub, as desired.
Within sub, I am reading from separate data files, with the file and the number of reads determined by the thread number. This works if I enclose everything in sub in !$OMP ORDERED ... !$OMP END ORDERED.
However, this also causes all threads to run strictly sequentially, i.e. thread 0 runs first and all reads in it complete before thread 1 starts, etc.
What I would like to achieve, though, is that all n threads run concurrently and only the processing within a thread is sequential (as the reads are from different data files). Any idea how? Replacing ORDERED by TASK does not help.
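For context, a minimal sketch of the pattern being described, assuming a hypothetical one-file-per-thread naming scheme (data_0.txt, data_1.txt, ...) and passing the thread number to sub as an argument purely for illustration. Statements within a single thread always execute in program order, so each thread's reads are sequential by construction, and no ORDERED region is needed when the threads touch disjoint files:
program per_thread_reads
   use omp_lib
   implicit none
   !$OMP PARALLEL
   call sub(omp_get_thread_num())
   !$OMP END PARALLEL
contains
   subroutine sub(tid)
      integer, intent(in) :: tid
      character(len=32) :: fname
      integer :: u, i, n, val
      ! hypothetical naming scheme: one data file per thread
      write (fname, '(A,I0,A)') 'data_', tid, '.txt'
      open (newunit=u, file=fname, status='old', action='read')
      read (u, *) n            ! number of records in this thread's file
      do i = 1, n              ! these reads run in program order within
         read (u, *) val       ! the thread; no ORDERED construct required
      end do
      close (u)
   end subroutine sub
end program per_thread_reads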


OMP: Is REDUCTION mandatory for an array when a DO loop has no concurrency between cells?

I wrote the following program that works as expected.
program main
   use omp_lib
   implicit none
   integer :: i, j
   integer, dimension(1:9) :: myArray
   myArray = 0
   do j = 0, 3
      !$OMP PARALLEL DO DEFAULT(SHARED)
      do i = 1, 9
         myArray(i) = myArray(i) + i*(10**j)
      end do
      !$OMP END PARALLEL DO
   end do
   print *, myArray
end program main
As only one thread writes to the i-th cell of myArray, REDUCTION on myArray has not been used. But I wonder whether REDUCTION(+:myArray) must be added to the OMP DO, or whether it is useless. In other terms, what is important: the array or the cell of the array?
What is important: the array or the cell of the array?
Cells. !$OMP PARALLEL DO DEFAULT(SHARED) is fine as long as the loop is embarrassingly parallel. When threads operate on the same memory location, that causes a race condition. When a reduction is performed cell-wise, REDUCTION(+:myArray) can be added. That being said, one should note that the full array will likely be replicated into a temporary in each thread before the final reduction of the whole array is done. Since your loop is embarrassingly parallel, REDUCTION(+:myArray) is not needed here.
Besides, note that the number of iterations is too small for multiple threads to make the code faster. In fact, it will be slower because of the time spent creating/joining threads, the time needed to distribute the work amongst threads, and also because of an effect called false sharing (which will nearly serialize the execution of the loop here).
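For reference, a sketch of the same program with the explicit array reduction; it is correct but, as noted above, unnecessary here, since each iteration writes a distinct cell:
program reduction_variant
   implicit none
   integer :: i, j
   integer, dimension(1:9) :: myArray
   myArray = 0
   do j = 0, 3
      ! REDUCTION gives each thread a private, zero-initialized copy of
      ! myArray and sums the copies into the shared array at the end.
      !$OMP PARALLEL DO DEFAULT(SHARED) REDUCTION(+:myArray)
      do i = 1, 9
         myArray(i) = myArray(i) + i*(10**j)
      end do
      !$OMP END PARALLEL DO
   end do
   print *, myArray   ! 1111 2222 3333 ... 9999
end program reduction_variant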

How can I create unique threads to handle (iterate over) portions of a given number using #pragma - OpenMP?

So basically I'm new to Stack Overflow and to the whole OpenMP #pragma-parallel way of writing code... I have a decent idea of how threads work.
I'm trying to make a variable p (that will be incremented within a while loop, NOT a for loop) be incremented in parts (ranges) by individually assigned threads using #pragma directives.
Algorithm Steps
Declare and define all necessary variables, functions and directives.
Declare a parallel block using #pragma omp parallel.
Instruct threads to:
(a) handle a specific range of values belonging to a certain variable or function, by dividing the total iteration size equally and iterating the variable simultaneously, each thread not necessarily aware of the others' values or iteration numbers, just focusing on doing its part;
(b) pass the iterating value to a function that is also being called by other #pragma omp parallel threads.
These functions will return data (bool, string, int, etc.) back to the functions that need it.
total iteration size = 50
How many threads would divide tits equally... yes, tits is short for total_iteration_size... if it offends you then no pun intended; if it doesn't, have a blast. <^_^>
Something like this:
int p = 0, total_iteration_size = 50;
int tits = total_iteration_size;
while (p <= 50)
{
   // Create 5 parallel threads, each iterating p through its own range of 10
   Thread1 = p++... 0-10 then print
   Thread2 = p++... 11-20 then print
   Thread3 = p++... 21-30 then print
   Thread4 = p++... 31-40 then print
   Thread5 = p++... 41-50 then print
   // In theory, Thread1 to Thread5 should iterate the given variable
   // simultaneously in parallel, while submission of the value to the
   // print statement can be handled in any order.
}
The above is the basic algorithm. In the real application, the iteration involves forwarding the iterated value to inline functions that are structured with the same approach, using the exact algorithm that handles parallel implementation of functions, iterations and macros, each thread handling a defined range/number of functions to call.
Scenario 2:
Another use case of this algorithm is...
while(p<=30)
{
// Create 3 parallel threads to call function Foo() 10x each
Thread1 = Foo();... 10 times
Thread2 = Foo();... 10 times
Thread3 = Foo();... 10 times
}
return Foo();
In the above scenario, function Foo() is called 10 times by each thread, making a total of 30 calls.
Scenario 3:
Another is....
while(p<=20)
{
//Create 2 parallel threads to iterate p by 10
Thread1 = p++... 0-10 then print
Thread2 = p++... 11-20 then print
Then:
//Create 2 parallel threads to call function Foo(), using p as a passed parameter - 10x each.
Thread1 = Foo(p);... 10 times
Thread2 = Foo(p);... 10 times
}
return Foo(p);
In this scenario, individual threads call a function 10 times, but this time using variable p as a passed parameter.
Simultaneously here means ranges of operations being carried out by multiple cores in parallel: core1, core2, core3, ... core8, each handling a specific thread.
Sorry, I don't mean to be picky/choosy, but I don't need links or references to external sites or non-code answers... what I need are actual answers written in code.
I hope I was able to describe the design of the algorithm in a way that you guys can assist me with. Thanks in advance.
P.S. Thread1 = Foo(p);... 10 times is just the best way I can describe what I want to achieve. Also, other cpp files, programs and external functions might depend on the result data of these parallel threads.
Reference:
https://tildesites.bowdoin.edu/~ltoma/teaching/cs3225-GIS/fall17/Lectures/openmp.html

OpenMP: How to collect an array from different threads?

I am an OpenMP newbie and I am stuck on a problem! I have an array which is summed in a loop, but I am having problems parallelizing this. Can you suggest how to do it? The main loop is sketched as follows:
REAL A(N) ! N is an integer
!$OMP PARALLEL
DO I = 1, 10 ! just an illustration
   DO J = 1, 10 ! can't post the real code, sorry; that would be illegal
      !$OMP DO
      DO K = 1, 100
         CALL messy_subroutine_that_sums_A(I, J, K, A) ! each thread returns its own sum
         ! each thread returns its own A, right? and it needs to be summed quickly
      END DO
      !$OMP END DO
   END DO
END DO
!$OMP END PARALLEL
SUBROUTINE messy_subroutine_that_sums_A(I, J, K, A)
   REAL A(N) ! an array
   ! basically adds some stuff to the previous value of A
   A = A + junk ! this is really junk
END SUBROUTINE messy_subroutine_that_sums_A
My problem is that all my attempts to collect A from all the threads have failed. Notice that A is summed over the outer loops as well. What is a correct and fast procedure to collect A from all the threads as a sum? Secondly, my question is not just a Fortran question; it applies equally to C and C++. It is a conceptual question.
Actually, OpenMP does support reduction on arrays in C and C++ (declared recently in the OpenMP 4.1 comment draft release), and of course in Fortran.
Not all implementations may support this feature, so you had better first check whether your compiler supports it. To see whether your calculations are correct, you can start by placing A=A+junk in a critical section:
!$omp critical
A=A+junk
!$omp end critical
This should give you the same correct answer in all threads after the OMP DO loop.
Then you can optimize performance by using an array reduction on the OMP DO loop instead of the critical section, again getting the same correct answer in all threads after the loop.
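For instance, a minimal sketch of this second step under the question's loop structure; the reduction clause gives each thread a private, zero-initialized copy of A for the K loop and combines the copies into the shared A at the end of the construct, so the accumulation across the outer I and J loops is preserved:
!$OMP PARALLEL
DO I = 1, 10
   DO J = 1, 10
      !$OMP DO REDUCTION(+:A)
      DO K = 1, 100
         CALL messy_subroutine_that_sums_A(I, J, K, A)
      END DO
      !$OMP END DO
   END DO
END DO
!$OMP END PARALLEL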
Then you can further optimize performance by moving the reduction to the OMP PARALLEL, but in that case you won't be able to check the values in all threads, because each thread works with a private copy of the array A and thus holds different partial values. The final correct answer will be available only in the master thread, after the OMP PARALLEL.
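A sketch of this third variant: each thread now accumulates into its private copy of A for the whole region, and the copies are combined with the original array only at END PARALLEL:
!$OMP PARALLEL REDUCTION(+:A)
DO I = 1, 10
   DO J = 1, 10
      !$OMP DO
      DO K = 1, 100
         CALL messy_subroutine_that_sums_A(I, J, K, A)
      END DO
      !$OMP END DO
   END DO
END DO
!$OMP END PARALLEL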
Note that you don't need to declare loop iteration variables private in Fortran; they are automatically made private.

Fortran OpenMP: creating threads only once

I am working in Fortran. My purpose is to parallelize a program of this kind with OpenMP:
do t=1, t_fin (sequential)
a sequence of do loops (not sequential)
end do
where the cycle is sequential (temporal). I have tried creating the threads at the beginning of each iteration of the sequential cycle, but this makes the code slower than it could be. Therefore, my question is whether it is possible to create the threads only once, before starting the cycle. Practically:
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP CRITICAL
do t=1, t_fin
sequence of OMP do
end do
!$OMP END CRITICAL
!$OMP END PARALLEL
I have tried it this way, but it works as if there were only one thread. I suppose this is because the enclosing critical section contains the omp do. However, I would like the internal omp do to be executed by more than one thread. Is there a way to obtain this?
If I understand your question correctly, you want to avoid creating threads in each iteration of the outer loop. This can be achieved by moving the OMP PARALLEL directive outside the loop and leaving the other directives inside it. Then I see two parallelization schemes.
You could either parallelize the inner loops:
!$OMP PARALLEL
do t=1, t_fin (sequential)
!$OMP DO
first loop
!$OMP END DO
!$OMP DO
second loop
!$OMP END DO
!...
end do
!$OMP END PARALLEL
or, use sections to run the loops in parallel (if they are independent of each other):
!$OMP PARALLEL
do t=1, t_fin (sequential)
!$OMP SECTIONS
!$OMP SECTION
first loop
!$OMP SECTION
second loop
!...
!$OMP END SECTIONS
end do
!$OMP END PARALLEL
In both versions, the threads are created outside the loop.
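For concreteness, a minimal compilable sketch of the first scheme, with hypothetical array updates on a and b standing in for the real loops (the loop variable t of the sequential loop is predetermined private inside the parallel region; it is listed explicitly here for clarity):
program persistent_threads
   implicit none
   integer, parameter :: n = 1000, t_fin = 100
   real :: a(n), b(n)
   integer :: t, i
   a = 0.0
   b = 0.0
   ! Threads are created once here; the sequential time loop runs
   ! inside the parallel region and is executed by every thread.
   !$OMP PARALLEL PRIVATE(t)
   do t = 1, t_fin
      !$OMP DO
      do i = 1, n
         a(i) = a(i) + 1.0
      end do
      !$OMP END DO
      ! The implicit barrier at END DO keeps the time steps in order.
      !$OMP DO
      do i = 1, n
         b(i) = b(i) + a(i)
      end do
      !$OMP END DO
   end do
   !$OMP END PARALLEL
   print *, a(1), b(1)   ! expect 100.0 and 5050.0
end program persistent_threads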

OpenMP construct to continue execution as soon as at least 1 thread is finished

I need to continue execution as soon as one of the threads has finished executing. The logic inside the parallel section will ensure that everything has been completed satisfactorily. I have nested parallelisation, so I put some of the top-level threads to sleep when data is not ready to be processed, so as not to consume computing power. So when one of the top-level threads finishes, I want to continue execution and not wait for the other threads to wake up and return naturally.
I use
#pragma omp parallel for num_threads(wanted_thread_no)
How do you parallelise? Do you use tasks, sections, or something else?
If I understood correctly, and if you are using the task primitive, you can use a nowait clause after the last task.
Check this PDF, on page 13 (of the PDF):
http://openmp.org/wp/presos/omp-in-action-SC05.pdf
It explicitly says:
By default, there is a barrier at the end of the “omp for”. Use the
“nowait” clause to turn off the barrier.
#pragma omp for nowait
"nowait" is useful between two consecutive, independent omp for loops.
Is this what you want?
Also take a look at this as well, even though it says the same thing:
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf