With an OpenACC loop, does each thread get private copies of scalars? - fortran

I have a pretty simple code fragment:
$acc data copy(a(:),b(:))
$acc kernels
$acc loop vector
do i=1,1000
x = a(i)
b(i) = sqrt(x)
enddo
$acc end kernels
$acc end data
And of course, I could dispense with x easily, but this is an example and x is the point of my question, which is: Does every thread here get its own copy of x automatically, or should I declare it private to keep the various threads from clobbering it?

In OpenACC, scalars are firstprivate by default so typically there's no need to put them in a "private" clause. The only times you really need to use the "private" clause is for arrays or when a scalar "escapes" the compute region, such as being passed by reference to a device routine or it's value is used outside of the compute region.

Related

Long cuMemToHostAlloc call after exiting a kernel with copyout

I am accelerating a Fortran code with OpenACC. When I profile the program with NVIDIA Nsight, I noticed the first call of a kernel with a copyout clause exhibited a long call to cuMemToHostAlloc.
Here is a trivial example illustrating this. The program launches successively 10 times a kernel that computes an array test and returns its value:
program test
implicit none
real, allocatable :: test(:)
integer :: i, j, n, m
n = 1000
m = 10
allocate(test(n))
do j = 1, m
!$acc kernels copyout(test)
!$acc loop independent
do i = 1, n
test(i) = real(i)
end do
!$acc end kernels
end do
deallocate(test)
end program test
The code is compiled with NVHPC 22.7, using no optimization flag (adding such flags did not have any influence).
The profiling of the code gives:
Compared to the actual memory transfer time, as seen for the 9 other calls, the call to cuMemToHostAlloc is ridiculously long.
If I remove the copyout clause, the call to cuMemToHostAlloc disappears, so this is related to copying back data from the device, but I do not understand why it only happens once and for so long.
Also, the test array is already allocated on the host memory.
Am I missing something?
It's the call to create the pinned memory buffers used to transfer the data between the host and device. DMA transfer must use non-swappable, i.e. pinned, memory.
We use a double buffering system where as one buffer is being filled with the virtual memory, the second buffer is transferred asynchronously to the device. Effectively hiding much of the virtual to pinned memory copy.
The host pinned memory allocation is relatively expensive but only occurs once when the runtime first encounters a data region so the cost will be amortized.
Note by removing the copyout, you're removing the need to transfer the data and hence no need for the buffers.

OMP : Is REDUCTION mandatory for an array when in a DO loop there is no concurrency between cells?

I wrote the following program that works as expected.
program main
use omp_lib
implicit none
integer :: i,j
integer, dimension(1:9) :: myArray
myArray=0
do j=0,3
!$OMP PARALLEL DO DEFAULT(SHARED)
do i=1,9
myArray(i) = myArray(i) + i*(10**j)
end do
!$OMP END PARALLEL DO
enddo
print *, myArray
end program main
As only one thread writes to the i th cell of myArray, REDUCTION on myArray has not been used. But, I wonder if REDUCTION (+ : myArray) must be added in the OMP DO or if it is unuseful ? In others terms, what is important : the array or the cell of the array ?
What is important : the array or the cell of the array ?
Cells. !$OMP PARALLEL DO DEFAULT(SHARED) is fine as long as the loop is embarrassingly parallel. When Threads operate on the same location in memory, this cause a race condition. When a reduction is performed cell-wise, then REDUCTION (+ : myArray) can be added. That being said, once should note that the full array will likely be replicated in each threads temporary before doing the final reduction of the whole array. Since your loop is embarrassingly parallel, REDUCTION (+ : myArray) is not needed here.
Besides, note that the number of iterations is too small for multiple threads to make the code faster. In fact, it will be slower because of the time spent to create/join threads, the time to distribute the work amongst threads and also because of an effect called false-sharing (which will nearly serialize the execution of the loop here).

How to collect data for each thread OpenMP

I'm new to OpenMP and try to sort out the issue of collecting data from threads. I study the example of applying OpenMP on Monte-Carlo method (square of a circle inscribed into a square).
I understood how the following code works:
unsigned pointsInside = 0;
#pragma omp parallel for num_threads(threadNum) shared(threadNum) reduction(+: pointsInside)
for (unsigned i = 0; i < threadNum; i++) { ... }
Am I right that originally pointsInsideis a variable but OpenMP represents it as an array and than the mantra reduction(+: pointsInside) sums over the elements of the "array"?
But the main question is how to collect information directly into an array or vector? I tried to declare array or vector and provide pointer or address into OpenMP via shared and collect information for each thread at corresponding index. But it works slower than the way with the variable and reduction. Such the approach with vector or array is needed for me for my current project. Thanks a lot!
UPD:
When I said above that "it works slower" I meant comparison of two realizations of the Monte-Carlo method: 1) via shared and a vector/array, and 2) via a scalar variable and reduction. The first case is faster. My guess and question about it below.
I would like to rephrase my question more clear. I create a vector/array and provide it into OpenMP via shared. I want to collect data for each thread at corresponding index in vector/array. Under this approach I don't need any synchronization of access to the vector/array. Is it true that OpenMP enable synchronization by default when I use shared. If it is so, then how to disable it. Or may be another approaches exist. If it is not so, then how to share vector/array into the parallel part correctly and without synchronization of access.
I'd like to apply this technique for my project where I want to sort through different permutations in parallel part, collect each permutation and scalar result for it outside of the parallel part. Then sort the results and choose the best one.
A partial answer:
Am I right that originally pointsInside is a variable but OpenMP represents it as an array and than the mantra reduction(+: pointsInside) sums over the elements of the "array"?
I think it is better to think of pointsInside as a scalar. When the parallel region starts the run-time takes care of creating individual scalars, perhaps you might think of them as myPointsInside, one such scalar for each thread. When the parallel region finishes the run-time reduces the values of all the thread scalars onto the original scalar pointsInside. This is just about what OpenMP actually does behind the scenes.
As to the rest of your question:
Yes, you can perform reductions onto arrays - but this was only added to OpenMP, for C and C++ programs, in OpenMP 4.5 (I think). What goes on is much the same as for the scalar case. This Q&A provides some assistance - Is it possible to do a reduction on an array with openmp?
As to the speed, it's difficult to answer that without a much clearer understanding of what comparisons you are making. But it's very easy to write parallel reductions on arrays which incur a significant penalty in performance from the phenomenon of false sharing, about which you may wish to inform yourself.

Not able to achieve desired speed up using openmp

I am trying to use openmp directive to parallelize a piece of code but not being able to achieve any speed up. Folowing is the piece of code that I am trying to parllelize:
#pragma omp parallel private(i,j) shared(a,x,n) default(none)
{
for(j=n-1;j>=0;j--)
{
x[j] = A(j,n,n)/A(j,j,n);
#pragma omp for schedule(dynamic)
for (i=0;i<=j-1;i++)
{
A(i,n,n )= A(i,n,n) - A(i,j,n)*x[j];
}
}
}
The value of n is 1000. The A(i,n,n) is defined macro which is used to access to array a.
As I increase the number of threads the execution time increases or it remains the same. The machine I am working on has 4 cores. I am suprised that that there is no speed up even when the number of threads is 2.
I am not able to figure what am I doing wrong?
Since n>>#CPUs (I don't think you have 1000 CPUs), it is not wise to parallelize the inner loop. In your example, you redistribute the work at each iteration.
Instead, it is wiser to parallelize the outer loop. This way, the value of x[j] won't be updated concurrently by different threads (as Zulan mentioned), and you will have much less work re-distribution.
In that case, using dynamic scheduling is wise since the quantity of work change at each iteration.
Note: You will have to change the order of the calculation, the current implementation does not allow you to move the parallelization to the outer loop since all of the threads will update the same value (A[i][n][n]).
Although it is true that threads creating take time, the threads are not recreated at each iteration. They are only created once on the top #pargma statement and running concurrently for the entire following clause.

Openmp: How to collect an array from different threads?

I am a OpenMP newbie and I am stuck at a problem! I have an array which is summed in a loop but I am having problems parallelizing this. Can you suggest how to do this? The main loop is sketched as follows:
REAL A (N) ! N is an integer
!$OMP PARALLEL
DO I=1,10 ! just an illustration
DO J=1,10 ! can't post real code sorry, illegal
!$OMP DO
DO K=1,100
call messy_subroutine_that_sums_A(I,J, K, A) ! each thread returns its own sum
! each thread returns its own A right? and needs to summed quickly
END DO
!$OMP END DO
END DO
END DO
SUBROUTINE messy_subroutine_that_sums_A(I,J, K, A)
REAL A(N) ! an array
! basically adds some stuff to the previous value of A
A=A+junk ! this is really junk
END SUBROUTINE messy_subroutine_that_sums_A
my problem is that all my attempts to collect A from all the threads have failed. If you notice A is summed over outer loops as well. What is the correct and a fast procedure to collect A from all the arrays as a sum. Secondly, my question is not just a Fortran question, it applies equally to C and C++. It is a conceptual question.
Actually OpenMP does support reduction on arrays in C, C++ (declared recently in OpenMP 4.1 comment draft release), and of cause in Fortran.
Not all implementations may support this feature, so you would better first check if you compiler supports it. To see if you have correct calculations you can start with placing A=A+junk into critical section:
!$omp critical
A=A+junk
!$omp end critical
This should give you same correct answer in all threads after the OMP DO loop.
Then you can optimize performance using array reduction instead of critical on OMP DO loop, again having same correct answer in all threads after the loop.
Then you can further optimize performance moving the reduction to OMP PARALLEL, but you won't be able to check values in all threads in this case, because all threads will work with private copies of the array A, and thus have different partial values. Final correct answer will be available in master thread only, after the OMP PARALLEL.
Note, you don't need to declare loop iteration variables private in Fortran, as they should be automatically made private.