Matmul in OpenACC Fortran loop - fortran

Accelerating a Fortran code with OpenACC using the PGI compiler, I got problems with a matmul call in an accelerated loop.
In the simplified example, I apply the identity matrix on two vectors, so the input and the output values should be the same:
program test
implicit none
integer :: a(3, 3)
integer :: v1(3, 2), v2(3, 2)
integer :: i
a = reshape([1, 0, 0, 0, 1, 0, 0, 0, 1], [3, 3])
v1 = reshape([1, 2, 3, 4, 5, 6], [3, 2])
print *, v1
!$acc kernels copyin(a, v1) copyout(v2)
!$acc loop independent
do i = 1, 2
v2(:, i) = matmul(a, v1(:, i))
enddo
!$acc end kernels
print *, v2
endprogram
When compiling with the PGI compiler version 20.9, I got these information:
test:
12, Generating copyin(a(:,:),v1(:,:)) [if not already present]
Generating implicit copyout(z_a_0(:)) [if not already present]
Generating copyout(v2(:,:)) [if not already present]
14, Loop is parallelizable
Generating Tesla code
14, !$acc loop gang ! blockidx%x
15, !$acc loop vector(32) ! threadidx%x
15, Loop is parallelizable
Running the code gives the following values:
1 2 3 4 5 6
4 5 6 4 5 6
the second line should be like the first one, which is the case on sequential execution. What is wrong in the code?

Looks to be a compiler issue. The problem line being:
Generating implicit copyout(z_a_0(:))
"z_a_0" is compiler temp array being created to hold the intermediary result from the call to matmul. It's declaration is being hoisted out of the loop and then copied back in as shared array. Since it's shared, it then causes a race condition.
I've submitted a problem report (TPR #29482) and sent it to our engineers for further investigation.

#Mat Colgrove explained the reason of the incorrect behavior. The workaround I found was to write the matrix vector multiplication explicitly:
program test
implicit none
integer :: a(3, 3)
integer :: v1(3, 2), v2(3, 2)
integer :: i, j, k
a = reshape([1, 0, 0, 0, 1, 0, 0, 0, 1], [3, 3])
v1 = reshape([1, 2, 3, 4, 5, 6], [3, 2])
print *, v1
!$acc kernels copyin(a, v1) copyout(v2)
!$acc loop independent
do i = 1, 2
!$acc loop seq
do k = 1, 3
v2(k, i) = 0
!$acc loop seq
do j = 1, 3
v2(k, i) = v2(k, i) + a(j, k) * v1(j, i)
enddo
enddo
enddo
!$acc end kernels
print *, v2
endprogram

Related

Search for odd indices using the section method

I don't understand how I can implement the search for even elements using the section method.
I did a search for odd elements, but I need to find even ones
I need it to output 2 4 6 8, but my program outputs 1 3 5 7 9
program main
implicit none
integer, DIMENSION(3, 3) :: array = reshape((/ 1, 2, 3, 4, 5, 6, 7, 8, 9 /), shape(array))
integer :: i = 0
integer, allocatable :: B(:)
B = [(Array(::2,i), i=1, 3)]
print *, B
end program main
If you're interested in the elements with odd indices, you want
B = [(Array(modulo(i,2)+1::2, i), i=1, 3)]
modulo(i,2)+1 is 2 when i is odd, and 1 when i is even. This means that for columns with odd i you select every other element starting at the second element, and for columns with even i you select every other element starting at the first element.
If instead you're interested in selecting the odd values from an arbitrary array, you can't do this with a simple slice, and you instead need a conditional filter. For example,
B = [integer::]
do i=1,3
do j=1,3
if (modulo(Array(j,i),2)==0) then
B = [B, Array(j,i)]
endif
enddo
enddo

MPI_scatterv return MPI_ERROR_TRUNCATE in fortran

What s wrong with the following code? It works fine when I run it with 1,2 and 5 cpus,
while with 3 cpus, it breaks at (1), and with 4 cpus, it breaks at (2). In both the cases the error reads:
An error occurred in MPI_Scatterv
reported by process [3248685057,3]
on communicator MPI_COMM_WORLD
MPI_ERR_TRUNCATE: message truncated.
Probably the problem is correlated with memory allocation, but I cannot understand it deeply...(I am not totally confident about how I allocated the distributed variables (3)) Some suggestion is warmly accepted
program import_and_divide
implicit none
include 'mpif.h'
integer :: i, k, io, n ,nnz
integer,allocatable, dimension(:) :: Ai, Aj
real*8, allocatable, dimension(:) :: Aa, Ab, x
integer*4 :: rank, mpi_stat, size
integer, allocatable, dimension(:) :: sendcounts, displ, sendcounts1, displ1
integer :: n_distro
integer, allocatable, dimension(:) :: ia, ja
real*8, allocatable, dimension(:) :: a, b
n = 5
nnz = 13
!.. Initialize MPI.
call MPI_INIT(mpi_stat)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, mpi_stat)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, mpi_stat)
if (rank.eq.0) then
allocate(Ai(n+1), Aj(nnz), Aa(nnz), Ab(n), x(n))
Ai = (/0, 3, 5, 8, 11, 13/)
Aj = (/0, 1, 3, 0, 1, 2, 3, 4, 0, 2, 3, 1, 4/)
Aa = (/1, -1, -3, -2 , 5, 4, 6, 4, -4, 2, 7, 8, -5/)
Ab(:) = ([1, 2, 3, 4, 5])
x(:) = 0.d0
print*, 'Number of cpus: ', size
print*, 'n, nnz: ', n, nnz
endif
call MPI_BARRIER(MPI_COMM_WORLD, mpi_stat)
allocate(sendcounts(size), displ(size), sendcounts1(size), displ1(size))
n_distro=(n+1)/size
k = 0
do i=1,size
if (i<size) then
sendcounts(i) = n_distro
else
sendcounts(i) = (n+0)-(size-1)*n_distro
endif
displ(i) = k
k = k + sendcounts(i)
end do
if (rank.eq.0) then
displ1 = Ai(displ+1)
do i=1,size-1
sendcounts1(i) = displ1(i+1)-displ1(i)
end do
sendcounts1(size) = nnz-displ1(size)
endif
call MPI_BARRIER(MPI_COMM_WORLD,mpi_stat)
if (rank.eq.0) then
print*, 'Sendcounts: ', sendcounts
print*, 'Displ: ', displ
endif
call MPI_BCAST(displ1,shape(displ1), MPI_INT,0,MPI_COMM_WORLD,mpi_stat)
call MPI_BCAST(sendcounts1, shape(sendcounts1), MPI_INT,0,MPI_COMM_WORLD,mpi_stat)
do i=0,size-1
if (i.eq.rank) then
allocate(b(sendcounts(i+1)), ia(sendcounts(i+1)+1) ,&
a(sendcounts1(i+1)), ja(sendcounts1(i+1))) !(3)
ia(:) = rank
ja(:) = 0
b(:) = 0.d0
a(:) = 0.d0
end if
call MPI_BARRIER(MPI_COMM_WORLD,mpi_stat)
end do
call MPI_scatterv(Ab, sendcounts, displ, MPI_DOUBLE,&
b, sendcounts, MPI_DOUBLE,0, MPI_COMM_WORLD,mpi_stat) ! (2) breaks here with mpirun -np 4
call MPI_scatterv(Ai, sendcounts+1, displ, MPI_INT,&
ia, sendcounts+1, MPI_INT,0, MPI_COMM_WORLD,mpi_stat)
call MPI_scatterv(Aa, sendcounts1, displ1, MPI_DOUBLE,&
a, sendcounts1, MPI_DOUBLE,0, MPI_COMM_WORLD,mpi_stat)
call MPI_scatterv(Aj, sendcounts1, displ1, MPI_INT,&
ja, sendcounts1, MPI_INT,0, MPI_COMM_WORLD,mpi_stat) ! (1) breaks here with mpirun -np 3
call MPI_FINALIZE(mpi_stat)
end
When I replace include "mpif.h" with use mpi and I got the following errors:
There is no specific subroutine for the generic ‘mpi_bcast’
There is no specific subroutine for the generic ‘mpi_scatterv’.
I also add -fcheck=all and this option provides this additional info (when compiled with mpif.h):
Allocatable actual argument 'ab' is not allocated
but this do not clarify my idea.
If the arrays Ai, Aj, Ab, Aa are allocated on all the processors, I got the same behaviour (works with 1,2 and 5 cpus and breaks with 3 and 4 cpus) and disappear the message Allocatable actual argument 'ab' is not allocated
If I replace sendcounts and sendcounts1 in the second half of MPI_scatterv with sendcounts(rank+1) and sendcounts1(rank+1), the code seems to work fine only if I remove the compiler flag -fcheck=all/-fsanitize=address(if the arrays Ai,Aj,Aa,Ab are allocated on all the processes as well as if allocated only on one process (rank=0)). With one of these two options I got the following error:
==25041==ERROR: LeakSanitizer: detected memory leaks

Extract the minor matrix from a 3x3 based on input i,j

For a given 3x3 matrix, for example:
A = [3 1 -4 ; 2 5 6 ; 1 4 8]
If I need the minor matrix for entry (1,2)
Minor = [2 6 ; 1 8]
I already wrote a program to read in the matrix from a text file, and I am supposed to write a subroutine to extract the minor matrix from the main matrix A based on the user inputs for i,j. I am very new to Fortran and have no clue how to do that. I made some very desperate attempts but I am sure there is a cleaner way to do that.
I got so desperate I wrote 9 if functions for each possible combination of i and j but that clearly is not a smart way for doing this. Any help is appreciated!
One way to do this is, as #HighPerformanceMark said in the comment, with vector subscripts. You can declare an array with the rows you want to keep, and the same for columns, and pass them as indices to your matrix. Like this:
function minor(matrix, i, j)
integer, intent(in) :: matrix(:,:), i, j
integer :: minor(size(matrix, 1) - 1, size(matrix, 2) - 1)
integer :: rows(size(matrix, 1) - 1), cols(size(matrix, 2) - 1), k
rows = [(k, k = 1, i - 1), (k, k = i + 1, size(rows))]
cols = [(k, k = 1, j - 1), (k, k = j + 1, size(cols))]
minor = matrix(rows, cols)
end
(I didn't test it yet, so tell me if there is any error)
Another option would be constructing a new matrix from 4 assignments, one for each quadrant of the result (limited by the excluded row/column).
I like the first option more because it is more scalable. You could easily extend the function to remove multiple rows/columns by passing arrays as arguments, or adapt it to work on higher dimensions.
You can use an ac-implied-do and RESHAPE to construct a mask of the parts of the matrix you want to preserve and then zap the rest with pack and reassemble with RESHAPE.
program minor
implicit none
integer A(3,3)
integer, allocatable :: B(:,:)
character(20) fmt
integer i, j
A = reshape([ &
3, 1, -4, &
2, 5, 6, &
1, 4, 8], &
shape(A), order = [2,1])
write(fmt,'(*(g0))') '(a/',size(A,2),'(i3))'
write(*,fmt) 'A =',transpose(A)
B = reshape(pack(A,reshape([((all([i,j]/=[1,2]),i=1,size(A,1)), &
j=1,size(A,2))],shape(A))),shape(A)-1)
write(fmt,'(*(g0))') '(a/',size(B,2),'(i3))'
write(*,fmt) 'B =',transpose(B)
end program minor
Output:
A =
3 1 -4
2 5 6
1 4 8
B =
2 6
1 8

GDB and Fortran modules

I have the following Fortran 95 code:
MODULE ISSUE
IMPLICIT NONE
CONTAINS
SUBROUTINE PROBLEM(A)
IMPLICIT NONE
DOUBLE PRECISION, DIMENSION(:,:), INTENT(INOUT) :: A
INTEGER :: i, n
n = SIZE(A, 2)
DO i = 1, n
PRINT *, A(i, 1:n)
ENDDO
END SUBROUTINE PROBLEM
END MODULE ISSUE
PROGRAM TEST
USE ISSUE
IMPLICIT NONE
DOUBLE PRECISION, DIMENSION(5, 5) :: A
A = TRANSPOSE(RESHAPE((/ 1, 2, 3, 4, 5, &
6, 7, 8, 9, 0, &
1, 2, 3, 4, 5, &
6, 7, 8, 9, 0, &
1, 2, 3, 4, 5/), SHAPE(A)))
CALL PROBLEM(A)
END PROGRAM TEST
And now I compile it using
gfortran -g -O0 problem.f95 -o problem
Then I run the program using GDB 7.7.1 and I set a breakpoint to line 11, (DO i = 1, n). After that, I print the first element of the matrix A. The output that I get is:
(gdb) break 11
Breakpoint 1 at 0x4008c5: file problem.f95, line 11.
(gdb) run
Starting program: /path_to_problem_folder/problem
Breakpoint 1, issue::problem (a=...) at problem.f95:11
warning: Source file is more recent than executable.
11 DO i = 1, n
(gdb) p A(1, 1)
$1 = 6.9533558074105031e-310
I expect the output 1, but 6.9533558074105031e-310 is printed by gdb. Note that the subroutine PROBLEM prints all values correctly.
Why this happens and how can I print the value A(1, 1) correctly using GDB?
I updated my GDB version from 7.7.1 to 7.11.1 and the problem disappeared.

How to index an array with a mask in Fortran [duplicate]

I would like to do something like this in Fortran:
program where
real :: a(6) = (/ 4, 5, 6, 7, 8, 9 /)
print *, a(a>7)
end program
In Python I would typically do this with NumPy like this:
import numpy
a = numpy.array([ 4, 5, 6, 7, 8, 9])
print a[numpy.where(a>7)]
#or
print a[a>7]
I've played around, but nothing has worked thus far, but I'm guessing it is fairly simple.
I'll extend slightly the answer by #VladimirF as I suspect you don't want to limit yourself to the exact print example.
a>7 returns a logical array corresponding to a with .true. at index where the condition is met, .false. otherwise. The pack intrinsic takes such a mask and returns an array with those elements with .true. in the mask.
However, you can do other things with the mask which may fit under your numpy.where desire. For example, there is the where construct (and where statement) and the merge intrinsic. Further you can use pack again with the mask to get the indices and do more involved manipulations.
program where
real :: a(6) = (/ 4, 5, 6, 7, 8, 9 /)
print *, pack(a,a>7)
end program
You can find a related topic here: Better way to mask a Fortran array?
I think both where and merge can do the task.
In python, where has the ability to assign different value according to the mask, for example
a = np.array([4, 5, 6, 7, 8, 9])
b = np.where(a>7, 1, -1)
b will be array([-1, -1, -1, -1, 1, 1])
In Fortran, the equivalent of this is merge
real :: a(6) = (/ 4, 5, 6, 7, 8, 9 /)
real, allocatable :: b(:)
b = merge(1,-1,a>7)
print*, b
end
The MERGE function chooses alternative values based on the value of a mask. http://www.lahey.com/docs/lfpro78help/F95ARMERGEFn.htm
where can also do this, but it is slightly more complicated.
real :: a(6) = (/ 4, 5, 6, 7, 8, 9 /)
real, allocatable :: b(:)
b = a
where (a>7)
b = 1
else where
b = -1
end where
print*, b
end
a short version is this
b = a
b = -1
where (a>7) b = 1
You can find more information of where here: http://www.personal.psu.edu/jhm/f90/statements/where.html