matrix multiplication in fortran using mpi and mkl - fortran

The problem I am having is mainly related to MPI. I have developed code to multiply two matrices using MPI and MKL in Fortran. It gives the correct output, but the issue is that if I increase the number of processors used for the calculation, the time taken to calculate increases as well: it takes 36 seconds to multiply the matrices using 4 processors but 58 seconds with 5 processors. I have also found that the process with rank 1 takes more time than the others, in spite of the workload being almost the same for every process, i.e. if NP=4 then rank 1 takes about the same amount of time as the others (Screenshot), but for NP=5, rank 1 takes more time than the others (Screenshot).
use mkl_service
include 'mpif.h'
! variables
!------mpi initialization---!
!----someoperation---!
if(rank.eq.root)then
   ! some task then
   do i=1,nworkers-1
      call mpi_send(nobsvld, 1, mpi_int, i, 3, mpi_comm_world, ierr)
   end do
   ! multiplication routine called
   call matrixMultiply(dble(p), dble(htra), phtra, nsea, nsea, nobsvld, nworkers, mpi_comm_world, rank, mpi_int, mpi_real8)
   call matrixMultiply(dble(hvld), phtra, hphtra, nobsvld, nsea, nobsvld, nworkers, mpi_comm_world, rank, mpi_int, mpi_real8)
   ! some task then
   call matrixMultiply(gmm, inovvld, gain, nobsvld, nobsvld, 1, nworkers, mpi_comm_world, rank, mpi_int, mpi_real8)
   call matrixMultiply(phtra, gain, gainres, nsea, nobsvld, 1, nworkers, mpi_comm_world, rank, mpi_int, mpi_real8)
   ! some task then
else
   !---Worker---!
   call mpi_recv(nobsvld, 1, mpi_int, 0, 3, mpi_comm_world, stats, ierr)
   call recv_cal_send(nsea, nsea, nobsvld, mpi_comm_world, mpi_int, mpi_real8, rank)
   call recv_cal_send(nobsvld, nsea, nobsvld, mpi_comm_world, mpi_int, mpi_real8, rank)
   call recv_cal_send(nobsvld, nobsvld, 1, mpi_comm_world, mpi_int, mpi_real8, rank)
   call recv_cal_send(nsea, nobsvld, 1, mpi_comm_world, mpi_int, mpi_real8, rank)
end if
call mpi_finalize(ierr)
end program
!-------------------------------------routines-----------------------------------!
!-------------------routine for root to send as well as calculate some of the portion and receive from workers--------------------!
subroutine matrixMultiply(A, B, C, ax, ay, by, n, comm, rank, dt_int, dt_real)
   integer :: ax, ay, by, rows, averows, extra, e_row, offset, source, ierr
   integer, dimension(5) :: stats !mpi_status_size==5
   real*8 :: A(ax,ay), B(ay,by), C(ax,by)
   real*8, allocatable :: A_buff(:,:), C_buff(:,:)
   !-------------------send portions to workers-------------------!
   offset=1
   averows=ax/n
   extra=modulo(ax,n)
   if(extra.gt.0)then
      offset=offset+averows+1
   else
      offset=offset+averows
   end if
   do i=1,n-1
      if(extra.gt.i)then
         rows=averows+1
      else
         rows=averows
      end if
      e_row = offset+rows-1
      call mpi_send(rows, 1, dt_int, i, i, comm, ierr)
      call mpi_send(offset, 1, dt_int, i, i, comm, ierr)
      call mpi_send(B, ay*by, dt_real, i, i, comm, ierr)
      !call cpu_time(s1)
      allocate(A_buff(rows,ay))
      A_buff(1:rows,1:ay)=A(offset:e_row,1:ay)
      call mpi_send(A_buff, rows*ay, dt_real, i, i, comm, ierr)
      !call cpu_time(s2)
      !print*,s2-s1," to send to",i
      offset=offset+rows
      deallocate(A_buff)
   end do
   !-------------------calculate the portion of itself-------------------!
   if(extra.gt.0)then
      rows=averows+1
   else
      rows=averows
   end if
   allocate(A_buff(rows, ay))
   A_buff(1:rows,1:ay)=A(1:rows,1:ay)
   allocate(C_buff(rows,by))
   call cpu_time(c1)
   call dgemm('N','N', rows, by, ay, 1.d0, A_buff, rows, B, ay, 0.d0, C_buff, rows)
   call cpu_time(c2)
   print*,c2-c1," for calculation by root"
   deallocate(A_buff)
   C(1:rows,1:by)=C_buff(1:rows,1:by)
   deallocate(C_buff)
   !-------------------receive calculated portions from workers-------------------!
   do i=1,n-1
      source=i
      call mpi_recv(rows, 1, dt_int, source, source, comm, stats, ierr)
      call mpi_recv(offset, 1, dt_int, source, source, comm, stats, ierr)
      e_row = offset+rows-1
      allocate(C_buff(rows,by))
      call mpi_recv(C_buff, rows*by, dt_real, source, source, comm, stats, ierr)
      C(offset:e_row,1:by)=C_buff(1:rows,1:by)
      deallocate(C_buff)
   end do
end subroutine
!---------routine to receive portion from root to calculate and send back----------!
subroutine recv_cal_send(ax, ay, by, comm, dt_int, dt_real, rank)
   integer :: comm, rank, dt_int, dt_real, ierr, offset
   integer :: source, ax, ay, bx, by, cy, rows
   integer, dimension(5) :: stats !mpi_status_size==5
   real*8, dimension(ay,by) :: B
   real*8, allocatable :: A_buffer(:,:)
   real*8, allocatable :: buff(:,:)
   bx=ay
   cy=by
   source = 0
   call mpi_recv(rows, 1, dt_int, source, rank, comm, stats, ierr)
   call mpi_recv(offset, 1, dt_int, source, rank, comm, stats, ierr)
   call mpi_recv(B, bx*by, dt_real, source, rank, comm, stats, ierr)
   allocate(A_buffer(rows,ay))
   call mpi_recv(A_buffer, rows*ay, dt_real, source, rank, comm, stats, ierr)
   allocate(buff(rows,cy))
   call cpu_time(c1)
   call dgemm('N','N', rows, by, bx, 1.d0, A_buffer, rows, B, bx, 0.d0, buff, rows)
   call cpu_time(c2)
   print*,c2-c1," for calculation by",rank
   deallocate(A_buffer)
   !call cpu_time(s1)
   call mpi_send(rows, 1, dt_int, 0, rank, comm, ierr)
   call mpi_send(offset, 1, dt_int, 0, rank, comm, ierr)
   call mpi_send(buff, rows*cy, dt_real, 0, rank, comm, ierr)
   !call cpu_time(s2)
   !print*,s2-s1," cal sent by",rank
   deallocate(buff)
end subroutine

(Note: lacking a complete example program I'm having to make a few assumptions about what is going on here - should a full example be provided I will modify this as appropriate. Note also that there may be an issue with the number of real cores you have in your system, but as I understand it your solution has real problems that are independent of this.)
I am afraid that the way you have written this is inherently non-scalable, and it's non-scalable in two different ways; it is not scalable in time and, worse to my mind, it is not scalable in memory. The former is the cause behind the poor performance you are observing, the latter means that using more processes will not allow you to solve bigger problems.
To try to understand what I mean by the former, let's make a little model of the time taken by the solution above when running on P processes for NxN matrices. The time taken for the compute that performs the matrix multiplication will be proportional to N**3/P, which is a nicely decreasing function of P - good. But you also have to pass messages, and in particular you seem to send all of B from rank 0 to all the other processes. The time for this will, as the code is written, be proportional to P*N**2 (ignoring latency), which increases with P - the more processes you use, the slower the communication becomes. Thus the total time is the sum of the (decreasing) compute time and the (increasing) communication time, and for sufficiently large P this tends to infinity, not zero; this is what I mean by inherently non-scalable in time. The problem in your case is that the compute is quite quick, so at least for the N you are using it only takes quite a modest P before the communication time dominates, and the time to solution then increases approximately linearly with the number of processes.
You can marginally improve this by using mpi_bcast (and mpi_gatherv for the final result). This will change the communication time to Log(P)*N**2, but it's still increasing, and so the algorithm is still inherently non-scalable. And more generally my experience is that these master-slave algorithms are rarely worth investigating at all if you want performance - only if the compute is ferociously expensive do they have any worth.
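To make that concrete, here is a minimal sketch (not your code) of what a collective version could look like. To keep the MPI buffers contiguous in Fortran's column-major storage it splits the work over block columns of B and C rather than over rows of A as you do, and it assumes square n x n real(8) matrices and a serial dgemm available on every rank; the routine name and argument list are invented for the illustration:
subroutine bcast_multiply(A, B, C, n, comm)
   implicit none
   include 'mpif.h'
   integer, intent(in)    :: n, comm
   real(8), intent(inout) :: A(n,n)   ! valid on rank 0 on entry, broadcast to all ranks
   real(8), intent(in)    :: B(n,n)   ! only read on rank 0
   real(8), intent(out)   :: C(n,n)   ! complete on rank 0 on exit
   real(8), allocatable   :: B_loc(:,:), C_loc(:,:)
   integer, allocatable   :: counts(:), displs(:)
   integer :: np, rank, ierr, i, ncols
   call mpi_comm_size(comm, np, ierr)
   call mpi_comm_rank(comm, rank, ierr)
   ! one collective instead of np-1 separate point-to-point sends
   call mpi_bcast(A, n*n, mpi_real8, 0, comm, ierr)
   ! element counts and displacements of each rank's block of columns
   allocate(counts(np), displs(np))
   do i = 0, np-1
      counts(i+1) = (n/np + merge(1, 0, i < mod(n,np))) * n
   end do
   displs(1) = 0
   do i = 2, np
      displs(i) = displs(i-1) + counts(i-1)
   end do
   ncols = counts(rank+1) / n
   allocate(B_loc(n,ncols), C_loc(n,ncols))
   call mpi_scatterv(B, counts, displs, mpi_real8, &
                     B_loc, counts(rank+1), mpi_real8, 0, comm, ierr)
   ! every rank computes its own block of columns of C
   call dgemm('N', 'N', n, ncols, n, 1.d0, A, n, B_loc, n, 0.d0, C_loc, n)
   ! assemble the full result on rank 0
   call mpi_gatherv(C_loc, counts(rank+1), mpi_real8, &
                    C, counts, displs, mpi_real8, 0, comm, ierr)
   deallocate(B_loc, C_loc, counts, displs)
end subroutine bcast_multiply
Note this still broadcasts a full matrix to every rank, so while the communication now grows like Log(P)*N**2 rather than P*N**2, the memory use per process is unchanged - it is a marginal improvement only.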
The above could be helped if you increase N - this makes the compute more expensive compared with the communication, so you are more likely to see the decrease with P due to the former rather than the increase due to the latter, at least up to a larger value of P; at some point the communication time will always take over. But because you hold all the matrices on all processes, you can't solve any bigger a problem on P processes than you can on 1. This is what I mean by inherently non-scalable in memory.
A good solution will address both problems, and if you really want a good solution the two issues are coupled and must both be addressed - you must use a properly distributed-memory solution if you want good parallel scaling in time. For matrix multiplication an example is Cannon's algorithm which, in the spirit of the little model above, has compute time proportional to N**3/P and communication time proportional to N**2/Sqrt(P) - note these are both decreasing functions of P, so the time to solution here tends to zero; this is an algorithm that is scalable in time. Further, the memory use per process is proportional to N**2/P, again a decreasing function of P, so this algorithm is also scalable in memory. But you shouldn't write this yourself - a number of parallel libraries are available which implement these scalable algorithms for matrix multiplication. I use ScaLAPACK (see also parallel matrix multiplication using PBLAS), and MKL has a version of this library available; there are also free versions.
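To give an idea of what using such a library looks like, below is a rough outline of a ScaLAPACK/PBLAS pdgemm call. This is only a sketch: the matrix size, the block size and the 2x2 process grid are illustrative (it assumes the job is started on 4 MPI ranks), the matrices are filled with dummy values, error checking is omitted, and you still need to link against ScaLAPACK and BLACS (MKL ships both):
program pdgemm_sketch
   implicit none
   integer, parameter :: n = 4096, nb = 64   ! global size and blocking factor (illustrative)
   integer :: ictxt, nprow, npcol, myrow, mycol, info
   integer :: locr, locc, desca(9), descb(9), descc(9)
   integer, external :: numroc
   real(8), allocatable :: a(:,:), b(:,:), c(:,:)
   ! set up a 2 x 2 BLACS process grid (assumes 4 MPI ranks)
   nprow = 2; npcol = 2
   call blacs_get(-1, 0, ictxt)
   call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
   call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)
   ! local storage for the block-cyclically distributed matrices, roughly N**2/P each
   locr = numroc(n, nb, myrow, 0, nprow)
   locc = numroc(n, nb, mycol, 0, npcol)
   allocate(a(locr,locc), b(locr,locc), c(locr,locc))
   a = 1.d0; b = 2.d0; c = 0.d0              ! dummy data for the sketch
   call descinit(desca, n, n, nb, nb, 0, 0, ictxt, max(1,locr), info)
   call descinit(descb, n, n, nb, nb, 0, 0, ictxt, max(1,locr), info)
   call descinit(descc, n, n, nb, nb, 0, 0, ictxt, max(1,locr), info)
   ! distributed C = A*B - no process ever holds a full matrix
   call pdgemm('N', 'N', n, n, n, 1.d0, a, 1, 1, desca, &
               b, 1, 1, descb, 0.d0, c, 1, 1, descc)
   call blacs_gridexit(ictxt)
   call blacs_exit(0)
end program pdgemm_sketch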
Edit:
A couple of other thoughts
Have you made sure you are using single-threaded BLAS calls?
If you are only interested in single-node parallelism, just use a threaded (parallel) BLAS call instead - you will get very good performance for a minimal investment of effort.
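For the single-node case that can be as little as the sketch below, assuming you link against MKL's threaded layer (the thread count could equally be set with OMP_NUM_THREADS or MKL_NUM_THREADS; the value 4 and the matrix size are just examples):
program threaded_dgemm
   use mkl_service                      ! for mkl_set_num_threads
   implicit none
   integer, parameter :: n = 2048
   real(8), allocatable :: a(:,:), b(:,:), c(:,:)
   allocate(a(n,n), b(n,n), c(n,n))
   call random_number(a)
   call random_number(b)
   call mkl_set_num_threads(4)          ! let MKL use 4 threads on this node
   ! a single dgemm call, parallelised internally by MKL
   call dgemm('N', 'N', n, n, n, 1.d0, a, n, b, n, 0.d0, c, n)
   print *, 'c(1,1) =', c(1,1)
end program threaded_dgemm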

Related

How to use "if" in Fortran [duplicate]

I am kind of new to Fortran programming. Can anyone please help me out with a solution? I am having a problem generating integer random numbers in the range [0,5] in Fortran using random_seed and rand.
To support the answer by Alexander Vogt, I'll generalize.
The intrinsic random_number(u) returns a real number u (or an array of such) from the uniform distribution over the interval [0,1). [That is, it includes 0 but not 1.]
To have a discrete uniform distribution on the integers {n, n+1, ..., m-1, m}, carve the continuous distribution up into m+1-n equal-sized chunks, mapping each chunk to an integer. One way could be:
call random_number(u)
j = n + FLOOR((m+1-n)*u) ! We want to choose one from m-n+1 integers
As you can see, for the initial question for {0, 1, 2, 3, 4, 5} this reduces to
call random_number(u)
j = FLOOR(6*u) ! n=0 and m=5
and for the other case in your comment {-1, 0, 1}
call random_number(u)
j = -1 + FLOOR(3*u) ! n=-1 and m=1
Of course, other transformations will be required for sets of non-contiguous integers, and one should pay attention to numerical issues.
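For example, one way (among many) to handle a non-contiguous set is to list the allowed values in an array and pick an index uniformly; the values below are arbitrary:
integer, parameter :: allowed(4) = [ -3, 0, 2, 7 ]   ! any set of candidate values
real :: u
integer :: j
call random_number(u)
j = allowed(1 + floor(size(allowed)*u))   ! uniform over the 4 listed values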
What about:
program rand_test
use,intrinsic :: ISO_Fortran_env
real(REAL32) :: r(5)
integer :: i(5)
! call init_random_seed() would go here
call random_number(r)
! Uniform distribution requires floor: Thanks to #francescalus
i = floor( r*6._REAL32 )
print *, i
end program

MATMUL result not equal with explicit calculation for double precision?

Sorry for a seemingly stupid question. I was testing the computational efficiency of replacing do-loop operations on matrices with intrinsic functions. When I checked the matrix products computed by the two methods, it confused me that the two outputs were not the same. Here is the simplified code I used
program matmultest
integer,parameter::nx=64,ny=32,nz=16
real*8::mat1(nx,ny),mat2(ny,nz)
real*8::result1(nx,nz),result2(nx,nz),diff(nx,nz)
real*8::localsum
integer::i,j,m
do i=1,ny
do j=1,nx
mat1(j,i)=dble(j)/7d0+2.65d0*dble(i)
enddo
enddo
do i=1,nz
do j=1,ny
mat2(j,i)=5d0*dble(j)-dble(i)*0.45d0
enddo
enddo
do j=1,nz
do i=1,nx
localsum=0d0
do m=1,ny
localsum=localsum+mat1(i,m)*mat2(m,j)
enddo
result1(i,j)=localsum
enddo
enddo
result2=matmul(mat1,mat2)
diff=result2-result1
print*,sum(abs(diff)),maxval(diff)
end program matmultest
And the result gives
1.6705598682165146E-008 5.8207660913467407E-011
The difference is non-zero for real*8, but zero when I later tested with integers. I wonder whether it is because of a fault somewhere in my code, or whether the numerical accuracy of MATMUL() is only single precision?
And the compiler I am using is GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Thanks!
francescalus explained that reordering of operations causes these differences. Let's try to find out how it actually happened.
A few words about matrix product
Consider matrices A(n,p), B(p,q), C(n,q) and C = A*B.
The naive approach, a variant of which you used, involves the following nested loops:
c = 0
do i = 1, n
   do j = 1, q
      do k = 1, p
         c(i, j) = c(i, j) + a(i, k) * b(k, j)
      end do
   end do
end do
These loops can be executed in any of 6 orders, depending on the variable that you choose at each level. In the example above, the loop is named "ijk", and the other variants "ikj", "jik", etc. are all correct.
There is a speed difference, due to the memory cache: when the inner loop runs across contiguous memory elements, the loop is faster. Those are the jki or kji cases.
Indeed, since Fortran matrices are stored in column-major order, if the innermost loop runs on i, then in the instruction c(i, j) = c(i, j) + a(i, k) * b(k, j) the value b(k, j) is constant, and the operation is equivalent to v(i) = v(i) + x * u(i), where the elements of the vectors v and u are contiguous.
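For reference, a sketch of the cache-friendly jki variant of the same product (the same additions per element, but with the innermost loop running over contiguous elements of a and c):
c = 0
do j = 1, q
   do k = 1, p
      do i = 1, n
         c(i, j) = c(i, j) + a(i, k) * b(k, j)   ! b(k, j) is constant in the inner loop
      end do
   end do
end do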
However, regarding the order of operations, there shouldn't be a difference: whatever the loop ordering, each element of C is computed by adding the same products in the same order. At least at the "higher level": the compiler might optimize things differently, and that's where it becomes really interesting.
What about MATMUL? I believe it's usually a naive matrix product, based on the nested loops above, say a jki loop.
There are other ways to multiply matrices that involve the Strassen algorithm to improve the algorithmic complexity, or blocking (i.e. computing products of submatrices) to improve cache use. Other methods that could change the result are OpenMP (i.e. multithreading), or using FMA instructions. But here we are not going to delve into these methods; it's really only about the nested loops. If you are interested, there are many resources online, check this.
A few words about optimization
Three remarks first:
On a processor without SIMD instructions, you would get the same result as MATMUL (i.e. you would print zero in the end).
If you had implemented the loops as above, you would also get the same result. There is a tiny but significant difference in your code.
If you had implemented the loops as a subroutine, you would also get the same result. Here I suspect the compiler optimizer is doing some reordering, as I can't reproduce your "accumulator" code with a subroutine, at least with Intel Fortran.
Here is your implementation:
do i = 1, n
do j = 1, p
s = 0
do k = 1, q
s = s + a(i, k) * b(k, j)
end do
c(i, j) = s
end do
end do
It's also correct of course. Here, you are using an accumulator, and at the end of the innermost loop, the value of the accumulator is written in the matrix C.
Optimization mainly matters for the innermost loop. For our purpose, two "basic" operations in the innermost loop are relevant, once we strip away all other details:
v(i) = v(i) + x*u(i) (the jki loop)
s = s + x(k)*y(k) (the accumulator loop where y is contiguous in memory, but not x)
The first is usually called a "daxpy" (from the name of a BLAS routine), for "A X Plus Y", the "D" meaning double precision. The second one is just an accumulator.
On an old sequential processor, there is not much to be done to optimize. On a modern processor with SIMD, registers can hold several values, and computations can be done on all of them at once, in parallel. For instance, on x86, an XMM register (from the SSE instruction set) can hold two double precision floating-point numbers. A YMM register (from AVX) can hold four numbers, and a ZMM register (AVX-512, found on Xeons) can hold eight numbers.
For instance, on YMM the innermost loop will be "unrolled" to deal with four vector elements at a time (or even more if using several registers).
Here is what the basic loop block is then roughly doing:
daxpy case:
Read 4 numbers from u into register YMM1
Read 4 numbers from v into register YMM2
x is constant and is kept in another register
Multiply in parallel x with YMM1, add in parallel to YMM2, put the result in YMM2
Write back the result to corresponding elements of v
The read/write part is faster if the elements are contiguous in memory, but if they are not it's still worth doing this in parallel.
Note that here, we haven't changed the execution order of additions of the high level Fortran loop.
accumulator case
For the parallelism to be useful, there will be a trick: accumulate four values in parallel in a YMM register, and then add the four accumulated values.
The basic loop block is thus doing this:
The accumulator is kept in YMM3 (four numbers)
Read 4 numbers from X into register YMM1
Read 4 numbers from Y into register YMM2
Multiply in parallel YMM1 with YMM2, add in parallel to YMM3
At the end of the innermost loop, add the four components of the accumulator, and write this back as the matrix element.
It's as if we had computed:
s1 = x(1)*y(1) + x(5)*y(5) + ... + x(29)*y(29)
s2 = x(2)*y(2) + x(6)*y(6) + ... + x(30)*y(30)
s3 = x(3)*y(3) + x(7)*y(7) + ... + x(31)*y(31)
s4 = x(4)*y(4) + x(8)*y(8) + ... + x(32)*y(32)
And then the matrix element written is c(i,j) = s1+s2+s3+s4.
Here the order of the additions has changed! And since floating-point addition is not associative, a different order very likely gives a (slightly) different result.
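If you want to see the effect in isolation, here is a small, self-contained toy (the magnitudes are arbitrary) that sums the same products once with a plain accumulator and once with four interleaved partial sums, the way a 4-wide SIMD unit would:
program sum_order
   implicit none
   integer, parameter :: n = 32
   real(8) :: x(n), y(n), s, s4(4)
   integer :: k
   call random_seed()
   call random_number(x)
   call random_number(y)
   x = (x - 0.5d0) * 1d6            ! spread the magnitudes a little
   ! plain accumulator: one addition after the other
   s = 0d0
   do k = 1, n
      s = s + x(k)*y(k)
   end do
   ! four interleaved partial sums, added together at the end
   s4 = 0d0
   do k = 1, n, 4
      s4 = s4 + x(k:k+3)*y(k:k+3)
   end do
   print *, s - sum(s4)             ! typically tiny, but not exactly zero
end program sum_order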
I can replicate the results when using fast math (I have Intel Fortran), and when I compile with the default /fp:fast I get the following max error and speed
! Error Loops Matmul
! 0.58208E-10 107526.9 140056.0 FAST
The error is just maxval(abs(diff)); the speed is measured in number of matrix multiplications per second.
But when I compile with /fp:strict then I get no error, but a slowdown with the loops
! Error Loops Matmul
! 0.0000 43140.6 141844.0 STRICT
I see about a 60% slowdown in the loops with strict floating-point handling, but surprisingly no slowdown with the matmul() function.
Source Code for completeness
program Console1
use iso_fortran_env
implicit none
integer,parameter :: nr = 100000
integer,parameter::nx=64,ny=32,nz=16
real(real64)::mat1(nx,ny),mat2(ny,nz)
real(real64)::result1(nx,nz),result2(nx,nz),diff(nx,nz)
real(real64)::localsum
integer::i,j,r
integer(int64) :: tic, toc, rate
real(real64) :: dt1, dt2
do i=1,ny
do j=1,nx
mat1(j,i)=dble(j)/7d0+2.65d0*dble(i)
enddo
enddo
do i=1,nz
do j=1,ny
mat2(j,i)=5d0*dble(j)-dble(i)*0.45d0
enddo
enddo
call SYSTEM_CLOCK(tic,rate)
do r=1, nr
result1=mymatmul(mat1,mat2)
end do
call SYSTEM_CLOCK(toc,rate)
dt1 = dble(toc-tic)/rate
call SYSTEM_CLOCK(tic,rate)
do r=1, nr
result2=matmul(mat1,mat2)
end do
call SYSTEM_CLOCK(toc,rate)
dt2 = dble(toc-tic)/rate
diff=result2-result1
print ('(1x,a16,1x,a16,1x,a16)'), "Error", "Loops", "Matmul"
print ('(1x,g16.5,1x,f16.1,1x,f16.1)'), maxval(abs(diff)), nr/dt1, nr/dt2
! Error Loops Matmul
! 0.58208E-10 107526.9 140056.0 FAST
! 0.0000 43140.6 141844.0 STRICT
!
contains
pure function mymatmul(a,b) result(c)
real(real64), intent(in) :: a(:,:), b(:,:)
real(real64) :: c(size(a,1), size(b,2))
integer :: i,j,k
real(real64) :: sum
do j=1, size(c,2)
do i=1, size(c,1)
sum = 0d0
do k=1, size(a,2)
sum = sum + a(i,k)*b(k,j)
end do
c(i,j) = sum
end do
end do
end function
end program Console1
Always compiled as Release-x64 and not Debug.

Summing and Gathering elements of array element-wise in MPI

After doing calculations to multiply a matrix with a vector using a Cartesian topology, I got the following processes with their ranks and vectors.
P0 (process with rank = 0) =[2 , 9].
P1 (process with rank = 1) =[2 , 3]
P2 (process with rank = 2) =[1 , 9]
P3 (process with rank = 3) =[4 , 6].
Now I need to sum the elements of the even-rank processes and of the odd ones separately, like this:
temp1 = [3 , 18]
temp2 = [6 , 9]
and then , gather the results in a different vector, like this:
result = [3 , 18 , 6 , 9]
My attempt to do it is to use MPI_Reduce and then MPI_Gather like this:
// Previous code
double* temp1 , *temp2;
if(myrank %2 == 0){
BOOLEAN flag = Allocate_vector(&temp1 ,local_m); // function to allocate space for vectors
MPI_Reduce(local_y, temp1, local_n, MPI_DOUBLE, MPI_SUM, 0 , comm);
MPI_Gather(temp1, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE,0, comm);
free(temp1);
}
else{
Allocate_vector(&temp2 ,local_m);
MPI_Reduce(local_y, temp2, local_n , MPI_DOUBLE, MPI_SUM, 0 , comm);
MPI_Gather(temp2, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0,comm);
free(temp2);
}
But the answer is not correct. It seems that the code sums all the elements of the even and odd processes together and then aborts with a memory error:
Wrong_result = [21 15 0 0]
and this error
*** Error in `./test': double free or corruption (fasttop): 0x00000000013c7510 ***
*** Error in `./test': double free or corruption (fasttop): 0x0000000001605b60 ***
It won't work the way you are trying to do it. To perform reduction over the elements of a subset of processes, you have to create a subcommunicator for them. In your case, the odd and the even processes share the same comm, therefore the operations are not over the two separate groups of processes but rather over the combined group.
You should use MPI_Comm_split to perform a split, perform the reduction using the two new subcommunicators, and finally have rank 0 in each subcommunicator (let's call those leaders) participate in the gather over another subcommunicator that contains those two only:
// Make sure rank is set accordingly
MPI_Comm_rank(comm, &rank);
// Split even and odd ranks in separate subcommunicators
MPI_Comm subcomm;
MPI_Comm_split(comm, rank % 2, 0, &subcomm);
// Perform the reduction in each separate group
double *temp;
Allocate_vector(&temp, local_n);
MPI_Reduce(local_y, temp, local_n , MPI_DOUBLE, MPI_SUM, 0, subcomm);
// Find out our rank in subcomm
int subrank;
MPI_Comm_rank(subcomm, &subrank);
// At this point, we no longer need subcomm. Free it and reuse the variable.
MPI_Comm_free(&subcomm);
// Separate both group leaders (rank 0) into their own subcommunicator
MPI_Comm_split(comm, subrank == 0 ? 0 : MPI_UNDEFINED, 0, &subcomm);
if (subcomm != MPI_COMM_NULL) {
MPI_Gather(temp, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0, subcomm);
MPI_Comm_free(&subcomm);
}
// Free resources
free(temp);
The result will be in gResult of rank 0 in the latter subcomm, which happens to be rank 0 in comm because of the way the splits are performed.
Not as simple as expected, I guess, but that is the price of having convenient collective operations in MPI.
On a side note, in the code shown you are allocating temp1 and temp2 to be of length local_m, while in all collective calls the length is specified as local_n. If it happens that local_n > local_m, then heap corruption will occur.

C++ split integer array into chunks

I guess my question has 2 parts:
(1) Is this the right approach to send different chunks of an array to different processors?
Let's say I have n processors whose rank ranges from 0 to n-1.
I have an array of size d. I want to split this array into k equally-sized chunks. Assume d is divisible by k.
I want to send each of these chunks to a processor whose rank is less than k.
It would be easy if I could use something like MPI_Scatter, but this function sends a chunk to EVERY processor, and I only want to send to a certain number of procs.
So what I did was loop over k iterations and do k MPI_Isend's.
Is this efficient?
(2) If it is, how do I split an array into chunks? There's always the easy way, which is
int size = d/k;
int buffs[k][size];
for (int rank = 0; rank < k; ++rank)
{
    for (int i = 0; i < size; ++i)
        buffs[rank][i] = input[rank*size + i];
    MPI_Isend(&buffs[rank], size, MPI_INT, rank, 1, comm, &request);
}
What you are looking for is MPI_Scatterv which allows you to explicitly specify the length of each chunk and its position relative to the beginning of the buffer. If you don't want to send data to certain ranks, simply set the length of their chunks to 0:
int blen[n];
MPI_Aint displ[n];
for (int rank = 0; rank < n; rank++)
{
blen[rank] = (rank < k) ? size : 0;
displ[rank] = rank * size;
}
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Scatterv(input, blen, displ, MPI_INT,
mybuf, myrank < k ? size : 0, MPI_INT,
0, MPI_COMM_WORLD);
Note that for rank >= k the displacements will run past the end of the buffer. That is all right since block lengths are set to zero for rank >= k and no data will be accessed.
As for your original approach, it is not portable and might not always work. The reason is that you are overwriting the same request handle and you never wait for the sends to complete. The correct implementation is:
MPI_Request request[k];
for (int rank = 0; rank < k; ++rank)
{
MPI_Isend(&input[rank*size], size, MPI_INT, rank, 1, comm, &request[rank]);
}
MPI_Waitall(k, request, MPI_STATUSES_IGNORE);
The optimal implementation would be to use MPI_Scatter in a subcommunicator:
MPI_Comm subcomm;
MPI_Comm_split(MPI_COMM_WORLD, myrank < k ? 0 : MPI_UNDEFINED, myrank,
&subcomm);
// Now there are k ranks in subcomm
// Perform the scatter in the subcommunicator
if (subcomm != MPI_COMM_NULL)
MPI_Scatter(input, size, MPI_INT, mybuf, size, MPI_INT, 0, subcomm);
The MPI_Comm_split call splits MPI_COMM_WORLD and creates a new communicator from all original ranks less than k. It uses the original rank as the key for ordering the ranks in the new communicator, therefore rank 0 in MPI_COMM_WORLD becomes rank 0 in subcomm. Since MPI_Scatter often performs better than MPI_Scatterv, this is the best-performing solution.

openmp issues when three do-loops are involved (fortran)

I am very confused about this problem regarding openmp in fortran. Specifically, when I write the program like this:
PROGRAM TEST
IMPLICIT NONE
INTEGER :: i,j,l
INTEGER :: M(2,2)
i=2
j=2
l=41
!$OMP PARALLEL SHARED(M),PRIVATE(l,i,j)
!$OMP DO
DO i=1,2
DO j=1,2
DO l=0,41
M(i,j)=M(i,j)+1
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL
END PROGRAM TEST
After compiling with ifort -openmp test.f90, it works well, and the result M(1,1) is 42 as expected.
However, when I merely swap the order of the loop over l and the loops over {i,j}, like the following:
PROGRAM TEST
IMPLICIT NONE
INTEGER :: i,j,l
INTEGER :: M(2,2)
i=2
j=2
l=41
!$OMP PARALLEL SHARED(M),PRIVATE(l,i,j)
!$OMP DO
DO l=0,41
DO i=1,2
DO j=1,2
M(i,j)=M(i,j)+1
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL
END PROGRAM TEST
After compiling with ifort -openmp test.f90, it doesn't work well. In fact, when you run a.out several times, the result M(1,1) seems to be random. Does anyone know what the problem is? Also, if I want to obtain the right results with the summation order
DO l=0,41
DO i=1,2
DO j=1,2
what part of this code should I modify?
Many thanks for any help.
You have a race condition: threads with different l are trying to update the same element M(i,j). You can use tools like Intel Inspector or Oracle Thread Analyzer to find it (I checked with Intel). The best thing to do is to use your original loop order. You can also use a reduction, but be careful with larger arrays:
PROGRAM TEST
IMPLICIT NONE
INTEGER :: i,j,l
INTEGER :: M(2,2)
M = 0
!$OMP PARALLEL DO PRIVATE(l,i,j),reduction(+:M)
DO l = 0, 41
DO i = 1, 2
DO j = 1, 2
M(i,j) = M(i,j) + 1
END DO
END DO
END DO
!$OMP END PARALLEL DO
print *, M
END PROGRAM
There are several problems with your approach. First of all, you never initialize your array M. Inside your loop, you execute
M(i,j) = M(i,j) + 1
without having given any initial value to M(i,j). So the result is indeterminate even in the serial case, and it is just a matter of luck that you obtain the right result with any specific compiler or any specific summation order.
Additionally, if you parallelize the loop over l, like
!$OMP PARALLEL DO SHARED(M),PRIVATE(l,i,j)
DO l = 0, 41
DO i = 1, 2
DO j = 1, 2
M(i,j) = M(i,j) + 1
END DO
END DO
END DO
every thread will have its own nested loop construct over i and j covering all matrix elements. Consequently, different threads will access the same elements of the matrix at the same time, and the result is again indeterminate. You could of course try to solve the issue by ensuring, via OpenMP constructs, that the threads wait on each other before accessing a certain matrix element. However, that would make the algorithm far too slow. The best you can do in this case, in my opinion, is to parallelize over the matrix elements (the loops over i and j), for example as sketched below.
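A sketch of that idea (essentially your first loop ordering with M initialized, and with the two outer loops collapsed so that each (i,j) element is only ever updated by a single thread):
M = 0
!$OMP PARALLEL DO SHARED(M) PRIVATE(i,j,l) COLLAPSE(2)
DO i = 1, 2
   DO j = 1, 2
      DO l = 0, 41
         M(i,j) = M(i,j) + 1
      END DO
   END DO
END DO
!$OMP END PARALLEL DO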
By the way, the lines
i=2
j=2
l=41
in your code are superfluous, since you immediately use the variables as loop counters, so their values will be overwritten anyway.