Packed storage with BLAS function - fortran

I want to calculate A*x with A lower triangle matrix and x the vector. For example:
1 0 0
A = 2 4 0
3 5 6
with packed storage
A = (/ 1, 2, 3, 4, 5, 6/)
and
X = (/1, 1, 1/)
Now I want to do A*x with BLAS function, shoud I tranform A back to be a 3x3 matrix? If not, could you please give me some hint? (I know in memory of fortran array, A is contiguously stored)

solved by checking: http://www.icl.utk.edu/~mgates3/docs/lapack.html
program main
implicit none
integer :: n
real*8, allocatable, dimension(:) :: x
real*8, allocatable, dimension(:) :: A
n = 3
allocate(A(n*(n+1)/2))
allocate(x(n))
A = 1.0d0
x = 1.0d0
! x will be updated as A*x
call dtpmv('L', 'N', 'N', n, A, x, 1)
deallocate(A)
deallocate(x)
end program main

Related

Search for odd indices using the section method

I don't understand how I can implement the search for even elements using the section method.
I did a search for odd elements, but I need to find even ones
I need it to output 2 4 6 8, but my program outputs 1 3 5 7 9
program main
implicit none
integer, DIMENSION(3, 3) :: array = reshape((/ 1, 2, 3, 4, 5, 6, 7, 8, 9 /), shape(array))
integer :: i = 0
integer, allocatable :: B(:)
B = [(Array(::2,i), i=1, 3)]
print *, B
end program main
If you're interested in the elements with odd indices, you want
B = [(Array(modulo(i,2)+1::2, i), i=1, 3)]
modulo(i,2)+1 is 2 when i is odd, and 1 when i is even. This means that for columns with odd i you select every other element starting at the second element, and for columns with even i you select every other element starting at the first element.
If instead you're interested in selecting the odd values from an arbitrary array, you can't do this with a simple slice, and you instead need a conditional filter. For example,
B = [integer::]
do i=1,3
do j=1,3
if (modulo(Array(j,i),2)==0) then
B = [B, Array(j,i)]
endif
enddo
enddo

MPI_scatterv return MPI_ERROR_TRUNCATE in fortran

What s wrong with the following code? It works fine when I run it with 1,2 and 5 cpus,
while with 3 cpus, it breaks at (1), and with 4 cpus, it breaks at (2). In both the cases the error reads:
An error occurred in MPI_Scatterv
reported by process [3248685057,3]
on communicator MPI_COMM_WORLD
MPI_ERR_TRUNCATE: message truncated.
Probably the problem is correlated with memory allocation, but I cannot understand it deeply...(I am not totally confident about how I allocated the distributed variables (3)) Some suggestion is warmly accepted
program import_and_divide
implicit none
include 'mpif.h'
integer :: i, k, io, n ,nnz
integer,allocatable, dimension(:) :: Ai, Aj
real*8, allocatable, dimension(:) :: Aa, Ab, x
integer*4 :: rank, mpi_stat, size
integer, allocatable, dimension(:) :: sendcounts, displ, sendcounts1, displ1
integer :: n_distro
integer, allocatable, dimension(:) :: ia, ja
real*8, allocatable, dimension(:) :: a, b
n = 5
nnz = 13
!.. Initialize MPI.
call MPI_INIT(mpi_stat)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, mpi_stat)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, mpi_stat)
if (rank.eq.0) then
allocate(Ai(n+1), Aj(nnz), Aa(nnz), Ab(n), x(n))
Ai = (/0, 3, 5, 8, 11, 13/)
Aj = (/0, 1, 3, 0, 1, 2, 3, 4, 0, 2, 3, 1, 4/)
Aa = (/1, -1, -3, -2 , 5, 4, 6, 4, -4, 2, 7, 8, -5/)
Ab(:) = ([1, 2, 3, 4, 5])
x(:) = 0.d0
print*, 'Number of cpus: ', size
print*, 'n, nnz: ', n, nnz
endif
call MPI_BARRIER(MPI_COMM_WORLD, mpi_stat)
allocate(sendcounts(size), displ(size), sendcounts1(size), displ1(size))
n_distro=(n+1)/size
k = 0
do i=1,size
if (i<size) then
sendcounts(i) = n_distro
else
sendcounts(i) = (n+0)-(size-1)*n_distro
endif
displ(i) = k
k = k + sendcounts(i)
end do
if (rank.eq.0) then
displ1 = Ai(displ+1)
do i=1,size-1
sendcounts1(i) = displ1(i+1)-displ1(i)
end do
sendcounts1(size) = nnz-displ1(size)
endif
call MPI_BARRIER(MPI_COMM_WORLD,mpi_stat)
if (rank.eq.0) then
print*, 'Sendcounts: ', sendcounts
print*, 'Displ: ', displ
endif
call MPI_BCAST(displ1,shape(displ1), MPI_INT,0,MPI_COMM_WORLD,mpi_stat)
call MPI_BCAST(sendcounts1, shape(sendcounts1), MPI_INT,0,MPI_COMM_WORLD,mpi_stat)
do i=0,size-1
if (i.eq.rank) then
allocate(b(sendcounts(i+1)), ia(sendcounts(i+1)+1) ,&
a(sendcounts1(i+1)), ja(sendcounts1(i+1))) !(3)
ia(:) = rank
ja(:) = 0
b(:) = 0.d0
a(:) = 0.d0
end if
call MPI_BARRIER(MPI_COMM_WORLD,mpi_stat)
end do
call MPI_scatterv(Ab, sendcounts, displ, MPI_DOUBLE,&
b, sendcounts, MPI_DOUBLE,0, MPI_COMM_WORLD,mpi_stat) ! (2) breaks here with mpirun -np 4
call MPI_scatterv(Ai, sendcounts+1, displ, MPI_INT,&
ia, sendcounts+1, MPI_INT,0, MPI_COMM_WORLD,mpi_stat)
call MPI_scatterv(Aa, sendcounts1, displ1, MPI_DOUBLE,&
a, sendcounts1, MPI_DOUBLE,0, MPI_COMM_WORLD,mpi_stat)
call MPI_scatterv(Aj, sendcounts1, displ1, MPI_INT,&
ja, sendcounts1, MPI_INT,0, MPI_COMM_WORLD,mpi_stat) ! (1) breaks here with mpirun -np 3
call MPI_FINALIZE(mpi_stat)
end
When I replace include "mpif.h" with use mpi and I got the following errors:
There is no specific subroutine for the generic ‘mpi_bcast’
There is no specific subroutine for the generic ‘mpi_scatterv’.
I also add -fcheck=all and this option provides this additional info (when compiled with mpif.h):
Allocatable actual argument 'ab' is not allocated
but this do not clarify my idea.
If the arrays Ai, Aj, Ab, Aa are allocated on all the processors, I got the same behaviour (works with 1,2 and 5 cpus and breaks with 3 and 4 cpus) and disappear the message Allocatable actual argument 'ab' is not allocated
If I replace sendcounts and sendcounts1 in the second half of MPI_scatterv with sendcounts(rank+1) and sendcounts1(rank+1), the code seems to work fine only if I remove the compiler flag -fcheck=all/-fsanitize=address(if the arrays Ai,Aj,Aa,Ab are allocated on all the processes as well as if allocated only on one process (rank=0)). With one of these two options I got the following error:
==25041==ERROR: LeakSanitizer: detected memory leaks

Value of variable for every single iteration

This is my code:
Program Arrays_0
Implicit none
Integer :: i , Read_number , Vig_Position , Vipg_Position , n_iter
Integer , parameter :: Br_gra = 12
Integer , parameter , dimension ( Br_gra ) :: Vig = [ ( i , i = 1 , Br_gra) ]
Integer , parameter , dimension ( Br_gra ) :: Vipg = [ 0 , 1 , 1 , 1 , 2 , 2 , 3 , 4 , 4 , 7 , 7 , 7 ]
Integer :: Result_of_calculation
Write(*,*)"Enter the number (From 1 to Br_gra):"
Read(*,*) Read_number
Vig_Position = Vig(Read_number)
Vipg_Position = Vipg(Vig_Position)
n_iter = 0
Result_of_calculation = Vig_Position
Do while( Vipg_Position .ne. Vipg(1) )
n_iter = n_iter + 1
Vig_Position = Vipg_Position
Result_of_calculation = Result_of_calculation + Vig_Position
Vipg_Position = Vipg(Vig_Position)
End Do
Write(*,'(a,1x,i0)')"The number of iteration is:",n_iter
Write(*,'(a,1x,i0)')"The result of calculation is:",Result_of_calculation
End Program Arrays_0
Intention is to get value in every iteration for a variables:
Vig_Position , Result_of_calculation and Vipg_position.
How to declare variables for that kind of calculation?
In general, is there other method for counting a number of iteration?
How to declare variables in function of number of iteration befoure the code set that number like result of calculation?
Now that the question has been clarified, here's a typical way of solving the problem in Fortran. It isn't the only possible way, but it is the most general. The strategy in routine resize to double the old size is reasonable - you want to minimize the number of times this is called. The data set in the sample program is small, so to show the effect I allocated the array very small to begin with. In reality, you would want a reasonably large initial allocation (say, 100 at least).
Note the use of an internal procedure that inherits the type vals_t from its host.
Program Arrays_0
Implicit none
Integer :: i , Read_number , Vig_Position , Vipg_Position , n_iter
Integer , parameter :: Br_gra = 12
Integer , parameter , dimension ( Br_gra ) :: Vig = [ ( i , i = 1 , Br_gra) ]
Integer , parameter , dimension ( Br_gra ) :: Vipg = [ 0 , 1 , 1 , 1 , 2 , 2 , 3 , 4 , 4 , 7 , 7 , 7 ]
Integer :: Result_of_calculation
! Declare a type that will hold one iteration's values
type vals_t
integer Vig_Position
integer Vipg_Position
integer Result_of_calculation
end type vals_t
! Declare an allocatable array to hold the values
! Initial size doesn't matter, but should be close
! to a lower limit of possible sizes
type(vals_t), allocatable :: vals(:)
allocate (vals(2))
Write(*,*)"Enter the number (From 1 to Br_gra):"
Read(*,*) Read_number
Vig_Position = Vig(Read_number)
Vipg_Position = Vipg(Vig_Position)
n_iter = 0
Result_of_calculation = Vig_Position
Do while( Vipg_Position .ne. Vipg(1) )
n_iter = n_iter + 1
Vig_Position = Vipg_Position
Result_of_calculation = Result_of_calculation + Vig_Position
Vipg_Position = Vipg(Vig_Position)
! Do we need to make vals bigger?
if (n_iter > size(vals)) call resize(vals)
vals(n_iter) = vals_t(Vig_Position,Vipg_Position,Result_of_calculation)
End Do
Write(*,'(a,1x,i0)')"The number of iteration is:",n_iter
Write(*,'(a,1x,i0)')"The result of calculation is:",Result_of_calculation
! Now vals is an array of size(vals) of the sets of values
! For demonstration, print the size of the array and the values
Write(*,'(a,1x,i0)')"Size of vals is:", size(vals)
Write(*,'(3i7)') vals(1:n_iter)
contains
! Subroutine resize reallocates the array passed to it
! with double the current size, copies the old data to
! the new array, and transfers the allocation to the
! input array
subroutine resize(old_array)
type(vals_t), allocatable, intent(inout) :: old_array(:)
type(vals_t), allocatable :: new_array(:)
! Allocate a new array at double the size
allocate (new_array(2*size(old_array)))
write (*,*) "Allocated new array of size ", size(new_array)
! Copy the data
new_array(1:size(old_array)) = old_array
! Transfer the allocation to old_array
call MOVE_ALLOC (FROM=new_array, TO=old_array)
! new_array is now deallocated
return
end subroutine resize
End Program Arrays_0
Sample output:
Enter the number (From 1 to Br_gra):
12
Allocated new array of size 4
The number of iteration is: 3
The result of calculation is: 23
Size of vals is: 4
7 3 19
3 1 22
1 0 23

Fortran loop for discrete values

Is there any way to run a loop for discrete values of a variable? What about in some latest version?
Something like
for i in 1 5 9 11 31 77
used in Unix shell script?
integer, dimension (5) :: indx = [5, 9, 11, 31, 71]
do i=1, size(indx)
j=indx(i)
....
end do
You can also use an implied do loop to accomplish this, but you'll have to define the array of values as above:
integer, dimension (5) :: indx = [5, 9, 11, 31, 71]
integer, dimension (5) :: rslt
integer, external :: func
rslt = (/ func(indx(j)), j=1,5 /)
I am not sure if this helps, but you can use arrays for indeces
program Console1
implicit none
! Variables
INTEGER :: X(4) = (/ 1, 3, 5, 7 /)
REAL :: Y(10) = 0.0
! Body of Console1
print *, X
! 1 3 5 7
Y(X) = 10.0/X
print *, Y
! 10.0 0.0 3.33 0.0 2.00 0.0 1.428 0.0 ...
end program Console1

Sending 2D arrays in Fortran with MPI_Gather

I want to send 2d chunks of data using MPI_GATHER. For example: I have 2x3 arrays on each node and I want 8x3 array on root, if I have 4 nodes. For 1d arrays, MPI_GATHER sorts data according to MPI ranks, but for 2d data it creates a mess!
What is the clean way to put chunks in order?
I expected the output of this code:
program testmpi
use mpi
implicit none
integer :: send (2,3)
integer :: rec (4,3)
integer :: ierror,my_rank,i,j
call MPI_Init(ierror)
MPI_DATA_TYPE type_col
! find out process rank
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
if (my_rank==0) then
send=1
do i=1,2
print*,(send(i,j),j=1,3)
enddo
endif
if (my_rank==1) then
send=5
! do 1,2
! print*,(send(i,j),j=1,3)
! enddo
endif
call MPI_GATHER(send,6,MPI_INTEGER,rec,6,MPI_INTEGER,0,MPI_COMM_WORLD,ierror)
if (my_rank==0) then
print*,'<><><><><>rec'
do i=1,4
print*,(rec(i,j),j=1,3)
enddo
endif
call MPI_Finalize(ierror)
end program testmpi
to be something like this :
1 1 1
1 1 1
5 5 5
5 5 5
but it looks like this:
1 1 5
1 1 5
1 5 5
1 5 5
The following a literal Fortran translation of this answer. I had thought this was unnecessary, but the multiple differences in array indexing and memory layout might mean that it is worth doing a Fortran version.
Let me start by saying that you generally don't really want to do this - scatter and gather huge chunks of data from some "master" process. Normally you want each task to be chugging away at its own piece of the puzzle, and you should aim to never have one processor need a "global view" of the whole data; as soon as you require that, you limit scalability and the problem size. If you're doing this for I/O - one process reads the data, then scatters it, then gathers it back for writing, you'll want eventually to look into MPI-IO.
Getting to your question, though, MPI has very nice ways of pulling arbitrary data out of memory, and scatter/gathering it to and from a set of processors. Unfortunately that requires a fair number of MPI concepts - MPI Types, extents, and collective operations. A lot of the basic ideas are discussed in the answer to this question -- MPI_Type_create_subarray and MPI_Gather .
Consider a 1d integer global array that task 0 has that you want to distribute to a number of MPI tasks, so that they each get a piece in their local array. Say you have 4 tasks, and the global array is [0,1,2,3,4,5,6,7]. You could have task 0 send four messages (including one to itself) to distribute this, and when it's time to re-assemble, receive four messages to bundle it back together; but that obviously gets very time consuming at large numbers of processes. There are optimized routines for these sorts of operations - scatter/gather operations. So in this 1d case you'd do something like this:
integer, dimension(8) :: global ! only root has this
integer, dimension(2) :: local ! everyone has this
integer, parameter :: root = 0
integer :: rank, comsize
integer :: i, ierr
call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
if (rank == root) then
global = [ (i, i=1,8) ]
endif
call MPI_Scatter(global, 2, MPI_INTEGER, & ! send everyone 2 ints from global
local, 2, MPI_INTEGER, & ! each proc recieves 2 into
root, & ! sending process is root,
MPI_COMM_WORLD, ierr) ! all procs in COMM_WORLD participate
After this, the processors' data would look like
task 0: local:[1,2] global: [1,2,3,4,5,6,7,8]
task 1: local:[3,4] global: [garbage]
task 2: local:[5,6] global: [garbage]
task 3: local:[7,8] global: [garbage]
That is, the scatter operation takes the global array and sends contiguous 2-int chunks to all the processors.
To re-assemble the array, we use the MPI_Gather() operation, which works exactly the same but in reverse:
local = local + rank
call MPI_Gather (local, 2, MPI_INTEGER, & ! everyone sends 2 ints from local
global, 2, MPI_INTEGER, & ! root receives 2 ints each proc into global
root, & ! receiving process is root,
MPI_COMM_WORLD, ierr) ! all procs in COMM_WORLD participate
And now the arrays look like:
task 0: local:[1,2] global: [1,2,4,5,7,8,10,11]
task 1: local:[4,5] global: [garbage-]
task 2: local:[7,8] global: [garbage-]
task 3: local:[10,11] global: [garbage-]
Gather brings all the data back.
What happens if the number of data points doesn't evenly divide the number of processes, and we need to send different numbers of items to each process? Then you need a generalized version of scatter, MPI_Scatterv, which lets you specify the counts for each processor, and displacements -- where in the global array that piece of data starts. So let's say with the same 4 tasks you had an array of characters [a,b,c,d,e,f,g,h,i] with 9 characters, and you were going to assign every process two characters except the last, that got three. Then you'd need
character, dimension(9) :: global
character, dimension(3) :: local
integer, dimension(4) :: counts
integer, dimension(4) :: displs
if (rank == root) then
global = [ (achar(i+ichar('a')), i=0,8) ]
endif
local = ['-','-','-']
counts = [2,2,2,3]
displs = [0,2,4,6]
mycounts = counts(rank+1)
call MPI_Scatterv(global, counts, displs, & ! proc i gets counts(i) chars from displs(i)
MPI_CHARACTER, &
local, mycounts, MPI_CHARACTER, & ! I get mycounts chars into
root, & ! root rank does sending
MPI_COMM_WORLD, ierr) ! all procs in COMM_WORLD participate
Now the data looks like
task 0: local:"ab-" global: "abcdefghi"
task 1: local:"cd-" global: *garbage*
task 2: local:"ef-" global: *garbage*
task 3: local:"ghi" global: *garbage*
You've now used scatterv to distribute the irregular amounts of data. The displacement in each case is two*rank (measured in characters; the displacement is in unit of the types being sent for a scatter or received for a gather; it's not generally in bytes or something) from the start of the array, and the counts are [2,2,2,3]. If it had been the first processor we wanted to have 3 characters, we would have set counts=[3,2,2,2] and displacements would have been [0,3,5,7]. Gatherv again works exactly the same but reverse; the counts and displs arrays would remain the same.
Now, for 2D, this is a bit trickier. If we want to send 2d sublocks of a 2d array, the data we're sending now no longer is contiguous. If we're sending (say) 3x3 subblocks of a 6x6 array to 4 processors, the data we're sending has holes in it:
2D Array
---------
|000|222|
|000|222|
|000|222|
|---+---|
|111|333|
|111|333|
|111|333|
---------
Actual layout in memory
[000111000111000111222333222333222333]
(Note that all high-performance computing comes down to understanding the layout of data in memory.)
If we want to send the data that is marked "1" to task 1, we need to skip three values, send three values, skip three values, send three values, skip three values, send three values. A second complication is where the subregions stop and start; note that region "1" doesn't start where region "0" stops; after the last element of region "0", the next location in memory is partway-way through region "1".
Let's tackle the first layout problem first - how to pull out just the data we want to send. We could always just copy out all the "0" region data to another, contiguous array, and send that; if we planned it out carefully enough, we could even do that in such a way that we could call MPI_Scatter on the results. But we'd rather not have to transpose our entire main data structure that way.
So far, all the MPI data types we've used are simple ones - MPI_INTEGER specifies (say) 4 bytes in a row. However, MPI lets you create your own data types that describe arbitrarily complex data layouts in memory. And this case -- rectangular subregions of an array -- is common enough that there's a specific call for that. For the 2-dimensional case we're describing above,
integer :: newtype;
integer, dimension(2) :: sizes, subsizes, starts
sizes = [6,6] ! size of global array
subsizes = [3,3] ! size of sub-region
starts = [0,0] ! let's say we're looking at region "0"
! which begins at offset [0,0]
call MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_INTEGER, newtype, ierr)
call MPI_Type_commit(newtype, ierr)
This creates a type which picks out just the region "0" from the global array. Note that even in Fortran, the start parameter is given as an offset (eg, 0-based) from the start of the array, not an index (eg, 1-based).
We could send just that piece of data now to another processor
call MPI_Send(global, 1, newtype, dest, tag, MPI_COMM_WORLD, ierr) ! send region "0"
and the receiving process could receive it into a local array. Note that the receiving process, if it's only receiving it into a 3x3 array, can not describe what it's receiving as a type of newtype; that no longer describes the memory layout, because there aren't big skips between the end of one row and the start of the next. Instead, it's just receiving a block of 3*3 = 9 integers:
call MPI_Recv(local, 3*3, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, ierr)
Note that we could do this for other sub-regions, too, either by creating a different type (with different start array) for the other blocks, or just by sending starting from the first location of the particular block:
if (rank == root) then
call MPI_Send(global(4,1), 1, newtype, 1, tag, MPI_COMM_WORLD, ierr)
call MPI_Send(global(1,4), 1, newtype, 2, tag, MPI_COMM_WORLD, ierr)
call MPI_Send(global(4,4), 1, newtype, 3, tag, MPI_COMM_WORLD, ierr)
local = global(1:3, 1:3)
else
call MPI_Recv(local, 3*3, MPI_INTEGER, 0, tag, MPI_COMM_WORLD, rstatus, ierr)
endif
Now that we understand how to specify subregions, there's only one more thing to discuss before using scatter/gather operations, and that's the "size" of these types. We couldn't just use MPI_Scatter() (or even scatterv) with these types yet, because these types have an extent of 15 integers; that is, where they end is 15 integers after they start -- and where they end doesn't line up nicely with where the next block begins, so we can't just use scatter - it would pick the wrong place to start sending data to the next processor.
Of course, we could use MPI_Scatterv() and specify the displacements ourselves, and that's what we'll do - except the displacements are in units of the send-type size, and that doesn't help us either; the blocks start at offsets of (0,3,18,21) integers from the start of the global array, and the fact that a block ends 15 integers from where it starts doesn't let us express those displacements in integer multiples at all.
To deal with this, MPI lets you set the extent of the type for the purposes of these calculations. It doesn't truncate the type; it's just used for figuring out where the next element starts given the last element. For types like these with holes in them, it's frequently handy to set the extent to be something smaller than the distance in memory to the actual end of the type.
We can set the extent to be anything that's convenient to us. We could just make the extent 1 integer, and then set the displacements in units of integers. In this case, though, I like to set the extent to be 3 integers - the size of a sub-column - that way, block "1" starts immediately after block "0", and block "3" starts immediately after block "2". Unfortunately, it doesn't quite work as nicely when jumping from block "2" to block "3", but that can't be helped.
So to scatter the subblocks in this case, we'd do the following:
integer(kind=MPI_ADDRESS_KIND) :: extent
starts = [0,0]
sizes = [6, 6]
subsizes = [3, 3]
call MPI_Type_create_subarray(2, sizes, subsizes, starts, &
MPI_ORDER_FORTRAN, MPI_INTEGER, &
newtype, ierr)
call MPI_Type_size(MPI_INTEGER, intsize, ierr)
extent = 3*intsize
call MPI_Type_create_resized(newtype, 0, extent, resizedtype, ierr)
call MPI_Type_commit(resizedtype, ierr)
Here we've created the same block type as before, but we've resized it; we haven't changed where the type "starts" (the 0) but we've changed where it "ends" (3 integers). We didn't mention this before, but the MPI_Type_commit is required to be able to use the type; but you only need to commit the final type you actually use, not any intermediate steps. You use MPI_Type_free to free the committed type when you're done.
So now, finally, we can scatterv the blocks: the data manipulations above are a little complicated, but once it's done, the scatterv looks just like before:
counts = 1 ! we will send one of these new types to everyone
displs = [0,1,6,7] ! the starting point of everyone's data
! in the global array, in block extents
call MPI_Scatterv(global, counts, displs, & ! proc i gets counts(i) types from displs(i)
resizedtype, &
local, 3*3, MPI_INTEGER, & ! I'm receiving 3*3 int
root, MPI_COMM_WORLD, ierr) !... from (root, MPI_COMM_WORLD)
And now we're done, after a little tour of scatter, gather, and MPI derived types.
An example code which shows both the gather and the scatter operation, with character arrays, follows. Running the program:
$ mpirun -np 4 ./scatter2d
global array is:
000222
000222
000222
111333
111333
111333
Rank 0 received:
000
000
000
Rank 1 received:
111
111
111
Rank 2 received:
222
222
222
Rank 3 received:
333
333
333
Rank 0 sending:
111
111
111
Rank 1 sending:
222
222
222
Rank 2 sending:
333
333
333
Rank 3 sending:
444
444
444
Root received:
111333
111333
111333
222444
222444
222444
and the code follows:
program scatter
use mpi
implicit none
integer, parameter :: gridsize = 6 ! size of array
integer, parameter :: procgridsize = 2 ! size of process grid
character, allocatable, dimension (:,:) :: global, local
integer, dimension(procgridsize**2) :: counts, displs
integer, parameter :: root = 0
integer :: rank, comsize
integer :: localsize
integer :: i, j, row, col, ierr, p, charsize
integer, dimension(2) :: sizes, subsizes, starts
integer :: newtype, resizedtype
integer, parameter :: tag = 1
integer, dimension(MPI_STATUS_SIZE) :: rstatus
integer(kind=MPI_ADDRESS_KIND) :: extent, begin
call MPI_Init(ierr)
call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
if (comsize /= procgridsize**2) then
if (rank == root) then
print *, 'Only works with np = ', procgridsize**2, ' for now.'
endif
call MPI_Finalize(ierr)
stop
endif
localsize = gridsize/procgridsize
allocate( local(localsize, localsize) )
if (rank == root) then
allocate( global(gridsize, gridsize) )
forall( col=1:procgridsize, row=1:procgridsize )
global((row-1)*localsize+1:row*localsize, &
(col-1)*localsize+1:col*localsize) = &
achar(ichar('0')+(row-1)+(col-1)*procgridsize)
end forall
print *, 'global array is: '
do i=1,gridsize
print *, global(i,:)
enddo
endif
starts = [0,0]
sizes = [gridsize, gridsize]
subsizes = [localsize, localsize]
call MPI_Type_create_subarray(2, sizes, subsizes, starts, &
MPI_ORDER_FORTRAN, MPI_CHARACTER, &
newtype, ierr)
call MPI_Type_size(MPI_CHARACTER, charsize, ierr)
extent = localsize*charsize
begin = 0
call MPI_Type_create_resized(newtype, begin, extent, resizedtype, ierr)
call MPI_Type_commit(resizedtype, ierr)
counts = 1 ! we will send one of these new types to everyone
forall( col=1:procgridsize, row=1:procgridsize )
displs(1+(row-1)+procgridsize*(col-1)) = (row-1) + localsize*procgridsize*(col-1)
endforall
call MPI_Scatterv(global, counts, displs, & ! proc i gets counts(i) types from displs(i)
resizedtype, &
local, localsize**2, MPI_CHARACTER, & ! I'm receiving localsize**2 chars
root, MPI_COMM_WORLD, ierr) !... from (root, MPI_COMM_WORLD)
do p=1, comsize
if (rank == p-1) then
print *, 'Rank ', rank, ' received: '
do i=1, localsize
print *, local(i,:)
enddo
endif
call MPI_Barrier(MPI_COMM_WORLD, ierr)
enddo
local = achar( ichar(local) + 1 )
do p=1, comsize
if (rank == p-1) then
print *, 'Rank ', rank, ' sending: '
do i=1, localsize
print *, local(i,:)
enddo
endif
call MPI_Barrier(MPI_COMM_WORLD, ierr)
enddo
call MPI_Gatherv( local, localsize**2, MPI_CHARACTER, & ! I'm sending localsize**2 chars
global, counts, displs, resizedtype,&
root, MPI_COMM_WORLD, ierr)
if (rank == root) then
print *, ' Root received: '
do i=1,gridsize
print *, global(i,:)
enddo
endif
call MPI_Type_free(newtype,ierr)
if (rank == root) deallocate(global)
deallocate(local)
call MPI_Finalize(ierr)
end program scatter
So that's the general solution. For your particular case, where we are just appending by rows, we don't need a Gatherv, we can just use a gather, because in this case, all of the displacements are the same -- before, in the 2d block case we had one displacement going 'down', and then jumps in that displacement as you went 'across' to the next column of blocks. Here, the displacement is always one extent from the previous one, so we don't need to give displacements explicitly. So a final code looks like:
program testmpi
use mpi
implicit none
integer, dimension(:,:), allocatable :: send, recv
integer, parameter :: nsendrows = 2, nsendcols = 3
integer, parameter :: root = 0
integer :: ierror, my_rank, comsize, i, j, ierr
integer :: blocktype, resizedtype
integer, dimension(2) :: starts, sizes, subsizes
integer (kind=MPI_Address_kind) :: start, extent
integer :: intsize
call MPI_Init(ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierror)
call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierror)
allocate( send(nsendrows, nsendcols) )
send = my_rank
if (my_rank==root) then
! we're going to append the local arrays
! as groups of send rows
allocate( recv(nsendrows*comsize, nsendcols) )
endif
! describe what these subblocks look like inside the full concatenated array
sizes = [ nsendrows*comsize, nsendcols ]
subsizes = [ nsendrows, nsendcols ]
starts = [ 0, 0 ]
call MPI_Type_create_subarray( 2, sizes, subsizes, starts, &
MPI_ORDER_FORTRAN, MPI_INTEGER, &
blocktype, ierr)
start = 0
call MPI_Type_size(MPI_INTEGER, intsize, ierr)
extent = intsize * nsendrows
call MPI_Type_create_resized(blocktype, start, extent, resizedtype, ierr)
call MPI_Type_commit(resizedtype, ierr)
call MPI_Gather( send, nsendrows*nsendcols, MPI_INTEGER, & ! everyone send 3*2 ints
recv, 1, resizedtype, & ! root gets 1 resized type from everyone
root, MPI_COMM_WORLD, ierr)
if (my_rank==0) then
print*,'<><><><><>recv'
do i=1,nsendrows*comsize
print*,(recv(i,j),j=1,nsendcols)
enddo
endif
call MPI_Finalize(ierror)
end program testmpi
Running this with 3 processes gives:
$ mpirun -np 3 ./testmpi
<><><><><>recv
0 0 0
0 0 0
1 1 1
1 1 1
2 2 2
2 2 2
Here's another code block for any other struggling Fortran beginners out there like myself. It shows two different ways to achieve a MPI_Gather on a nx * ny array divided into M * N blocks, with one block on each process.
One way uses MPI derived datatypes, the other simply sends 1D raw data and sorts if afterward on the main node.
Both produce an ordered M * N 2D array. For the example of 3 x 2 sub-arrays in x * y, each having 4 x 5 elements:
n mpi ranks: 6
rank 0 has topology coords 0,0
rank 1 has topology coords 0,1
rank 2 has topology coords 1,0
rank 3 has topology coords 1,1
rank 4 has topology coords 2,0
rank 5 has topology coords 2,1
1 1 1 1 3 3 3 3 5 5 5 5
1 1 1 1 3 3 3 3 5 5 5 5
1 1 1 1 3 3 3 3 5 5 5 5
1 1 1 1 3 3 3 3 5 5 5 5
1 1 1 1 3 3 3 3 5 5 5 5
0 0 0 0 2 2 2 2 4 4 4 4
0 0 0 0 2 2 2 2 4 4 4 4
0 0 0 0 2 2 2 2 4 4 4 4
0 0 0 0 2 2 2 2 4 4 4 4
0 0 0 0 2 2 2 2 4 4 4 4
Note that if the raw data is sent and not re-arranged it would be ordered by default like so:
5 5 5 5 5 5 5 5 5 5 5 5
4 4 4 4 5 5 5 5 5 5 5 5
4 4 4 4 4 4 4 4 4 4 4 4
3 3 3 3 3 3 3 3 4 4 4 4
3 3 3 3 3 3 3 3 3 3 3 3
2 2 2 2 2 2 2 2 2 2 2 2
1 1 1 1 2 2 2 2 2 2 2 2
1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0
The following code can be compiled with
mpif90 -Wall test.F90 -o test.out
and run with
mpirun -n 6 test.out
program main
use mpi
use, intrinsic :: iso_fortran_env
use iso_c_binding
implicit none
! ===
integer(int64) i, j
integer(int32) nx , ny
integer(int32) nxr , nyr
integer(int32) npx , npy
integer(int32) ri , rj
integer, dimension(2) :: mpi_coords, mpi_coords2
integer rank, n_ranks, ierror
integer rank_cart
integer comm, comm2d
real(real64), dimension(:,:), allocatable :: A, A_global
real(real64), dimension(:), allocatable :: B_global
integer(int32), dimension(:), allocatable :: lengths, displacements
integer(int32) subarraytype, resizedtype
integer(int32) dblsize
integer(kind=MPI_ADDRESS_KIND) :: start, extent
! === MPI interface initialization
call MPI_INIT(ierror)
comm = MPI_COMM_WORLD
call MPI_COMM_SIZE(comm, n_ranks, ierror)
call MPI_COMM_RANK(comm, rank, ierror)
if (rank .eq. 0) then
print '(a, i0)', 'n mpi ranks: ', n_ranks
end if
npx = 3 !! n processes in x
nxr = 4 !! n pts per rank in x
nx = npx*nxr !! n pts x total
npy = 2 !! n processes in y
nyr = 5 !! n pts per rank in y
ny = npy*nyr !! n pts y total
! === check that n_ranks are equal to the hardcoded number of processes in [x,y]
if (n_ranks .ne. npx*npy) then
if (rank .eq. 0) then
print '(a)','n_ranks != npx*npy'
end if
call MPI_Abort(MPI_COMM_WORLD, -1, ierror)
call MPI_Finalize(ierror)
end if
call MPI_BARRIER(comm, ierror)
! === create 2D Cartesian grid
call MPI_Cart_create(comm, 2, (/npx,npy/), (/.true.,.false./), .true., comm2d, ierror)
! === get rank in 2D communicator
call MPI_Comm_rank(comm2d, rank_cart, ierror)
! === get this rank ID and coordinates within the 2D topology
call MPI_Cart_coords(comm2d, rank_cart, 2, mpi_coords, ierror)
! === print topology
if (.true.) then
print '(a,i0,a,i0,a,i0)', 'rank ', rank_cart, ' has topology coords ', mpi_coords(1), ',', mpi_coords(2)
end if
call MPI_BARRIER(comm, ierror)
call flush(6)
! === populate data
allocate( A(nxr,nyr) )
A(:,:) = real(rank,real64)
! ===
if (.true.) then !! use MPI derived types
allocate (lengths(n_ranks))
allocate (displacements(n_ranks))
call MPI_Type_create_subarray(2, (/nx,ny/), (/nxr,nyr/), (/0,0/), &
MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, subarraytype, ierror)
call MPI_Type_size(MPI_DOUBLE_PRECISION, dblsize, ierror)
start = 0
extent = nxr*dblsize
call MPI_Type_create_resized(subarraytype, start, extent, resizedtype, ierror)
call MPI_Type_commit(resizedtype, ierror)
lengths = 1
!! displacements = [ npx*0*nyr+0, npx*1*nyr+0, &
!! npx*0*nyr+1, npx*1*nyr+1, &
!! npx*0*nyr+2, npx*1*nyr+2 ] !! for 3x2
!! displacements = [ npx*0*nyr+0, npx*1*nyr+0, npx*2*nyr+0, &
!! npx*0*nyr+1, npx*1*nyr+1, npx*2*nyr+1 ] !! for 2x3
do i=0,n_ranks-1
call MPI_Cart_coords(comm2d, int(i,int32), 2, mpi_coords2, ierror)
ri = mpi_coords2(1)
rj = mpi_coords2(2)
displacements(i+1) = npx*rj*nyr + ri
end do
if (rank .eq. 0) then
allocate( A_global(nx,ny) )
end if
call MPI_Gatherv( A , nxr*nyr, MPI_DOUBLE_PRECISION, &
A_global, lengths, displacements, resizedtype, &
0, comm2d, ierror )
else !! send raw data in 1D, then re-arrange
if (rank .eq. 0) then
allocate( A_global(nx,ny) )
allocate( B_global(nx*ny) )
end if
call MPI_Gather( A , nxr*nyr, MPI_DOUBLE_PRECISION, &
B_global, nxr*nyr, MPI_DOUBLE_PRECISION, &
0, comm2d, ierror )
if (rank .eq. 0) then
if (.true.) then !! re-arrange data
do i=0,n_ranks-1
call MPI_Cart_coords(comm2d, int(i,int32), 2, mpi_coords2, ierror)
ri = mpi_coords2(1)
rj = mpi_coords2(2)
A_global( (ri+0)*nxr+1:(ri+1)*nxr+1 , &
(rj+0)*nyr+1:(rj+1)*nyr+1 ) &
= &
reshape( B_global( (i+0)*nxr*nyr+1 : (i+1)*nxr*nyr+1 ) , (/nxr,nyr/) )
end do
else !! dont do re-arrange
A_global(:,:) = reshape( B_global , (/nx,ny/) )
end if
end if
if (rank .eq. 0) then
deallocate( B_global )
end if
end if
! ===
!! print the array to the terminal
if (rank .eq. 0) then
do j=ny, 1, -1
do i=1, nx
write(*,'(i0,a)',advance='no') int(A_global(i,j)) , " "
end do
write (*,*) ''
end do
end if
!! save data to binary file
if (rank==0) then
open(3, file=trim("A.dat"), access="stream")
write(3) reshape( A_global , (/nx,ny/) )
close(3)
end if
if (rank .eq. 0) then
deallocate( A_global )
end if
! ===
call MPI_BARRIER(comm, ierror)
call MPI_FINALIZE(ierror)
end program main
A final note: pulling all data to one process (for I/O purposes or otherwise) is typically very bad practice for parallel programming. The moment this is done, the code will lose potential scalability. The method shown here is really only fit for smaller codes or for debugging. For I/O purposes, one should seriously look into collective MPI-I/O or libraries such as HDF5 which allow for collective I/O.