PETSc DMDA Vec values are assigned to the wrong place - c++

I recently started learning PETSc and encountered a problem while trying to accomplish a simple task. What is wrong with this code:
static char help[] = "Test 2d DMDAs Vecs.\n\n";

#include <petscdm.h>
#include <petscdmda.h>
#include <petscsys.h>

PetscReal process_value(int rank, int i) {
  return i*PetscPowScalar(10,rank*2);
}

int main(int argc,char **argv) {
  PetscErrorCode ierr;
  PetscMPIInt    rank;
  PetscInt       M = -5, N = -3;
  DM             da;
  Vec            local, global;

  ierr = PetscInitialize(&argc,&argv,(char*)0,help);CHKERRQ(ierr);
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD,&rank);CHKERRQ(ierr);
  ierr = DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_BOX, M, N, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, &da); CHKERRQ(ierr);
  ierr = DMCreateGlobalVector(da,&global);CHKERRQ(ierr);
  ierr = DMCreateLocalVector(da,&local);CHKERRQ(ierr);
  {
    PetscInt    v, i, j, xm, ym, xs, ys;
    PetscScalar **array;
    ierr = DMDAGetCorners(da, &xs, &ys, 0, &xm, &ym, 0); CHKERRQ(ierr);
    PetscSynchronizedPrintf(PETSC_COMM_WORLD,"%d:xs=%d\txm=%d\tys=%d\tym=%d\n",rank,xs,xm,ys,ym);
    PetscSynchronizedFlush(PETSC_COMM_WORLD,PETSC_STDOUT);
    ierr = DMDAVecGetArray(da, global, &array); CHKERRQ(ierr);
    v = 0;
    for (j = ys; j < ys + ym; j++) {
      for (i = xs; i < xs + xm; i++) {
        array[j][i] = process_value(rank, v+=1);
      }
    }
    ierr = DMDAVecRestoreArray(da, global, &array); CHKERRQ(ierr);
  }
  ierr = VecView(global,PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);
  ierr = VecDestroy(&local);CHKERRQ(ierr);
  ierr = VecDestroy(&global);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}
It fills a small array with values labeled by process rank.
After successful compilation and linking it gives the following result:
> mpiexec -n 2 ./problem
0:xs=0 xm=3 ys=0 ym=3
1:xs=3 xm=2 ys=0 ym=3
Vec Object: 2 MPI processes
type: mpi
Vec Object:Vec_0x84000004_0 2 MPI processes
type: mpi
Process [0]
1.
2.
3.
100.
200.
4.
5.
6.
300.
Process [1]
400.
7.
8.
9.
500.
600.
>
VecView shows that processes have written to places that belong to the other process. Where is the error? Do DMDAVecGetArray/DMDAVecRestoreArray return the wrong array, or is VecView not suitable for viewing a Vec obtained from a DM object?

DMDAVecGetArray() and DMDAVecRestoreArray() work perfectly fine. Indeed, you are dealing with a feature of VecView() described on the PETSc mailing list a few years ago.
As is, VecView() prints the vector from the DMDA using the natural ordering:
proc0 proc1
1 2 | 3 4
5 6 | 7 8
___________
9 10 | 11 12
13 14 | 15 16
proc2 proc3
The difference between the natural ordering and PETSc's ordering is highlighted in the PETSc documentation, in Section 2.5 about structured grids. Here is what PETSc's ordering looks like:
proc0 proc1
1 2 | 5 6
3 4 | 7 8
___________
9 10 | 13 14
11 12 | 15 16
proc2 proc3
As noted in the thread "VecView doesn't work properly with DA global vectors", it is still possible to print the vector using PETSc's ordering by using:
PetscViewerPushFormat(PETSC_VIEWER_STDOUT_WORLD,PETSC_VIEWER_NATIVE);
VecView(global,PETSC_VIEWER_STDOUT_WORLD);
PetscViewerPopFormat(PETSC_VIEWER_STDOUT_WORLD);
Let's take a closer look at the PETSc source to see how this works. The VecView() operation for a DMDA vector is overloaded when DMCreateGlobalVector_DA() is called (see dadist.c): the new method is VecView_MPI_DA() in gr2.c. Unsurprisingly, it calls DMDACreateNaturalVector() and later prints the natural vector using the native VecView(). If the format PETSC_VIEWER_NATIVE is used, the vector interface calls the operation *vec->ops->viewnative, which likely points to the native VecView() function VecView_MPI_ASCII() in pdvec.c. This explains the strange (but very practical!) behavior of VecView for DMDA vectors.
If you wish to keep the natural ordering, you can remove the meaningless Process [0] ... Process [3] headers that get printed by using:
PetscViewerPushFormat(PETSC_VIEWER_STDOUT_WORLD, PETSC_VIEWER_ASCII_COMMON);
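For completeness, here is a minimal sketch (assuming a PETSc version where PetscViewerPushFormat()/PetscViewerPopFormat() are available) of how the viewing part of the question's program could look; it reuses the global and ierr variables from the code above and shows both orderings:

/* natural (i,j) ordering, without the Process [k] headers */
ierr = PetscViewerPushFormat(PETSC_VIEWER_STDOUT_WORLD, PETSC_VIEWER_ASCII_COMMON);CHKERRQ(ierr);
ierr = VecView(global, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);
ierr = PetscViewerPopFormat(PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);

/* PETSc ordering: each process prints the block of entries it owns,
   i.e. exactly what was assigned through DMDAVecGetArray() */
ierr = PetscViewerPushFormat(PETSC_VIEWER_STDOUT_WORLD, PETSC_VIEWER_NATIVE);CHKERRQ(ierr);
ierr = VecView(global, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);
ierr = PetscViewerPopFormat(PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);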


Fortran. do while with function checking the prime number [duplicate]

This question already has answers here:
Does Fortran preserve the value of internal variables through function and subroutine calls?
(3 answers)
Closed 2 years ago.
I've recently been studying Fortran and am trying to write a program that checks whether a number is prime. The function works fine outside a loop: it gives 1 when the given number is prime and 0 otherwise. However, it doesn't work properly when it is used in a do while loop. In the range 2 to 10, it is supposed to give 1 (for 2), 1 (for 3), 0 (for 4), 1 (for 5), 0 (for 6), etc. But it keeps showing only 0. I'm pretty new to programming, so I'm not sure what I'm missing. I know there are many answers related to prime numbers, but I haven't seen this particular issue.
** Function checking prime numbers **
module prime_function
contains
  integer function isPrime(inp_num)
    implicit none
    integer :: inp_num
    integer :: i = 1
    integer :: temp1 = 0

    do while (i < inp_num)
      i = i + 1
      if (mod(inp_num, i) == 0) then
        exit
      end if
    end do

    if (inp_num == i) then
      temp1 = 1
    else
      temp1 = 0
    end if
    isPrime = temp1
  end function
end module

program fortran_q
  use prime_function
  implicit none
  integer :: ii, a

  a = isPrime(10)
  print *, "10 is prime number, so the return : ", a
  a = isPrime(11)
  print *, "11 is prime number, so the return : ", a

  ii = 1
  do while (ii < 10)
    ii = ii + 1
    print *, isPrime(ii)
  end do
end program
** Results **
10 is prime number, so the return : 0
11 is prime number, so the return : 1
0
0
0
0
0
0
0
0
0
You have a classic issue for people new to Fortran. The initialization of i and temp1 in their declarations implies the SAVE attribute. When you call isPrime for the first time, the values are set to 1 and 0. On the next invocation, i and temp1 retain whatever values they had when isPrime last executed. The program below fixes the issue.
module prime_function
  implicit none
  private
  public isprime
contains
  function isPrime(inp_num) result(res)
    integer res
    integer, intent(in) :: inp_num
    integer i, temp1

    i = 1
    temp1 = 0
    do while (i < inp_num)
      i = i + 1
      if (mod(inp_num, i) == 0) exit
    end do
    res = 0
    if (inp_num == i) res = 1
  end function
end module

program fortran_q
  use prime_function
  implicit none
  integer :: ii, a

  a = isPrime(10)
  print *, "10 is prime number, so the return : ", a
  a = isPrime(11)
  print *, "11 is prime number, so the return : ", a

  ii = 1
  do while (ii < 10)
    ii = ii + 1
    print *, isPrime(ii)
  end do
end program
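For readers coming from C or C++, here is a rough analogy (my own illustration, not part of the original answer): a Fortran local variable declared with an initializer keeps its value between calls, exactly like a C++ static local.

// Rough C++ analogy: "integer :: i = 0" in a Fortran procedure behaves
// like the static local below, not like a variable reset on every call.
#include <cstdio>

int counter() {
    static int i = 0;   // initialized once, value persists across calls
    i = i + 1;
    return i;
}

int main() {
    std::printf("%d\n", counter());  // prints 1
    std::printf("%d\n", counter());  // prints 2
    std::printf("%d\n", counter());  // prints 3 -- i is never reset to 0
}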

Counting active tasks using start time and duration in C++

The input consists of a set of tasks given in increasing order of start time, and each task has a certain duration associated.
The first line is number of tasks, for example
3
2 5
4 23
7 4
This means that there are 3 tasks. The first one starts at time 2, and ends at 7 (2+5). Second starts at 4, ends at 27. Third starts at 7, ends at 11.
We assume each task starts as soon as it is ready, and does not need to wait for a processor or anything else to free up.
This means we can keep track of number of active tasks:
Time #tasks
0 - 2 0
2 - 4 1
4 - 11 2
11 - 27 1
I need to find 2 numbers:
Max number of active tasks at any given time (2 in this case) and
Average number of active tasks over the entire duration, computed here as:
[ 0*(2-0) + 1*(4-2) + 2*(11-4) + 1*(27-11) ] / 27
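With the numbers above this works out to (0 + 2 + 14 + 16) / 27 = 32/27 ≈ 1.19 active tasks on average.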
For this, I have first found the time at which all tasks have come to an end, using the code below:
#include "stdio.h"
#include "stdlib.h"
typedef struct
{
long int start;
int dur;
} task;
int main()
{
long int num_tasks, endtime;
long int maxtime = 0;
scanf("%ld",&num_tasks);
task *t = new task[num_tasks];
for (int i=0;i<num_tasks;i++)
{
scanf("%ld %d",&t[i].start,&t[i].dur);
endtime = t[i].start + t[i].dur;
if (endtime > maxtime)
maxtime = endtime;
}
printf("%ld\n",maxtime);
}
Can this be done using priority queues implemented as heaps?
Your question is rather broad, so I am just going to give you a teaser answer that will hopefully get you started; it addresses the first part of your question with a solution that is not necessarily optimized.
In your toy input, you have:
2 5
4 23
7 4
thus you can compute, and store in the array of structs that you have, the end time of each task rather than its duration, for later use. That gives an array like this:
2 7
4 27
7 11
Your array is already sorted by start time (because the input is given in that order), and that's useful. Use std::sort to sort the array if it ever isn't.
Observe how you can check the end time of the first task against the start times of the other tasks. With the right comparison, you can determine how many tasks are active at the same time as the first task: if the end time of the first task is greater than the start time of the second task, those two tasks are active together at some point.
Then you do the same for the comparison of the first task with the third task. After that you know how many tasks were active together with the first task.
Afterwards, you follow the same procedure for the second task, and so on.
Putting all that together in code, we get:
#include "stdio.h"
#include "stdlib.h"
#include <algorithm>
typedef struct {
int start;
int dur;
int end;
} task;
int compare (const task& a, const task& b) {
return ( a.start < b.start );
}
int main() {
int num_tasks;
scanf("%d",&num_tasks);
task *t = new task[num_tasks];
for (int i=0;i<num_tasks;i++) {
scanf("%d %d",&t[i].start,&t[i].dur);
t[i].end = t[i].start + t[i].dur;
}
std::sort(t, t + num_tasks, compare);
for (int i=0;i<num_tasks;i++) {
printf("%d %d\n", t[i].start, t[i].end);
}
int max_noOf_tasks = 0;
for(int i = 0; i < num_tasks - 1; i++) {
int noOf_tasks = 1;
for(int j = i + 1; j < num_tasks; j++) {
if(t[i].end > t[j].start)
noOf_tasks++;
}
if(max_noOf_tasks < noOf_tasks)
max_noOf_tasks = noOf_tasks;
}
printf("Max. number of active tasks: %d\n", max_noOf_tasks);
delete [] t;
}
Output:
2 7
4 27
7 11
Max. number of active tasks: 2
Now, good luck with the second part of your question.
PS: Since this is C++, you could have used a std::vector to store your structs rather than a plain array. That way you would avoid manual dynamic memory allocation too, since the vector takes care of that for you automatically.
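For what it's worth, here is a minimal sketch (my own illustration, not part of the answer above) of the heap idea the question asks about: sweep the tasks in start order and keep the end times of the currently active tasks in a min-heap; the heap size after each insertion is the number of tasks active at that start time. It assumes the same input format and that tasks arrive sorted by start time:

#include <cstdio>
#include <queue>
#include <vector>
#include <functional>

int main() {
    int n;
    if (std::scanf("%d", &n) != 1) return 1;
    // min-heap of end times of the tasks that are still running
    std::priority_queue<long, std::vector<long>, std::greater<long> > ends;
    int max_active = 0;
    for (int i = 0; i < n; i++) {
        long start;
        int dur;
        std::scanf("%ld %d", &start, &dur);
        // discard tasks that finished at or before this start time
        while (!ends.empty() && ends.top() <= start)
            ends.pop();
        ends.push(start + dur);
        if ((int)ends.size() > max_active)
            max_active = (int)ends.size();
    }
    std::printf("Max. number of active tasks: %d\n", max_active);
    return 0;
}

On the sample input (3 tasks: 2 5, 4 23, 7 4) this reports 2, matching the table in the question.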

mpi_gather doesn't return the entire vector with a Fortran derived datatype

I'm running into an issue where mpi_gather only returns a small subset of the vector that I'm trying to pass. Note: I'm running this with np 1, but it also happens with np 2 and np 3. NAT = 3 (nat = number of atoms), and there are 194 unique pairs.
To make this happen, I have two derived data types in Fortran:
type dtlrdh_lut
sequence
integer p
integer q
integer ind
real(dp), dimension(3, 3) :: TLR
real(dp), dimension(3, 3, 3, 3) :: dTLRdh
end type dtlrdh_lut
In my subroutine, I have defined my variables like so:
type(dtlrdh_lut), dimension(:), allocatable :: my_dtlrdh, collected_dtlrdh
integer :: dh_dtype, dr_dtype, dh_types(5), dr_types(6), dh_blocks(5), dr_blocks(6)
INTEGER(kind=MPI_ADDRESS_KIND) :: dh_offsets(5), dr_offsets(6)
integer :: numtasks, rank, ierr, dh_displs(nproc_image), dr_displs(nproc_image)
integer :: n, status(mpi_status_size)
I split up the work between processes in another method, then count how many elements of the lookup table need to be computed on this specific node and allocate the local lookup table like so:
my_num_pairs = 0
do i = 1, num_pairs, 1
if(unique_pairs(i)%cpu.eq.me_image) then
my_num_pairs = my_num_pairs + 1
end if
end do
if(.not.allocated(my_dtlrdh)) allocate(my_dtlrdh(my_num_pairs))
I also allocate and zero out the lookup table that everything will be combined back into with the following code:
if(me_image.eq.root_image) then
if(.not.allocated(collected_dtlrdh)) allocate(collected_dtlrdh(num_pairs))
do i=1,my_num_pairs,1
collected_dtlrdh(i)%p = 0
collected_dtlrdh(i)%q = 0
collected_dtlrdh(i)%ind = 0
collected_dtlrdh(i)%TLR = 0.0_DP
collected_dtlrdh(i)%dTLRdh = 0.0_DP
end do
end if
I then fill in the lookup table, but I'll skip that code; it's long and not relevant. With this done, it's time to gather everything back onto the root process.
call mpi_get_address(my_dtlrdh(1)%p, dh_offsets(1), ierr)
call mpi_get_address(my_dtlrdh(1)%q, dh_offsets(2), ierr)
call mpi_get_address(my_dtlrdh(1)%ind, dh_offsets(3), ierr)
call mpi_get_address(my_dtlrdh(1)%TLR(1,1), dh_offsets(4), ierr)
call mpi_get_address(my_dtlrdh(1)%dTLRdh(1,1,1,1), dh_offsets(5), ierr)
do i = 2, size(dh_offsets)
dh_offsets(i) = dh_offsets(i) - dh_offsets(1)
end do
dh_offsets(1) = 0
dh_types = (/MPI_INTEGER, MPI_INTEGER, MPI_INTEGER, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION/)
dh_blocks = (/1, 1, 1, 3*3, 3*3*3*3/)
call mpi_type_struct(5, dh_blocks, dh_offsets, dh_types, dh_dtype, ierr)
call mpi_type_commit(dh_dtype, ierr)
I then call gather via:
call mpi_gather(my_dtlrdh, my_num_pairs, dh_dtype, &
collected_dtlrdh, my_num_pairs, dh_dtype, &
root_image, intra_image_comm, ierr)
After I gather, I can then print out what everything should look like:
do i = 1, num_pairs, 1
write(stdout, *) my_dtlrdh(i)%p, collected_dtlrdh(i)%p, my_dtlrdh(i)%q, collected_dtlrdh(i)%q
end do
and this is where I see really strange information. The first few elements that are printed out look fine:
1 1 3 3
1 1 6 6
1 1 9 9
But the tail end of my vector looks as if I only sent 174 elements instead of the full 194:
17 0 24 0
18 0 19 0
18 0 20 0
18 0 21 0
18 0 22 0
Given that there are 194 pairs, and both num_pairs and my_num_pairs equal 194, I'm confused. I went ahead and started to play around in desperation, and found that if I use this gather call instead of the one above, I get the full vector:
num_pairs = 2*num_pairs+40
call mpi_gather(my_dtlrdh, num_pairs, dh_dtype, &
collected_dtlrdh, num_pairs, dh_dtype, &
root_image, intra_image_comm, ierr)
which I found just by guessing and checking. While this may work for this small test system, it hardly seems like a scalable solution. I'm totally at a loss... Any ideas? And please let me know if you need any more information from me.
MPI_TYPE_STRUCT is deprecated in favour of MPI_TYPE_CREATE_STRUCT. The latter takes conceptually the same arguments as the former but the array of offsets is of type INTEGER(KIND=MPI_ADDRESS_KIND), i.e. the type returned by MPI_GET_ADDRESS.
You should also consider alignment issues when using arrays of MPI datatypes, because, despite the SEQUENCE attribute, your derived type may be padded by the compiler with some bytes at the end. Thus it is a good idea to resize dh_dtype according to the difference between the output of MPI_GET_ADDRESS() applied to my_dtlrdh(1) and my_dtlrdh(2) by means of the MPI_TYPE_CREATE_RESIZED() subroutine.
This is explained in this lecture on datatypes, on page 41.
However, this is probably not enough to explain your problem.
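For illustration, here is a rough sketch in C of the combination suggested above (MPI_Type_create_struct plus MPI_Type_create_resized). The struct is only an approximation of the dtlrdh_lut type, and real(dp) is assumed to map to MPI_DOUBLE; the Fortran calls are analogous.

/* Sketch: build a struct datatype and resize it to the true extent so that
   arrays of the struct are interpreted with the padding the compiler uses.
   Call this after MPI_Init. */
#include <mpi.h>

typedef struct {
    int    p, q, ind;
    double TLR[3][3];
    double dTLRdh[3][3][3][3];
} dtlrdh_lut;

MPI_Datatype make_dtlrdh_type(void)
{
    dtlrdh_lut   sample[2];
    MPI_Aint     base, disp[5], extent;
    int          blocklens[5] = {1, 1, 1, 9, 81};
    MPI_Datatype types[5] = {MPI_INT, MPI_INT, MPI_INT, MPI_DOUBLE, MPI_DOUBLE};
    MPI_Datatype tmp, dh_dtype;

    MPI_Get_address(&sample[0],        &base);
    MPI_Get_address(&sample[0].p,      &disp[0]);
    MPI_Get_address(&sample[0].q,      &disp[1]);
    MPI_Get_address(&sample[0].ind,    &disp[2]);
    MPI_Get_address(&sample[0].TLR,    &disp[3]);
    MPI_Get_address(&sample[0].dTLRdh, &disp[4]);
    for (int i = 0; i < 5; i++) disp[i] -= base;

    MPI_Type_create_struct(5, blocklens, disp, types, &tmp);

    /* True extent = distance between two consecutive array elements,
       which accounts for any trailing padding. */
    MPI_Get_address(&sample[1], &extent);
    extent -= base;
    MPI_Type_create_resized(tmp, 0, extent, &dh_dtype);
    MPI_Type_commit(&dh_dtype);
    MPI_Type_free(&tmp);
    return dh_dtype;
}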

How to improve upon this?

There are n groups of friends waiting in a queue in front of the bus station. The i-th group consists of ai men. There is also a single bus that works on the route. The size of the bus is x, that is, it can transport x men simultaneously.
When the bus comes to the bus station (it always comes empty), several groups from the head of the queue get into the bus. Of course, groups of friends don't want to split up, so a group boards only if the bus can hold the whole group. On the other hand, no one wants to lose their position, that is, the order of groups never changes.
The question is: how to choose the size x of the bus in such a way that the bus can transport all the groups, and every time the bus moves off from the bus station there is no empty space in it (the total number of men inside equals x)?
Input Format:
The first line contains the only integer n (1≤n≤10^5). The second line contains n space-separated integers a1,a2,…,an (1≤ai≤10^4).
Output Format:
Print all the possible sizes of the bus in the increasing order.
Sample:
8
1 2 1 1 1 2 1 3
Output:
3 4 6 12
I made this code:
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main(void)
{
    int max = 0, sum = 0, i, n;
    cin >> n;
    int values[100000];
    for ( i = 0; i < n; i++ )
    {
        cin >> values[i];
        sum = sum + values[i];
        if ( values[i] > max )
            max = values[i];
    }
    int p = 0, j;
    int count = 0;
    int flag, garb;   // flag marks an invalid candidate size, garb tracks the remaining bus capacity
    vector<int> final;
    for ( i = 0; i < n; i++ )
    {
        p = p + values[i];
        j = 0;
        if ( p >= max && sum % p == 0 )
        {
            flag = 0;
            while ( j < n )
            {
                garb = p;
                while ( garb != 0 )
                {
                    garb = garb - values[j++];
                    if ( garb < 0 )
                        flag = 1;
                }
            }
            if ( flag == 0 )
            {
                final.push_back(p);
                count++;
            }
        }
    }
    sort(final.begin(), final.end());
    for ( j = 0; j < count; j++ )
    {
        cout << final[j] << "\t";
    }
    return 0;
}
Edit: In this code I am basically checking whether each found divisor satisfies the condition, and if at any point I get a negative number when taking the difference with the values, I mark it using a flag. However, it now seems to give me a seg fault. Why?
I first calculated the maximum of all the values, and then checked whether it is a divisor of the sum of the values. However, this approach doesn't work for an input such as:
10
2 2 1 1 1 1 1 2 1 2
My output is
2 7 14
whereas the output should be
7 14
only.
Any other approach that I can go with?
Thanks!
I can think of the following simple solution (since your present concern is correctness and not time complexity); a sketch implementing it follows the steps below:
Calculate the sum of all ai's (as you are already doing).
Calculate the maximum of all ai's (as you are already doing).
Find all the factors of sum that are >= max(ai).
For each factor, iterate through the ai's and check whether the bus condition is satisfied.
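Here is a minimal sketch of those four steps (my own illustration, not optimized, variable names are mine): take every divisor of the total sum that is at least the largest group, then simulate the queue and keep only the sizes for which the bus always leaves exactly full.

#include <cstdio>
#include <vector>
#include <algorithm>

int main() {
    int n;
    if (std::scanf("%d", &n) != 1) return 1;
    std::vector<int> a(n);
    long long sum = 0;
    int maxGroup = 0;
    for (int &v : a) {
        std::scanf("%d", &v);
        sum += v;
        maxGroup = std::max(maxGroup, v);
    }
    std::vector<long long> sizes;
    for (long long x = maxGroup; x <= sum; ++x) {
        if (sum % x != 0) continue;                  // step 3: x must divide the total
        long long load = 0;
        bool ok = true;
        for (int v : a) {                            // step 4: simulate the queue
            if (load + v > x) { ok = false; break; } // bus would have to leave with empty space
            load += v;
            if (load == x) load = 0;                 // bus departs exactly full
        }
        if (ok && load == 0) sizes.push_back(x);
    }
    for (long long x : sizes) std::printf("%lld ", x);
    std::printf("\n");
    return 0;
}

On the sample input 1 2 1 1 1 2 1 3 this prints 3 4 6 12, and on 2 2 1 1 1 1 1 2 1 2 it prints 7 14, matching the expected outputs above.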

Calling BLACS with more processes than used

I want to create a parallel program which makes heavy use of SCALAPACK. The basis of SCALAPACK is BLACS, which itself relies on MPI for interprocess communication.
I want to start the program with a defined number of processes (e.g. the number of cores on the machine) and let the algorithm decide how to use these processes for the calculations.
As a test case I wanted to use 10 processes. 9 of these processes should be arranged in a grid (BLACS_GRIDINIT) and the 10th process should wait until the other processes are finished.
Unfortunately, OpenMPI crashes because the last process doesn't get an MPI context from BLACS, while the others do.
Question: What is the correct way to use BLACS with more processes than needed?
I did some experiments with additional MPI_INIT and MPI_FINALIZE calls, but none of my attempts were successful.
I started with the sample code from Intel MKL (shortened a little bit):
      PROGRAM HELLO
*  -- BLACS example code --
*     Written by Clint Whaley 7/26/94
*     Performs a simple check-in type hello world
*     ..
*     .. External Functions ..
      INTEGER BLACS_PNUM
      EXTERNAL BLACS_PNUM
*     ..
*     .. Variable Declaration ..
      INTEGER CONTXT, IAM, NPROCS, NPROW, NPCOL, MYPROW, MYPCOL
      INTEGER ICALLER, I, J, HISROW, HISCOL
*     Determine my process number and the number of processes in
*     machine
      CALL BLACS_PINFO(IAM, NPROCS)
*     Set up process grid that is as close to square as possible
      NPROW = INT( SQRT( REAL(NPROCS) ) )
      NPCOL = NPROCS / NPROW
*     Get default system context, and define grid
      CALL BLACS_GET(0, 0, CONTXT)
      CALL BLACS_GRIDINIT(CONTXT, 'Row', NPROW, NPCOL)
      CALL BLACS_GRIDINFO(CONTXT, NPROW, NPCOL, MYPROW, MYPCOL)
*     If I'm not in grid, go to end of program
      IF ( (MYPROW.GE.NPROW) .OR. (MYPCOL.GE.NPCOL) ) GOTO 30
*     Get my process ID from my grid coordinates
      ICALLER = BLACS_PNUM(CONTXT, MYPROW, MYPCOL)
*     If I am process {0,0}, receive check-in messages from
*     all nodes
      IF ( (MYPROW.EQ.0) .AND. (MYPCOL.EQ.0) ) THEN
         WRITE(*,*) ' '
         DO 20 I = 0, NPROW-1
            DO 10 J = 0, NPCOL-1
               IF ( (I.NE.0) .OR. (J.NE.0) ) THEN
                  CALL IGERV2D(CONTXT, 1, 1, ICALLER, 1, I, J)
               END IF
*              Make sure ICALLER is where we think in process grid
               CALL BLACS_PCOORD(CONTXT, ICALLER, HISROW, HISCOL)
               IF ( (HISROW.NE.I) .OR. (HISCOL.NE.J) ) THEN
                  WRITE(*,*) 'Grid error! Halting . . .'
                  STOP
               END IF
               WRITE(*, 3000) I, J, ICALLER
   10       CONTINUE
   20    CONTINUE
         WRITE(*,*) ' '
         WRITE(*,*) 'All processes checked in. Run finished.'
*     All processes but {0,0} send process ID as a check-in
      ELSE
         CALL IGESD2D(CONTXT, 1, 1, ICALLER, 1, 0, 0)
      END IF
   30 CONTINUE
      CALL BLACS_EXIT(0)
 1000 FORMAT('How many processes in machine?')
 2000 FORMAT(I)
 3000 FORMAT('Process {',i2,',',i2,'} (node number =',I,
     $       ') has checked in.')
      STOP
      END
Update: I investigated the source code of BLACS to see what happens there.
The call to BLACS_PINFO initializes the MPI context with MPI_INIT if this hasn't happened before. This means that at this point everything works as expected.
At the end, the call to BLACS_EXIT(0) should free all resources from BLACS, and if the argument is 0 it should also call MPI_FINALIZE. Unfortunately, this doesn't work as expected and my last process doesn't call MPI_FINALIZE.
As a workaround, one could query MPI_FINALIZED and call MPI_FINALIZE if necessary.
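In C terms, the workaround would look something like this (a sketch; the Fortran bindings are analogous):

/* After BLACS_EXIT(0), make sure MPI really has been finalized on this rank. */
#include <mpi.h>

static void finalize_if_needed(void)
{
    int finalized = 0;
    MPI_Finalized(&finalized);
    if (!finalized) MPI_Finalize();
}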
Update 2: My previous attempts were made with Intel Studio 2013.0.079 and OpenMPI 1.6.2 on SUSE Linux Enterprise Server 11.
After reading ctheo's answer, I tried to compile this example with the tools given by Ubuntu 12.04 (gfortran 4.6.3, OpenMPI 1.4.3, BLACS 1.1) and was successful.
My conclusion is that Intel's implementation appears to be buggy. I will retry this example in the near future with the newest service release of Intel Studio, but I don't expect any changes.
However, I would appreciate any other (and maybe better) solution.
I don't know the answer, and I would hazard a guess that the overlap between the set of people who participate in SO and the set of people who know the answer to your question has fewer than one member. However, I'd suggest that you might have slightly better luck asking on scicomp or by contacting the ScaLAPACK team at the University of Tennessee directly through their support page. Good luck!
I don't think you need to do much to use fewer processes in SCALAPACK.
The BLACS_PINFO subroutine returns the total number of processes.
If you want to use one fewer, just do NPROCS = NPROCS - 1.
I used your sample code (fixing some typos in the FORMAT statements), added the subtraction and got the following output:
$ mpirun -n 4 ./a.out
Process { 0, 0} (node number = 0) has checked in.
Process { 0, 1} (node number = 1) has checked in.
Process { 0, 2} (node number = 2) has checked in.
All processes checked in. Run finished.
BLACS_GRIDINIT then creates a grid with the reduced NPROCS.
When BLACS_GRIDINFO is called, the one left-over process gets MYPROW = MYPCOL = -1.
On the other hand, if you want to create multiple grids that use different processes, then you should probably use the BLACS_GRIDMAP subroutine. The sample code below creates two equal grids, each using half of the total processes.
      PROGRAM HELLO
*     ..
      INTEGER CONTXT(2), IAM, NPROCS, NPROW, NPCOL, MYPROW, MYPCOL
      INTEGER ICALLER, I, J, HISROW, HISCOL
      integer UMAP(2,10,10)
*
      CALL BLACS_PINFO(IAM, NPROCS)
      NPROCS = NPROCS/2
*
      NPROW = INT( SQRT( REAL(NPROCS) ) )
      NPCOL = NPROCS / NPROW
*
      DO IP = 1, 2
        DO I = 1, NPROW
          DO J = 1, NPCOL
            UMAP(IP,I,J) = (IP-1)*NPROCS+(I-1)*NPCOL+(J-1)
          ENDDO
        ENDDO
        CALL BLACS_GET(0, 0, CONTXT(IP))
        CALL BLACS_GRIDMAP(CONTXT(IP), UMAP(IP,:,:), 10, NPROW, NPCOL )
      ENDDO
*
      DO IP = 1, 2
        CALL BLACS_GRIDINFO(CONTXT(IP), NPROW, NPCOL, MYPROW, MYPCOL)
        IF(MYPROW.GE.0 .AND. MYPCOL.GE.0 ) THEN
          WRITE(*,1000) IAM, MYPROW, MYPCOL, IP
        END IF
      ENDDO
      CALL BLACS_EXIT(0)
 1000 FORMAT('Process ',i2,' is (',i2,','i2 ') of grid ',i2)
*
      STOP
      END
I got the following output:
$ mpirun -n 8 ./a.out
Process 0 is ( 0, 0) of grid 1
Process 1 is ( 0, 1) of grid 1
Process 2 is ( 1, 0) of grid 1
Process 3 is ( 1, 1) of grid 1
Process 4 is ( 0, 0) of grid 2
Process 5 is ( 0, 1) of grid 2
Process 6 is ( 1, 0) of grid 2
Process 7 is ( 1, 1) of grid 2
I did not collect the data on process zero, so you may only get the output in this order if all processes are local.