I am trying to understand how to use OpenMP sections. The program listed below is extracted from one of the LLNL tutorials; its explanation states: 'Simple program demonstrating that different blocks of work will be done by different threads.'
!!compile with: gfortran -fopenmp -o omp_worksections omp_worksections.f90
!! also need: export OMP_NUM_THREADS=2 (or 3 or 4)
PROGRAM WORKSECTIONS
  INTEGER N, I, NTHREADS, TID, OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
  PARAMETER (N=1000)
  REAL A(N), B(N), C(N), D(N)

  ! Some initializations
  DO I = 1, N
     A(I) = I * 1.5
     B(I) = I + 22.35
     C(I) = 0.0
     D(I) = 0.0
  ENDDO

!$OMP PARALLEL SHARED(A,B,C,D,NTHREADS), PRIVATE(I,TID)
  TID = OMP_GET_THREAD_NUM()
  IF (TID .EQ. 0) THEN
     NTHREADS = OMP_GET_NUM_THREADS()
     PRINT *, 'Number of threads =', NTHREADS
  END IF
  PRINT *, 'Thread', TID, ' starting...'

!$OMP SECTIONS

!$OMP SECTION
  PRINT *, 'Thread', TID, ' doing section 1'
  DO I = 1, N
     C(I) = A(I) + B(I)
     if (i.lt.10) then
        WRITE(*,100) TID, I, C(I)
     end if
100  FORMAT(' Thread',I2,': C(',I2,')=',F8.2)
  ENDDO

!$OMP SECTION
  PRINT *, 'Thread', TID, ' doing section 2'
  DO I = 1, N
     if (i.lt.10) then
        D(I) = A(I) * B(I)
        WRITE(*,200) TID, I, D(I)
200     FORMAT(' Thread',I2,': D(',I2,')=',F8.2)
     endif
  ENDDO

!$OMP END SECTIONS NOWAIT

  PRINT *, 'Thread', TID, ' done.'

!$OMP END PARALLEL
END PROGRAM WORKSECTIONS
When I compile and run, the result is:
Number of threads = 2
Thread 0 starting...
Thread 0 doing section 1
Thread 0: C( 1)= 24.85
Thread 0: C( 2)= 27.35
Thread 0: C( 3)= 29.85
Thread 0: C( 4)= 32.35
Thread 0: C( 5)= 34.85
Thread 0: C( 6)= 37.35
Thread 0: C( 7)= 39.85
Thread 0: C( 8)= 42.35
Thread 0: C( 9)= 44.85
Thread 1 starting...
Thread 0 doing section 2
Thread 0: D( 1)= 35.03
Thread 0: D( 2)= 73.05
Thread 0: D( 3)= 114.08
Thread 0: D( 4)= 158.10
Thread 0: D( 5)= 205.12
Thread 0: D( 6)= 255.15
Thread 0: D( 7)= 308.18
Thread 0: D( 8)= 364.20
Thread 0: D( 9)= 423.23
Thread 0 done.
Thread 1 done.
It seems that thread 0 does both section 1 and section 2? I was expecting the prints from the two sections to be interleaved, with one thread doing section 1 and the other doing section 2.
I have tried removing the NOWAIT clause from the END SECTIONS directive, and removing C and D from the SHARED clause of the PARALLEL directive, to no avail.
I am obviously missing some central piece of the puzzle.
It appears that when the OpenMP runtime library looks for a free thread for the second section, it finds thread 0 free again, because there is too little work to do in the first section, so it assigns that section to thread 0 as well.
Try a larger N, for example 100000:
Number of threads = 2
Thread 0 starting...
Thread 0 doing section 1
Thread 0: C( 1)= 24.85
Thread 0: C( 2)= 27.35
Thread 0: C( 3)= 29.85
Thread 0: C( 4)= 32.35
Thread 0: C( 5)= 34.85
Thread 1 starting...
Thread 1 doing section 2
Thread 1: D( 1)= 35.03
Thread 1: D( 2)= 73.05
Thread 1: D( 3)= 114.08
Thread 1: D( 4)= 158.10
Thread 1: D( 5)= 205.12
Thread 1: D( 6)= 255.15
Thread 1: D( 7)= 308.18
Thread 1: D( 8)= 364.20
Thread 1: D( 9)= 423.23
Thread 0: C( 6)= 37.35
Thread 0: C( 7)= 39.85
Thread 0: C( 8)= 42.35
Thread 0: C( 9)= 44.85
Thread 1 done.
Thread 0 done.
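For what it's worth, the same behaviour is easy to reproduce in C. A minimal sketch of mine (not part of the original question or answer), with the same tiny amount of work per section:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    static float a[N], b[N], c[N], d[N];
    for (int i = 0; i < N; ++i) {
        a[i] = i * 1.5f;
        b[i] = i + 22.35f;
    }
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        /* With so little work, whichever thread reaches the SECTIONS
           construct first may well execute both sections. */
        #pragma omp sections nowait
        {
            #pragma omp section
            {
                printf("Thread %d doing section 1\n", tid);
                for (int i = 0; i < N; ++i) c[i] = a[i] + b[i];
            }
            #pragma omp section
            {
                printf("Thread %d doing section 2\n", tid);
                for (int i = 0; i < N; ++i) d[i] = a[i] * b[i];
            }
        }
        printf("Thread %d done.\n", tid);
    }
    return 0;
}

Run with OMP_NUM_THREADS=2 and a small N, one thread will often print both 'doing section' lines; make the per-section work large enough and the two sections are usually picked up by different threads.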
I use MPI_Probe to determine the size of a dynamic array; I just pass the size as the tag. But I send two arrays. The basic structure is:
call MPI_Isend(message, destination, MPI_INT, size, COMM, status, error)
call MPI_Isend(message2, destination, MPI_DOUBLE, size*3, COMM, status, error)
...
call MPI_Probe(sender, MPI_ANY_TAG, COMM, status, error)
size1 = status(MPI_TAG)
call MPI_Probe(sender, MPI_ANY_TAG, COMM, status, error)
size2 = status(MPI_TAG)
actual_size = MIN(size1, size2)
call MPI_Recv(size)
call MPI_Recv(size*3)
So this doesn't work, because MPI_Probe will just match the same message and return the same value twice. Any idea how to cycle through the different probes, or something similar?
If this doesn't work I plan to change my code around to just have send - recv - send - recv instead of send - send - recv - recv. Just checking if someone has better solutions.
As stated in the comments, you shouldn't use the tag to send size data from one process to another: the values a tag can take have an upper bound, MPI_TAG_UB, which in theory could be quite small (the standard only guarantees 32767), potentially not allowing you to communicate a large enough integer. In fact it's bad practice to use the tag to transmit information at all; you should use the message data, that's what it is for, after all. Victor Eijkhout has the right way: inquire of the status argument, via MPI_GET_COUNT, how many things are being transmitted, and use that to allocate the dynamic array. Here is an example, which also uses the correct handles for the datatypes; you are using the C variants rather than the Fortran ones:
ijb#ijb-Latitude-5410:~/work/stack$ cat probe.f90
Program probe

  Use, Intrinsic :: iso_fortran_env, Only : stdout => output_unit

  Use mpi_f08, Only : mpi_status, mpi_comm_world, mpi_integer, &
       mpi_init, mpi_finalize, mpi_comm_size, mpi_comm_rank, &
       mpi_send, mpi_probe, mpi_get_count, mpi_recv

  Implicit None

  Type( mpi_status ) :: status
  Real :: rnd
  Integer, Dimension( : ), Allocatable :: stuff
  Integer :: nprc, rank
  Integer :: n
  Integer :: error
  Integer :: i

  Call mpi_init( error )
  Call mpi_comm_size( mpi_comm_world, nprc, error )
  Call mpi_comm_rank( mpi_comm_world, rank, error )

  If( rank == 0 ) Then
     Write( stdout, * ) 'Running on ', nprc, ' procs'
  End If

  If( rank == 0 ) Then
     ! On rank zero generate a random sized array
     Call Random_number( rnd )
     n = Int( 10.0 * rnd + 1 )
     Write( stdout, * ) 'Allocating ', n, ' elements on rank 0'
     Allocate( stuff( 1:n ) )
     stuff = [ ( i, i = 1, n ) ]
     Write( stdout, '( "Data on proc 0: ", *( i0, 1x ) )' ) stuff
     Call mpi_send( stuff, Size( stuff ), mpi_integer, 1, 10, &
          mpi_comm_world, error )
  Else If( rank == 1 ) Then
     ! On rank 1 probe the message to get the status
     Call mpi_probe( 0, 10, mpi_comm_world, status, error )
     ! From the status find how many things are being sent
     Call mpi_get_count( status, mpi_integer, n, error )
     ! Use that to allocate the array
     Allocate( stuff( 1:n ) )
     ! And recv the data
     Call mpi_recv( stuff, Size( stuff ), mpi_integer, 0, 10, &
          mpi_comm_world, status, error )
     Write( stdout, * ) 'Recvd ', n, ' integers on proc 1'
     Write( stdout, '( "Data on proc 1: ", *( i0, 1x ) )' ) stuff
  Else
     Write( stdout, * ) 'Busy doing nothing ... ', rank
  End If

  Write( stdout, * ) 'done', rank

  Call mpi_finalize( error )

End Program probe
ijb#ijb-Latitude-5410:~/work/stack$ mpif90 --version
GNU Fortran (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb#ijb-Latitude-5410:~/work/stack$ mpif90 -std=f2018 -Wall -Wextra -pedantic -fcheck=all -fbacktrace -Wuse-without-only -Werror -g probe.f90
ijb#ijb-Latitude-5410:~/work/stack$ mpirun -np 2 ./a.out
Running on 2 procs
Allocating 8 elements on rank 0
Data on proc 0: 1 2 3 4 5 6 7 8
done 0
Recvd 8 integers on proc 1
Data on proc 1: 1 2 3 4 5 6 7 8
done 1
ijb#ijb-Latitude-5410:~/work/stack$ mpirun -np 2 ./a.out
Running on 2 procs
Allocating 5 elements on rank 0
Data on proc 0: 1 2 3 4 5
done 0
Recvd 5 integers on proc 1
Data on proc 1: 1 2 3 4 5
done 1
ijb#ijb-Latitude-5410:~/work/stack$ mpirun -np 2 ./a.out
Running on 2 procs
Allocating 2 elements on rank 0
Data on proc 0: 1 2
done 0
Recvd 2 integers on proc 1
Data on proc 1: 1 2
done 1
ijb#ijb-Latitude-5410:~/work/stack$ mpirun -np 2 ./a.out
Recvd 1 integers on proc 1
Data on proc 1: 1
done 1
Running on 2 procs
Allocating 1 elements on rank 0
Data on proc 0: 1
done 0
ijb#ijb-Latitude-5410:~/work/stack$ mpirun -np 2 ./a.out
Running on 2 procs
Allocating 3 elements on rank 0
Data on proc 0: 1 2 3
done 0
Recvd 3 integers on proc 1
Data on proc 1: 1 2 3
done 1
ijb#ijb-Latitude-5410:~/work/stack$
I used the following OpenMP code:
omp_set_num_threads(6);
#pragma omp parallel for
for (int i = 0; i < NUMS; ++i) {
    printf("id is %3d thread is %d\n", i, omp_get_thread_num());
}
I found out (in a blog post) that each thread will be allocated an approximately equal number of iterations and that, when the number of iterations is not divisible by the number of threads, the share is rounded up.
Well, I first set NUMS=17; the result is as follows:
id is 12 thread is 4
id is 13 thread is 4
id is 14 thread is 4
id is 9 thread is 3
id is 10 thread is 3
id is 11 thread is 3
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 6 thread is 2
id is 7 thread is 2
id is 8 thread is 2
id is 15 thread is 5
id is 16 thread is 5
id is 3 thread is 1
id is 4 thread is 1
id is 5 thread is 1
As can be seen, ⌈17/6⌉ = 3 (rounding up), so the result is as expected.
However, if I set NUMS=19, then rounding up gives ⌈19/6⌉ = 4, so each thread should be allocated 4 iterations. Instead I get:
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 3 thread is 0
id is 10 thread is 3
id is 11 thread is 3
id is 12 thread is 3
id is 7 thread is 2
id is 8 thread is 2
id is 9 thread is 2
id is 13 thread is 4
id is 14 thread is 4
id is 15 thread is 4
id is 4 thread is 1
id is 5 thread is 1
id is 6 thread is 1
id is 16 thread is 5
id is 17 thread is 5
id is 18 thread is 5
As you can see, only the first one is assigned 4 iterations.
So I can't figure it out: what exactly is the reason for this? What exactly is OpenMP's default thread scheduling?
Summarising all the comments to create an answer (so thanks to all who commented).
First, what the standard says/requires :-
It says nothing about which schedule should be used when it is unspecified.
schedule(static) with no chunk_size is only specified as allocating "approximately equal" numbers of iterations to each available thread.
Second, what happens in reality at present :-
Compilers default to using schedule(static) when no schedule is specified. (Though schedule(nonmonotonic:dynamic) might, now, be a better choice.)
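If you do not want to rely on the implementation default, you can name a schedule explicitly on the directive. A minimal sketch (my addition, reusing the loop from the question):

#include <stdio.h>
#include <omp.h>

#define NUMS 19

int main(void) {
    omp_set_num_threads(6);
    /* Ask for a schedule explicitly instead of relying on the default;
       nonmonotonic:dynamic (OpenMP 4.5+) lets the runtime balance the
       iterations rather than fixing the split in advance. */
    #pragma omp parallel for schedule(nonmonotonic:dynamic)
    for (int i = 0; i < NUMS; ++i) {
        printf("id is %3d thread is %d\n", i, omp_get_thread_num());
    }
    return 0;
}

With a dynamic schedule the mapping of iterations to threads naturally varies from run to run; if you leave the schedule unspecified you get the compiler's static default, whose allocation is described next.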
At least the LLVM OpenMP runtime allocates iterations to threads as this Python code shows (the critical part is myCount, the rest is just to test it and show your test cases).
#
# Show number of iterations allocated to each thread
# by schedule(static) in the LLVM runtime
#
def myCount(me, numThreads, total):
    base = total // numThreads
    remainder = total % numThreads
    return base + 1 if me < remainder else base

def test(numThreads, total):
    print("Threads: ", numThreads, " Iterations: ", total)
    allocated = 0
    for thread in range(0, numThreads):
        mine = myCount(thread, numThreads, total)
        allocated += mine
        print(thread, ": ", mine)
    if allocated != total:
        print("***ERROR*** ", allocated, " allocated, ", total, " requested")

test(6, 17)
test(6, 19)
If you run that you can see the result for your two test cases:-
% python3 static.py
Threads: 6 Iterations: 17
0 : 3
1 : 3
2 : 3
3 : 3
4 : 3
5 : 2
Threads: 6 Iterations: 19
0 : 4
1 : 3
2 : 3
3 : 3
4 : 3
5 : 3
If you want to get into the full horror of loop scheduling, there is a whole chapter on this in "High Performance Parallel Runtimes -- Design and Implementation"
P.S. It's worth noting that the schedule shown above cannot be requested explicitly by setting a chunk_size on a static schedule, since the standard does not then allow the remainder iterations to be split up as they are here. E.g. if we try to allocate 10 iterations to 4 threads, schedule(static,2) gives (4,2,2,2) and schedule(static,3) gives (3,3,3,1), whereas the scheme above gives (3,3,2,2): an imbalance of 1 instead of 2 for either explicit chunk size.
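To see that difference, here is a small sketch of mine (in C rather than the Python above) that deals chunks out to threads round-robin, the way schedule(static, chunk_size) does, reproducing the (4,2,2,2) and (3,3,3,1) splits mentioned above:

#include <stdio.h>

/* Iterations a given thread receives under schedule(static, chunk):
   chunks of size `chunk` are handed to threads in round-robin order. */
static int staticChunkCount(int me, int numThreads, int total, int chunk) {
    int count = 0;
    for (int start = me * chunk; start < total; start += numThreads * chunk) {
        int end = start + chunk > total ? total : start + chunk;
        count += end - start;
    }
    return count;
}

int main(void) {
    /* 10 iterations on 4 threads, as in the example above */
    for (int chunk = 2; chunk <= 3; ++chunk) {
        printf("chunk_size %d:", chunk);
        for (int thread = 0; thread < 4; ++thread) {
            printf(" %d", staticChunkCount(thread, 4, 10, chunk));
        }
        printf("\n");
    }
    return 0;
}

Compiled and run, it prints 4 2 2 2 for chunk_size 2 and 3 3 3 1 for chunk_size 3, matching the imbalance argument above.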
I want to add a contention condition to this code in order to understand the nature of threads, but I am not able to achieve it with this code.
#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>
#include <omp.h>

sem_t rsem;
sem_t wsem;
int readCount = 0;
int writeCount = 0;

void *Reader(void *arg);
void *Writer(void *arg);

int main()
{
    int i = 0, NumberofReaderThread = 0, NumberofWriterThread = 0;
    sem_init(&rsem, 0, 1);
    sem_init(&wsem, 0, 1);
    pthread_t Readers_thr[100], Writer_thr[100];
    printf("\nEnter number of Readers thread(MAX 10) : ");
    scanf("%d", &NumberofReaderThread);
    printf("\nEnter number of Writers thread(MAX 10) : ");
    scanf("%d", &NumberofWriterThread);
    for (i = 0; i < NumberofReaderThread; i++)
    {
        pthread_create(&Readers_thr[i], NULL, Reader, (void *)i);
    }
    for (i = 0; i < NumberofWriterThread; i++)
    {
        pthread_create(&Writer_thr[i], NULL, Writer, (void *)i);
    }
    for (i = 0; i < NumberofWriterThread; i++)
    {
        pthread_join(Writer_thr[i], NULL);
    }
    for (i = 0; i < NumberofReaderThread; i++)
    {
        pthread_join(Readers_thr[i], NULL);
    }
    /* A second round: two more readers and two more writers */
    i = 0;
    pthread_create(&Readers_thr[i], NULL, Reader, (void *)i);
    pthread_create(&Writer_thr[i], NULL, Writer, (void *)i);
    i = 1;
    pthread_create(&Readers_thr[i], NULL, Reader, (void *)i);
    pthread_create(&Writer_thr[i], NULL, Writer, (void *)i);
    i = 0;
    pthread_join(Writer_thr[i], NULL);
    pthread_join(Readers_thr[i], NULL);
    i = 1;
    pthread_join(Writer_thr[i], NULL);
    pthread_join(Readers_thr[i], NULL);
    sem_destroy(&rsem);
    sem_destroy(&wsem);
    return 0;
}

void *Writer(void *arg)
{
    int TID, nthreads, temp;
    /* Each pthread opens its own OpenMP parallel region here */
    #pragma omp parallel default(none), private(TID), shared(nthreads, wsem, rsem, temp, arg, writeCount)
    {
        //sleep(3);
        TID = omp_get_thread_num();
        if (TID == 0)
        {
            nthreads = omp_get_num_threads();
            printf("\n Number of threads = (%d) \n", nthreads);
        }
        temp = (int)arg;
        printf("\n Thread %d :: Writer %d is trying to enter into database ", TID, temp);
        sem_wait(&wsem);
        writeCount++;
        if (writeCount == 1) sem_wait(&rsem);
        printf("\n Thread %d :: Writer %d is writing into the database ", TID, temp);
        //sleep(5);
        printf("\n Thread %d :: Writer %d is leaving the database ", TID, temp);
        sem_post(&wsem);
        sem_post(&rsem);
    }
    return NULL;
}

void *Reader(void *arg)
{
    int TID, nthreads, temp;
    /* Each pthread opens its own OpenMP parallel region here */
    #pragma omp parallel default(none), private(TID), shared(nthreads, wsem, rsem, readCount, temp, arg)
    {
        //sleep(3);
        TID = omp_get_thread_num();
        if (TID == 0)
        {
            nthreads = omp_get_num_threads();
            printf("\n Number of threads = (%d) \n", nthreads);
        }
        temp = (int)arg;
        printf("\n Thread %d :: Reader %d is trying to enter into the Database ", TID, temp);
        sem_wait(&rsem);
        readCount++;
        if (readCount == 1)
        {
            sem_wait(&wsem);
            //sleep(5);
            printf("\n Thread %d :: Reader %d is reading the database ", TID, temp);
        }
        sem_post(&rsem);
        printf("\n Thread %d :: Reader %d is reading the database ", TID, temp);
        sem_wait(&rsem);
        readCount--;
        printf("\nThread %d :: Reader %d is leaving the database ", TID, temp);
        if (readCount == 0)
        {
            sem_post(&wsem);
        }
        sem_post(&rsem);
    }
    return NULL;
}
Enter number of Readers thread(MAX 10) : 1
Enter number of Writers thread(MAX 10) : 1
Thread 3 :: Reader 0 is trying to enter into the Database
Thread 3 :: Reader 0 is reading the database
Thread 2 :: Reader 0 is trying to enter into the Database
Thread 3 :: Writer 0 is trying to enter into database
Thread 1 :: Writer 0 is trying to enter into database
Thread 1 :: Reader 0 is trying to enter into the Database
Number of threads = (4)
Thread 0 :: Writer 0 is trying to enter into database
Thread 2 :: Writer 0 is trying to enter into database
Number of threads = (4)
Thread 0 :: Reader 0 is trying to enter into the Database
Thread 2 :: Reader 0 is reading the database
Thread 2 :: Reader 0 is reading the database
Thread 2 :: Reader 0 is leaving the database
Thread 1 :: Reader 0 is reading the database
Thread 1 :: Reader 0 is leaving the database
Thread 0 :: Reader 0 is reading the database
Thread 0 :: Reader 0 is reading the database
Thread 0 :: Reader 0 is leaving the database
Thread 1 :: Writer 0 is writting into the database
Thread 1 :: Writer 0 is leaving the database
Thread 0 :: Writer 0 is writting into the database
Thread 0 :: Writer 0 is leaving the database
Thread 2 :: Writer 0 is writting into the database
Thread 2 :: Writer 0 is leaving the database
Thread 3 :: Writer 0 is writting into the database
Thread 3 :: Writer 0 is leaving the database
Thread 3 :: Writer 0 is trying to enter into database
Thread 3 :: Writer 0 is writting into the database
Thread 3 :: Writer 0 is leaving the database
Thread 1 :: Writer 0 is trying to enter into database
Thread 1 :: Writer 0 is writting into the database
Thread 1 :: Writer 0 is leaving the database
Thread 2 :: Writer 0 is trying to enter into database
Thread 2 :: Writer 0 is writting into the database
Thread 2 :: Writer 0 is leaving the database
Number of threads = (4)
Thread 0 :: Writer 0 is trying to enter into database
Thread 0 :: Writer 0 is writting into the database
Thread 0 :: Writer 0 is leaving the database
Thread 1 :: Reader 0 is trying to enter into the Database
Thread 1 :: Reader 0 is reading the database
Thread 3 :: Reader 1 is trying to enter into the Database
Thread 2 :: Reader 0 is trying to enter into the Database
Thread 2 :: Reader 0 is reading the database
Thread 2 :: Reader 0 is leaving the database
Thread 1 :: Reader 0 is reading the database
Thread 1 :: Reader 0 is leaving the database
Number of threads = (4)
Thread 0 :: Reader 1 is trying to enter into the Database
Thread 0 :: Reader 1 is reading the database
Thread 0 :: Reader 1 is leaving the database
Number of threads = (4)
Thread 0 :: Writer 1 is trying to enter into database
Thread 3 :: Reader 1 is reading the database
Thread 3 :: Reader 1 is leaving the database
Thread 1 :: Writer 1 is trying to enter into database
Thread 3 :: Reader 0 is trying to enter into the Database
Thread 2 :: Reader 1 is trying to enter into the Database
Thread 2 :: Reader 1 is reading the database
Thread 2 :: Reader 1 is leaving the database
Thread 1 :: Reader 1 is trying to enter into the Database
Thread 1 :: Reader 1 is reading the database
Thread 1 :: Reader 1 is leaving the database
Thread 2 :: Writer 1 is trying to enter into database
Thread 3 :: Writer 1 is trying to enter into database
Number of threads = (4)
Thread 0 :: Reader 0 is trying to enter into the Database
Thread 0 :: Reader 0 is reading the database
Thread 0 :: Reader 0 is leaving the database
Thread 0 :: Writer 1 is writting into the database
Thread 0 :: Writer 1 is leaving the database
Thread 1 :: Writer 1 is writting into the database
Thread 1 :: Writer 1 is leaving the database
Thread 3 :: Reader 0 is reading the database
Thread 3 :: Reader 0 is reading the database
Thread 3 :: Reader 0 is leaving the database
Thread 2 :: Writer 1 is writting into the database
Thread 2 :: Writer 1 is leaving the database
Thread 3 :: Writer 1 is writting into the database
Thread 3 :: Writer 1 is leaving the database
I cannot understand why !$OMP DO is apparently distributing the work to different threads, yet this cannot be detected using the OpenMP library function OMP_GET_THREAD_NUM().
program test
  implicit none
  integer :: i, su
  double precision a(10), b(10), c
  INTEGER OMP_GET_THREAD_NUM
  su = 0
!$OMP DO
  do i = 1, 10
     b(i) = 10*i;
     c = b(i);
     write(*,*) 'in the loop, rank =', c, OMP_GET_THREAD_NUM()
  enddo
!$OMP END DO
!$OMP PARALLEL
  write(*,*) 'Rank = ', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
end
The result is:
in the loop, rank = 10.000000000000000 0
in the loop, rank = 20.000000000000000 0
in the loop, rank = 30.000000000000000 0
in the loop, rank = 40.000000000000000 0
in the loop, rank = 50.000000000000000 0
in the loop, rank = 60.000000000000000 0
in the loop, rank = 70.000000000000000 0
in the loop, rank = 80.000000000000000 0
in the loop, rank = 90.000000000000000 0
in the loop, rank = 100.00000000000000 0
Rank = 0
Rank = 6
Rank = 1
Rank = 7
Rank = 2
Rank = 5
Rank = 4
Rank = 3
See? It seems that only the master thread is visible inside the DO loop. That seems unfair, since the work is surely not its contribution alone.
Your do loop is not in a parallel region, so it is not parallelized -- all loop indices are processed by thread 0.
If I change your program to include a parallel region
...
!$OMP PARALLEL
!$OMP DO
do i = 1, 10
   b(i) = 10*i;
   c = b(i);
   write(*,*) 'in the loop, rank =', c, OMP_GET_THREAD_NUM()
enddo
!$OMP END DO
!$OMP END PARALLEL
...
then I get proper output of OMP thread numbers:
in the loop, rank = 50.000000000000000 8
in the loop, rank = 20.000000000000000 3
in the loop, rank = 20.000000000000000 7
in the loop, rank = 20.000000000000000 4
in the loop, rank = 20.000000000000000 5
in the loop, rank = 20.000000000000000 9
in the loop, rank = 20.000000000000000 6
in the loop, rank = 20.000000000000000 0
in the loop, rank = 20.000000000000000 1
in the loop, rank = 30.000000000000000 2
This particular output also exposes a flaw in your code, namely that c is shared, so its value is being clobbered by each thread. Also, if the do loop is the only thing in the parallel region you can combine the two OMP directives into one. Finally, if we change your code to:
!$OMP PARALLEL DO private(c)
do i = 1, 10
   b(i) = 10*i;
   c = b(i);
   write(*,*) 'in the loop, rank =', c, OMP_GET_THREAD_NUM()
enddo
!$OMP END PARALLEL DO
then the output will be correct.
in the loop, rank = 100.00000000000000 9
in the loop, rank = 30.000000000000000 2
in the loop, rank = 20.000000000000000 1
in the loop, rank = 70.000000000000000 6
in the loop, rank = 60.000000000000000 5
in the loop, rank = 50.000000000000000 4
in the loop, rank = 80.000000000000000 7
in the loop, rank = 90.000000000000000 8
in the loop, rank = 10.000000000000000 0
in the loop, rank = 40.000000000000000 3
I have a computer with nprocs processors and I'd like to initialize two BLACS grids, one of dimension p x q = nprocs and one of dimension 1 x 1.
Assuming MPI is already initialized and a routine has found good block sizes, the first grid is initialized via
call blacs_get( -1, 0, self%context )
call blacs_gridinit( self%context, 'R', self%nprows, self%npcols )
call blacs_gridinfo( self%context, self%nprows, self%npcols, self%myrow, self%mycol )
But how do I set up the second one? Do I have to introduce another MPI communicator first?
As an answer and example, I share this implementation:
call blacs_get( -1, 0, self%context )
call blacs_gridinit( self%context, 'R', self%nprows, self%npcols )
call blacs_gridinfo( self%context, self%nprows, self%npcols, self%myrow, self%mycol )
print*, "A ", self%context, self%nprows, self%npcols, self%myrow, self%mycol
call sleep(1)
call blacs_get( -1, 0, val )
call blacs_gridinit( val, 'R', 1, 1 )
call blacs_gridinfo( val, self%nprows, self%npcols, self%myrow, self%mycol )
call sleep(1)
print*, "B ", val, self%nprows, self%npcols, self%myrow, self%mycol
call sleep(1)
call blacs_get( -1, 0, val2 )
call blacs_gridinit( val2, 'R', 2, 2 )
call blacs_gridinfo( val2, self%nprows, self%npcols, self%myrow, self%mycol )
call sleep(1)
print*, "C ", val2, self%nprows, self%npcols, self%myrow, self%mycol
This sets up three BLACS contexts, with no need to initialize another MPI communicator, and produces the following output on four cores:
A 0 2 2 1 1
A 0 2 2 0 0
A 0 2 2 1 0
A 0 2 2 0 1
B -1 -1 -1 -1 -1
B -1 -1 -1 -1 -1
B -1 -1 -1 -1 -1
B 1 1 1 0 0
C 1 2 2 1 0
C 1 2 2 1 1
C 1 2 2 0 1
C 2 2 2 0 0
So, the crucial point is that the first argument of blacs_gridinit is an input/output argument: it needs the global BLACS context of all processes as input. That context is received into a new variable by the call to blacs_get (its third argument).
What I found quite counterintuitive in this case is that the value of the context handle seems to follow some kind of sum rule, so after initializing the 1 x 1 grid and then another 2 x 2 grid, the values of that second grid's handle are not the same on all processes.