openmp Fortran--- Code showing same performance - fortran

I am a new user of openmp. I have written the following code in fortran and tried to add parallel feature to it using openmp. Unfortunately, it is taking same time as serial version of this subroutine. I am compiling it using this f2py command. Am sure, I am missing a key concept here but unable to figure it out. Will really appreciate on getting help on this.
!f2py -c --opt='-O3' --f90flags='-fopenmp' -lgomp -m g3Test g3TestA.f90
exp1 =0.0
exp2 =0.0
exp3 =0.0
!$OMP PARALLEL DO shared(xConfig,s1,s2,s3,c1,c2,c3) private(h)&
!$OMP REDUCTION(+:exp1,exp2,exp3)
do k=0,numRows-1
xConfig(0:2) = X(k,0:2)
do h=0,nPhi-1
exp1(h) = exp1(h)+exp(-((xConfig(0)-c1(h))**2)*s1)
exp2(h) = exp2(h)+exp(-((xConfig(1)-c2(h))**2)*s2)
exp3(h) = exp3(h)+exp(-((xConfig(2)-c3(h))**2)*s3)
end do
end do
!$OMP END PARALLEL DO
ALine = exp1+exp2+exp3

As neatly explained in this OpenMP Performance training course material from the University of Edinburgh for example, there are a number of reasons why OpenMP code does not necessarily scale as you would expect (for example how much of the serial runtime is taken by the part you are parallelising, synchronisation between threads, communication, and other parallel overheads).
You can easily test the performance with different numbers of threads by calling your python script like, e.g. with 2 threads:
env OMP_NUM_THREADS=2 python <your script name>
and you may consider adding the following lines in your code example to get a visual confirmation of the number of threads being used in the OpenMP part of your code:
do k=0,numRows-1
!this if-statement is only for debugging, remove for timing
!$ if (k==0) then
!$ print *, 'num_threads running:', OMP_get_num_threads()
!$ end if
xConfig(0:2) = X(k,0:2)

Related

SIGBUS occurs when fortran code reads file on linux cluster

When I run my fortran code in parallel on a linux cluster with mpirun I get a sigbus error.
It occurs while reading a file, the timing is irregular, and sometimes it proceeds without error.
I have tried debug compilation options like -g, but I haven't gotten any information on what line the error is coming from.
Actually the code was executed previously in three different clusters without this error, but the error is only occurring on this machine.
I personally suspect this is related to the performance of the machine (especially storage i/o), but I am not sure.
The program code is simple. Each process executed by mpirun reads the file corresponding to its rank as follows.
!!!!!!!!!! start of code
OPEN(11, FILE='FILE_NAME_WITH_RANK', FORM='UNFORMATTED')
READ(11,*) ISIZE
ALLOCATE(SOME_VARIABLE(ISIZE))
DO I = 1, ISIZE
READ(11,*) SOME_VARIABLE(I)
ENDDO
READ(11,*) ISIZE2
ALLOCATE(SOME_VARIABLE2(ISIZE2))
DO I = 1, ISIZE2
READ(11,*) SOME_VARIABLE2(I)
ENDDO
! MORE VARIABLES
CLOSE(11)
!!!!!!!!!! end of code
I used 191 cpu, and the total size of 191 files it loads is about 11 GB.
The cluster used for execution consists of 24 nodes with 16 cpu each (384 cpu total) and uses common storage that is shared with another cluster.
I ran the code in parallel by specifying nodes 1 through 12 as the hostfile.
Initially, I had 191 cpu read all files at the same time out of sequence.
After doing so, the program ended with a sigbus error. Also, for some nodes, the ssh connection was delayed, and the bashrc file cannot be found by node with stale file handle error.
The stale file handle error waited a bit and it seemed to recover by itself, but I'm not sure what the system administrator did.
So, I changed it to the following code so that only one cpu can read the file at a time.
!!!!!!!!!! start of code
DO ICPU = 0, NUMBER_OF_PROCESS-1
IF(ICPU.EQ.MY_PROCESS) CALL READ_FILE
CALL MPI_BARRIER(MPI_COMMUNICATOR,IERR)
ENDDO
!!!!!!!!!! end of code
This seemed to work fine for single execution, but if I ran more than one of these programs at the same time, the first mpirun stopped and both ended with a sigbus error eventually.
My next attempt is to minimize the execution of the read statement by deleting the do statement when reading the array. However, due to limited time, I couldn't test the effectiveness of this modification.
Here are some additional information.
If I execute a search or copy a file with an explorer such as nautilus while running a parallel program, nautilus does not respond or the running program raise sigbus. In severe cases, I wasn't able to connect the VNC server with stale file handle errors.
I use OpenMPI 2.1.1, GNU Fortran 4.9.4.
I compile the program with following
$OPENMPIHOME/bin/mpif90 -mcmodel=large -fmax-stack-var-size-64 -cpp -O3 $SOURCE -o $EXE
I execute the program with following in gnome terminal
$OPENMPIHOME/bin/mpirun -np $NP -x $LD_LIBRARY_PATH --hostfile $HOSTFILE $EXE
The cluster is said to be running commercial software like FLUENT without problems.
Summing up the above, my personal suspicion is that the storage of the cluster is dismounted due to the excessive disk I/O generated by my code, but I don't know if this makes sense because I have no cluster knowledge.
If yes, I wonder if there is a way to minimize the disk I/O, if it is enough to proceed with the vectorized I/O mentioned above, or if there is an additional part.
I would appreciate it if you could tell me anything about the problem.
Thanks in advance.
!!!
I wrote an example code. As mentioned above, it may not be easy to reproduce because the occurrence varies depending on the machine.
PROGRAM BUSWRITE
IMPLICIT NONE
INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
INTEGER :: I
INTEGER :: I1, I2, I3
CHARACTER*3 CPUNUM
INCLUDE 'mpif.h'
INTEGER ISTATUS(MPI_STATUS_SIZE)
INTEGER :: IERR, NPES, MYPE
CALL MPI_INIT(IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPES,IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYPE,IERR)
I1=MOD(MYPE/100,10)+48
I2=MOD(MYPE/10 ,10)+48
I3=MOD(MYPE ,10)+48
CPUNUM=CHAR(I1)//CHAR(I2)//CHAR(I3)
OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
ALLOCATE(ARRAY1(ISIZE1))
ALLOCATE(ARRAY2(ISIZE2))
ALLOCATE(ARRAY3(ISIZE3))
DO I = 1, ISIZE1
ARRAY1(I) = I
WRITE(11) ARRAY1(I)
ENDDO
DO I = 1, ISIZE2
ARRAY2(I) = I**2
WRITE(11) ARRAY2(I)
ENDDO
DO I = 1, ISIZE3
ARRAY3(I) = I**3
WRITE(11) ARRAY3(I)
ENDDO
CLOSE(11)
CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./buswrite.f90 -o ./buswrite
mpirun -np 32 ./buswrite
I've got 32 000.DAT ~ 031.DAT
PROGRAM BUSREAD
IMPLICIT NONE
INTEGER, PARAMETER :: ISIZE1 = 10000, ISIZE2 = 20000, ISIZE3 = 30000
DOUBLE PRECISION, ALLOCATABLE :: ARRAY1(:), ARRAY2(:), ARRAY3(:)
INTEGER :: I
INTEGER :: I1, I2, I3
CHARACTER*3 CPUNUM
INCLUDE 'mpif.h'
INTEGER ISTATUS(MPI_STATUS_SIZE)
INTEGER :: IERR, NPES, MYPE
CALL MPI_INIT(IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,NPES,IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,MYPE,IERR)
I1=MOD(MYPE/100,10)+48
I2=MOD(MYPE/10 ,10)+48
I3=MOD(MYPE ,10)+48
CPUNUM=CHAR(I1)//CHAR(I2)//CHAR(I3)
OPEN(11, FILE=CPUNUM//'.DAT', FORM='UNFORMATTED')
ALLOCATE(ARRAY1(ISIZE1))
ALLOCATE(ARRAY2(ISIZE2))
ALLOCATE(ARRAY3(ISIZE3))
DO I = 1, ISIZE1
READ(11) ARRAY1(I)
IF(ARRAY1(I).NE.I) STOP
ENDDO
DO I = 1, ISIZE2
READ(11) ARRAY2(I)
IF(ARRAY2(I).NE.I**2) STOP
ENDDO
DO I = 1, ISIZE3
READ(11) ARRAY3(I)
IF(ARRAY3(I).NE.I**3) STOP
ENDDO
CLOSE(11)
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
IF(MYPE.EQ.0) WRITE(*,*) 'GOOD'
CALL MPI_FINALIZE(IERR)
END PROGRAM
mpif90 -ffree-line-length-0 ./busread.f90 -o ./busread
mpirun -np 32 ./busread
I've got 'GOOD' output text from terminal as expected, but the machine in question is terminated with a sigbus error while running busread.
The issue was not observed after a device reboot. Even though I ran 4 programs at the same time under the same conditions, no problem occurred. In addition, other teams that used the device also had similar problems, which were resolved after reboot. The conclusion is a bit ridiculous, but if there are any people experiencing similar problems, I would like to summarize it as follows.
If your program terminates abnormally due to a memory error (like sigbus and sigsegv) while reading or writing a file, you can check the following.
Make sure there are no errors in your program. Check whether the time of occurrence of the error is constant or irregular, whether other programs have the same symptoms, whether it runs well on other machines, and whether there is a problem when run with a memory error checking tool such as valgrind.
Optimize the file I/O part. In the case of fortran, processing an entire array is tens of times faster than processing by element.
Immediately after an error occurs, try ssh connection to the machine (or node) to check whether the connection is smooth and that the file system is well accessed. If you cannot access the bashrc file or an error such as stale file handle occurs, please contact the system manager after combining the above reviewed information.
If someone has anything to add or if this post isn't appropriate, please let me know.

Coarray deadlock in do-cycle-exit

A strange phenomenon occurs in the following Coarray code
program strange
implicit none
integer :: counter = 0
logical :: co_missionAccomplished[*]
co_missionAccomplished = .false.
sync all
do
if (this_image()==1) then
counter = counter+1
if (counter==2) co_missionAccomplished = .true.
sync images(*)
else
sync images(1)
end if
if (co_missionAccomplished[1]) exit
cycle
end do
write(*,*) "missionAccomplished on image ", this_image()
end program strange
This program never ends, as it appears that there is a deadlock for any counter threshold beyond 1 inside the loop. The code is compiled with Intel Fortran 2018 Windows OS, with the following flags:
ifort /debug /Qcoarray=shared /standard-semantics /traceback /gen-interfaces /check /fpe:0 normal.f90 -o run.exe
The same code, using DO WHILE construct, also appears to suffer from the same phenomenon:
program strange
implicit none
integer :: counter = 0
logical :: co_missionAccomplished[*]
co_missionAccomplished = .true.
sync all
do while(co_missionAccomplished[1])
if (this_image()==1) then
counter = counter+1
if (counter==2) co_missionAccomplished = .false.
sync images(*)
else
sync images(1)
end if
end do
write(*,*) "missionAccomplished on image ", this_image()
end program strange
This seems now too trivial to be a compiler bug, so I am probably missing something important about do-loops in parallel. any help is appreciated.
UPDATE:
Adding a SYNC ALL statement before the CYCLE statement in the DO-CYCLE-EXIT example program above resolves the deadlock. Also, a SYNC ALL statement right after DO WHILE statement, as the first line of the block resolves the deadlock. So apparently, all the images must be synced to avoid a deadlock before each cycle of the loops in either case above.
Regarding "This seems now too trivial to be a compiler bug", you may be very surprised about how seemingly trivial things can be treated incorrectly by a compiler. Few things relating to coarrays are trivial.
Consider the following program which is related:
implicit none
integer i[*]
do i=1,1
sync all
print '(I1)', i[1]
end do
end
I get the initially surprising output
1
2
when run with two images under ifort 2018.1.
Let's look at what's going on.
In my loop, i[1] first has value 1 when the images are synchronized. However, by the time the second image accesses the value, it's been changed by the first image ending its iteration.
We solve that little problem by putting an extra synchronization statement before the end do.
How is this program related to the one of the question? It's the same lack of synchronization between testing a value on a remote image and that image updating it.
Between the synchronization and other images testing the value of co_missionAccomplished[1], the first image may dash around and update counter, then co_missionAccomplished. Some images may see the exit state in their first iteration.

Nested OpenMP parallel regions not iterating as expected

This is probably a dumb question, but I'm just starting out with OpenMP due to increased data volumes.
I'm going through "Parallel Programming in Fortran 95 using OpenMP" by Miguel Hermanns and am very early in the book. One of the early examples shows the use of nested parallel regions and indicates that it should produce N2 + N lines of output. The procedure looks like this:
program helloworld
!$OMP PARALLEL
write(*,*) "Hello"
!$OMP PARALLEL
write(*,*) "Hi"
!$OMP END PARALLEL
!$OMP END PARALLEL
end program helloworldcode
I would expect 12 Hellos and 144 His, but instead I get 12 of each:
$ ./helloworld.exe
Hello
Hello
Hello
Hi
Hi
Hello
Hello
Hello
Hello
Hello
Hello
Hi
Hi
Hello
Hello
Hi
Hi
Hi
Hi
Hi
Hello
Hi
Hi
Hi
Why am I not getting the 156 lines of output that I would expect?
By default, OpenMP serializes all the nested parallel regions in order to prevent the worst case of quadratic over-subscription when N^2 worker threads are created. With big enough number of processors (say >=16), quadratic over-subscription can ruin execution with nightmare overheads or cause resource exhaustion issue when requested number of threads just cannot be created.
For information how to enable nested parallelism in OpenMP on your risk, please refer to omp_set_nested and corresponding environment variable OMP_NESTED.

changing thread number doesn't affect code

I am trying to learn xeon-phi , and while studying the Intel Xeon-Phi Coprocessor HPC book , I tried to run the code here. (from book)
The code uses openmp and 2 threads.
But the results I am taking are the same as running with 1 thread.
( no use of openmp at all )
I even used in mic different combinations but still the same:
export OMP_NUM_THREADS=2
export MIC_OMP_NUM_THREADS=124
export MIC_ENV_PREFIX=MIC
It seems that somehow openmp is not enabled?Am I missing something here?
The code using only 1 thread is here
I compiled using:
icc -mmic -openmp -qopt-report -O3 hello.c
Thanks!
I am not sure exactly which book you are talking about, but perhaps this will help.
The code you show does not use the offload programming style and must be run natively on the the coprocessor, meaning you copy the executable to the coprocessor and run it there or you use the micnativeloadex utility to run the code from the host processor. You show that you know the code must be run natively because you compiled it with the -mmic option.
If you use micnativeloadex, then the number of omp threads on the coprocessor is set by executing "export MIC_OMP_NUM_THREADS=124" on the host. If you copy the executable to the coprocessor and then log in to run it there, the number of omp threads on the coprocessor is set by executing "export OMP_NUM_THREADS=124" on the coprocessor. If you use "export OMP_NUM_THREADS=2" on the coprocessor, you get only two threads; the MIC_OMP_NUM_THREADS environment variable is not used if you set it directly on the coprocessor.
I don't see any place in the code where it prints out the number of threads, so I don't know for sure how you determined the number of threads actually being used. I suspect you were using a tool like micsmc. However micsmc tells you how may cores are in use, not how many threads are in use.
By default, the omp threads are laid out in order, so that the first core would run threads 0,1,2,3, the second core would run threads 4,5,6,7 and so on. If you are using only two threads, both threads would run on the first core.
So, is that what you are seeing - not that you are using only one thread but instead that you are using only one core?
I was looking at the serial version of the code you are using. For the following lines:
for(j=0; j<MAXFLOPS_ITERS; j++)
{
//
// scale 1st array and add in the 2nd array
// example usage - y = mx + b;
//
for(k=0; k<LOOP_COUNT; k++)
{
fa[k] = a * fa[k] + fb[k];
}
}
I see that here you do not scan the complete array. Instead you keep on updating the first 128 (LOOP_COUNT) elements of the array Fa. If you wish to compare this serial version to the parallel code you are referring to, then you will have to ensure that the program does same amount of work in both versions.
Thanks
I noticed three things in your first program omp:
the total floating point operations should reflect the number of threads doing the work. Therefore,
gflops = (double)( 1.0e-9*LOOP_COUNTMAXFLOPS_ITERSFLOPSPERCALC*numthreads);
You harded code the number of thread = 2. If you want to use the OMP env variable, you should comment out the API "omp_set_num_threads(2);"
After transferring the binary to the coprocessor, to set the OMP env variable in the coprocessor please use OMP_NUM_THREADS, and not MIC_OMP_NUM_THREADS. For example, if you want 64 threads to run your program in the coprocessor:
% ssh mic0
% export OMP_NUM_THREADS=64

accelerator directives not working

This is the code for a matrix multiplication
program ex
implicit none
real :: a(256,256),b(256,256),c(256,256),t1,t2
integer i,j,k,sum
sum=0
do j = 1,256
do i = 1,256
a(i,j) = 1
b(i,j) = 1
c(i,j) = 0.0
enddo
enddo
call cpu_time(t1)
!$acc region do
do i=1,256
do j=1,256
sum=0
do k=1,256
sum=sum+a(i,k)*b(k,j)
c(i,j)=sum
end do
end do
end do
!$acc end region
call cpu_time(t2)
print*,"cpu time=",t2-t1
print*,c
end program ex
When I execute this the execution time is 75 msec when using the accelerator directives and the PGI compiler. But when I run same matrix multiplication with a "cuda fortran" implementation the execution time is only 5msec. So there is big difference even though I used the accelerator directives. So I doubt that my accelerator directives are working properly.
I tried to accelerate your program using very similar accelerator directives OpenHMPP. Note that I switched one your line, that is probably errorneously in the innermost loop. Also note, that I had to advice the compiler of the reduction taking place. Also I renamed the reduction variable, because it shadowed the sum intrinsic function.
The performance is not good, because of the overheead with starting the GPU kernel and because of the memory transfers. You need orders of magnitude more work for it to be profitable to use GPU.
For example when I used matrices 2000 x 2000 then the CPU execution time was 41 seconds, but GPU execution time only 8 s.
program ex
implicit none
real :: a(256,256),b(256,256),c(256,256),t1,t2
integer i,j,k,sm
sm=0
do j = 1,256
do i = 1,256
a(i,j) = 1
b(i,j) = 1
c(i,j) = 0.0
enddo
enddo
call cpu_time(t1)
!$hmpp region, target = CUDA
!$hmppcg gridify, reduce(+:sm)
do i=1,256
do j=1,256
sm=0
do k=1,256
sm=sm+a(i,k)*b(k,j)
end do
c(i,j)=sm
end do
end do
!$hmpp endregion
call cpu_time(t2)
print*,"cpu time=",t2-t1
print*,sum(c)
end program ex
edit: it would be probably not to use reduce(+:sm), but just private(sm)
FYI, the OP also posted this question on the PGI User Forum (http://www.pgroup.com/userforum/viewtopic.php?t=3081). We believe the original issue was the result of pilot error. When we profiled his code using CUDA Prof, the CUDA Fortran kernel execution time was 205 ms versus 344 ms using the PGI Accelerator Model. Also, if I fix his code so that "c(i,j)=sum" is placed outside of the inner "k" loop, the PGI Accelerator Model time reduces to 123ms. It's unclear how he gathered his timings.
Thanks to those that tried to help.
- Mat