Parallelizing formated writes in Fortran with OpenMP - fortran

I'm trying to parallelize a Fortran code which at one moment writes a tons of numbers to formated output. Some simple profiling showed that most CPU time is spent in format conversion, so I had the idea to do the formatting in parallel to character buffers and later write the unformatted buffers to the file.
My proof of concept looks like this:
program parawrite
implicit none
integer (kind = 4) :: i, j, tstart, tstop, rate
integer (kind = 4), parameter :: bufsize = 100000, n = 10000000, llen = 22
character (kind=1, len=:), allocatable :: buf
real (kind=8), dimension(n) :: a
! some input
do i = 1, n
a(i) = dble(i) * dble(i)
enddo
! formated writes for reference
open(unit=10, file="out1.txt", form="formatted")
call system_clock(tstart, rate);
do i = 1, n
write(10,"(E21.15)") a(i)
end do
call system_clock(tstop, rate);
print *, 'Formated write: ', dble(tstop - tstart) / dble(rate), 's'
close(10)
! parallel stuff
open(unit=10, file="out2.txt", access="stream", form="unformatted")
call system_clock(tstart, rate);
!$omp parallel private(buf, j)
allocate(character(bufsize * llen) :: buf)
j = 0;
!$omp do ordered schedule(dynamic,bufsize)
do i = 1, n
write (buf(j*llen+1:(j+1)*llen),"(E21.15,A1)") a(i), char(10)
j = j + 1
if (mod(i, bufsize) == 0) then
!$omp ordered
write (10) buf
!$omp end ordered
j = 0
end if
end do
deallocate(buf)
!$omp end parallel
close(10)
call system_clock(tstop, rate);
print *, 'Parallel write: ', dble(tstop - tstart) / dble(rate), 's'
end program parawrite
When I run it, however, not only is the parallel version much slower when at single thread, it also doesn't scale too much...
$ gfortran -O2 -fopenmp writetest.f90
$ OMP_NUM_THREADS=1 ./a.out
Formated write: 11.330000000000000 s
Parallel write: 15.625999999999999 s
$ OMP_NUM_THREADS=6 ./a.out
Formated write: 11.331000000000000 s
Parallel write: 6.1799999999999997 s
My first question would be how to make it the same speed at single thread? The time spent writing the buffer to the file is negligible, so why are the writes to the buffer slower than when writing directly to file?
My second question is about why the scaling is so bad? I have an equivalent C code which uses sprintf and fwrite and there I can get almost perfect linear scaling (I can post the code if needed), however with Fortran I can only reduce runtime to around 40% at 6 threads (with C I can reduce it to 18% at the same number of threads). It is still faster than the serial version, but I hope this could be improved.

From some experiments, it seems that an internal file is rather slow if an array element is converted to an internal file one at a time. This is also the case for an external file, but the degree of slowdown seems much greater for internal files (for some reason...). So I've modified the code such that a set of array elements are converted at once and then written to an external file via stream output. Below, four patterns are compared:
Sequential (1): The original code (which writes each element via do-loop)
Sequential (2): Write an array at once (or via implied loop) to an external file
Parallel (1): Make an internal file for many elements and then write to an external file
Parallel (2): Simplest parallel code with formatted write or spirntf for each element
Among these, Parallel (2) + sprintf (marked with *2 in the code) was the fastest, while Parallel (2) + write for each element (marked with *1) was the slowest (timing shown as Parallel (*) in the table, which does not scale with OpenMP for some reason). I guess sprintf will be the fastest probably because of the least amount of internal checks and overhead etc (just a guess!)
Results (please see the bottom for the modified codes)
$ gcc -O3 -c conv.c && gfortran -O3 -fopenmp test.f90 conv.o
# Machine: Core i7-8550U (1.8GHz), 4-core/8-thread, Ubuntu18.04 (GCC7.3.0)
# Note: The amount of data has been reduced to 1/5 of the
# original code, n = bufsize * 20, but the relative
# timing results remain the same even for larger data.
$ OMP_NUM_THREADS=1 ./a.out
Sequential (1): 2.0080000000000000 s
Sequential (2): 1.6510000000000000 s
Parallel (1): 1.6960000000000000 s
Parallel (2): 1.2640000000000000 s
Parallel (*): 3.1480000000000001 s
$ OMP_NUM_THREADS=2 ./a.out
Sequential (1): 1.9990000000000001 s
Sequential (2): 1.6479999999999999 s
Parallel (1): 0.98599999999999999 s
Parallel (2): 0.72999999999999998 s
Parallel (*): 1.8600000000000001 s
$ OMP_NUM_THREADS=4 ./a.out
Sequential (1): 2.0289999999999999 s
Sequential (2): 1.6499999999999999 s
Parallel (1): 0.61199999999999999 s
Parallel (2): 0.49399999999999999 s
Parallel (*): 1.4470000000000001 s
$ OMP_NUM_THREADS=8 ./a.out
Sequential (1): 2.0059999999999998 s
Sequential (2): 1.6499999999999999 s
Parallel (1): 0.56200000000000006 s
Parallel (2): 0.41299999999999998 s
Parallel (*): 1.7689999999999999 s
main.f90:
program main
implicit none
integer :: i, j, k, tstart, tstop, rate, idiv, ind1, ind2
integer, parameter :: bufsize = 100000, n = bufsize * 20, llen = 22, ndiv = 8
character(len=:), allocatable :: buf(:), words(:)
character(llen + 1) :: word
real(8), allocatable :: a(:)
allocate( a( n ) )
! Some input
do i = 1, n
a(i) = dble(i)**2
enddo
!.........................................................
! Formatted writes (1).
open(unit=10, file="dat_seq1.txt", form="formatted")
call system_clock(tstart, rate);
do i = 1, n
write(10,"(ES21.15)") a(i)
end do
call system_clock(tstop, rate);
print *, 'Sequential (1):', dble(tstop - tstart) / dble(rate), 's'
close(10)
!.........................................................
! Formatted writes (2).
open(unit=10, file="dat_seq2.txt", form="formatted")
call system_clock(tstart, rate);
write( 10, "(ES21.15)" ) a
! write( 10, "(ES21.15)" ) ( a( k ), k = 1, n )
call system_clock(tstop, rate);
print *, 'Sequential (2):', dble(tstop - tstart) / dble(rate), 's'
close(10)
!.........................................................
! Parallel writes (1): make a formatted string for many elements at once
allocate( character( llen * bufsize / ndiv ) :: buf( ndiv ) )
open(unit=10, file="dat_par1.txt", access="stream", form="unformatted")
call system_clock(tstart, rate);
do i = 1, n, bufsize
!$omp parallel do private( idiv, ind1, ind2, k ) shared( i, buf, a )
do idiv = 1, ndiv
ind1 = i + (idiv - 1) * bufsize / ndiv
ind2 = ind1 + bufsize / ndiv - 1
write( buf( idiv ),"(*(ES21.15, A1))") &
( a( k ), char(10), k = ind1, ind2 )
enddo
!$omp end parallel do
write(10) buf
end do
call system_clock(tstop, rate);
print *, 'Parallel (1):', dble(tstop - tstart) / dble(rate), 's'
deallocate(buf)
close(10)
!.........................................................
! Parallel writes (2): sprintf vs write for each element
allocate( character( llen ) :: words( n ) )
open(unit=10, file="dat_par2.txt", access="stream", form="unformatted")
call system_clock(tstart, rate);
!$omp parallel do private( i, word ) shared( a, words )
do i = 1, n
! write( word, "(ES21.15, A1)" ) a( i ), char(10) !! slow (*1)
call conv( word, a( i ) ) !! sprintf (*2)
words( i ) = word( 1 : llen )
enddo
!$omp end parallel do
write( 10 ) words
call system_clock(tstop, rate);
print *, 'Parallel (2):', dble(tstop - tstart) / dble(rate), 's'
close(10)
end program
conv.c:
#include <stdio.h>
void conv_( char *buf, double *val )
{
sprintf( buf, "%21.15E\n", *val );
}

Related

Parallelizing DO loop with nvfortran on gpu

I am tring to parallelize a do loop in Fortran. Using OMP parallel do (and converted to standard do loop) it works nicely (using both gfortran and nvfortran), but when compiling it with nvfortran -stdpar=gpu it compiles, but running it, it crashes with:
0 Current file: xxx/pi.f90
function: pi
line: 15 This file was compiled: -acc=gpu -gpu=cc35 -gpu=cc50 -gpu=cc60 -gpu=cc60 -gpu=cc70 -gpu=cc75 -gpu=cc80 -
Here is the code:
program pi
implicit none
integer :: count, n, i
real :: r
real, dimension(10000) :: x,y
logical , dimension(10000) :: c
c = .false.
n=size(x,1)
print*,count(c)
call RANDOM_SEED
call random_number(x)
call random_number(y)
do concurrent (i = 1: n)
if (x(i)**2 + y(i)**2 <1.0) c(i)=.true.
end do
r = 4 * real(count(c))/n
print *, r
end program pi

MPI, Fortran, multitasking

I would like to perform many independent operations (e.g. time integration of an ODE with different initial conditions) using MPI and Fortran. The initial conditions are a 2$\times 1000$ vector IC for example.
do i=1,1000
(x0,y0) = (x(i),y(i))
Solve an ODE with (x0,y0) for a time duration
Save the result at the end of this duration
enddo
Can anyone help with a minimal code using MPI or a link to something similar.
I have already used OMP but I think with MPI I would have access to more CPUs
If the operations are truly independent (and the number of cases is a multiple of the number of processors) then:
call mpi_scatter to distribute start points from root
call
call mpi_gather to collect the results back on root
root can then write to file.
If the number of processors doesn't divide into the number of cases then you can use mpi_scatterv and mpi_gatherv instead.
Example (rather trivial work per job, rather than solving ODEs):
program main
use iso_fortran_env
use mpi
implicit none
integer stat(mpi_status_size), tag, ierr
integer size, rank
integer, parameter :: N = 256 * 1000 ! assumes this is a multiple of the number of processors
integer, parameter :: root = 0
integer myN
integer i
real(real64), allocatable :: Y(:), myY(:)
real(real64) start, finish
call mpi_init( ierr )
call mpi_comm_size( mpi_comm_world, size, ierr )
call mpi_comm_rank( mpi_comm_world, rank, ierr )
! Set initial values for full array, then start timing
if ( rank == root ) then
allocate( Y(N) )
Y = [ ( i + 0.0_real64, i = 1, N ) ]
start = gettime()
end if
! Root parcels out the work (i.e., distributes the starting points)
! Processor with rank r will look at indices 1+r*N/size to (r+1)*N/size
myN = N / size
allocate( myY(myN) )
call mpi_scatter( Y, myN, mpi_double_precision, &
myY, myN, mpi_double_precision, root, mpi_comm_world, ierr )
! Each processor does its own work
call myWork( myN, myY )
! Root gets its results back
call mpi_gather ( myY, myN, mpi_double_precision, &
Y, myN, mpi_double_precision, root, mpi_comm_world, ierr )
! Root concludes timing, then writes to file
if ( rank == root ) then
finish = gettime()
write( *, * ) "Time taken = ", finish - start
open( 10, file="output.txt" )
write( 10, "( i8, 1x, es11.4 )" ) ( i, Y(i), i = 1, N )
close( 10 )
deallocate( Y )
end if
deallocate( myY )
call mpi_finalize( ierr )
contains
subroutine myWork( N, Y )
integer , intent(in ) :: N
real(real64), intent(inout) :: Y(N)
integer i
do i = 1, 10000
Y = 2 * Y - Y ! silly example, just to use some flops
end do
end subroutine myWork
real(real64) function getTime()
integer t(8)
call date_and_time( values=t )
getTime = 3600 * t(5) + 60 * t(6) + t(7) + 0.001 * t(8)
end function getTime
end program main

Speedup of calculation for symmetric matrix using OMP

My matrix calculation is: C=C-A*B
Here C is a symmetric matrix so I want to speed up this calculation by considering just the upper triangular and then take the opposite elelement. I used OMP and see that my implementation is slower than the normal calculation for the entire matrix C.
I also see that the calculation for C=C-AxB is slower than C=C+AxB.
My program is attached. Please advise me!
Program testspeed
implicit none
integer nstate,nmeas,i,j,l
integer(kind=8) :: tclock1, tclock2, clock_rate
real(kind=8) :: elapsed_time
double precision, allocatable, dimension(:,:):: B,C,A
nstate =20000
nmeas=10000
allocate (B(nmeas,nstate),C(nstate,nstate),A(nstate,nmeas))
A=1d0
B=1d0
call system_clock(tclock1)
write(*,*) "1"
!$omp parallel do
do j = 1, nstate
do l = 1,nmeas
do i = 1, j
C(j,i) = C(j,i) - A(j,l)*B(l,i)
C(i,j)=C(j,i)
end do
end do
end do
!$omp end parallel do
write(*,*) "2"
call system_clock(tclock2, clock_rate)
elapsed_time = float(tclock2 - tclock1) / float(clock_rate)
write(*,*) elapsed_time
end Program testspeed
One of the basic rules I have taught my students is that nobody should be writing dense matrix multiplies themselves nowadays - and should not have been doing for 30 years +. You should use the BLAS library instead. Below I compare using the BLAS library against your loop ordering and a better loop ordering, and also against the Fortran intrinsic function matmul which I use as a reference to check the results are correct. BLAS and matmul don't take advantage of the symmetry of C, yet they still are the fastest routines - BLAS is about 200-300 times quicker than the loop ordering you have written. Note I have also cut the matrix size down somewhat as I got bored waiting for the original to run for larger cases:
ijb#ijb-Latitude-5410:~/work/stack$ cat mm.f90
Program testspeed
Use, Intrinsic :: iso_fortran_env, Only : wp => real64, li => int64
Use omp_lib, Only : omp_get_max_threads
Implicit None
Integer nstate,nmeas,i,j,l
Integer(li) :: tclock1, tclock2, clock_rate
Real(wp) :: elapsed_time
Real( wp ), Allocatable, Dimension(:,:):: B,C,A
Real( wp ), Allocatable, Dimension(:,:):: C_test
Real( wp ), Allocatable, Dimension(:,:):: C_start
Write( *, * ) 'Using ', omp_get_max_threads(), ' threads'
!!$ nstate =2000
!!$ nmeas=1000
nstate = 5000
nmeas = 2500
Allocate (B(nmeas,nstate),C(nstate,nstate),A(nstate,nmeas))
Allocate( C_test, Mold = C )
Allocate( C_start, Mold = C )
!!$ A=1.0_wp
!!$ B=1.0_wp
! Random numbers are a much better test
Call Random_number( A )
B = Transpose( A ) ! make sure result is symmetric
Call Random_number( C_start )
! Make Initial C Symmetric
C_start = 0.5_wp * ( C_start + Transpose( C_start ) )
Write( *, * ) 'Matix sizes ', Shape( A ), Shape( B ), Shape( C )
C_test = C_start
Call system_Clock(tclock1)
C_test = C_test - Matmul( A, B )
Call system_Clock(tclock2, clock_rate)
elapsed_time = Real(tclock2 - tclock1,wp) / Real(clock_rate,wp)
Write( *,'( a, t20, f8.3 )' ) 'Matmul', elapsed_time
C = C_start
Call system_Clock(tclock1)
!$omp parallel do
Do j = 1, nstate
Do l = 1,nmeas
Do i = 1, j
C(j,i) = C(j,i) - A(j,l)*B(l,i)
C(i,j)=C(j,i)
End Do
End Do
End Do
!$omp end parallel do
Call system_Clock(tclock2, clock_rate)
elapsed_time = Real(tclock2 - tclock1,wp) / Real(clock_rate,wp)
Write(*,'( a, t20, f8.3, t30, "Max error ", g20.12 )' ) &
'Orig loops', elapsed_time, Maxval( Abs( C_test - C ) )
C = C_start
Call system_Clock(tclock1)
!$omp parallel default( none ) shared ( nstate, nmeas, A, B, C ), private( i, j, l )
!$omp do
Do i = 1, nstate
Do l = 1,nmeas
Do j = 1, i
C(j,i) = C(j,i) - A(j,l)*B(l,i)
End Do
End Do
End Do
!$omp end do
!$omp do
Do i = 1, nstate
Do j = 1, i
C( i, j ) = C( j, i )
End Do
End Do
!$omp end do
!$omp end parallel
Call system_Clock(tclock2, clock_rate)
elapsed_time = Real(tclock2 - tclock1,wp) / Real(clock_rate,wp)
Write(*,'( a, t20, f8.3, t30, "Max error ", g20.12 )' ) &
'Sensible loops', elapsed_time, Maxval( Abs( C_test - C ) )
C = C_start
Call system_Clock(tclock1)
Call dgemm( 'N', 'N', nstate, nstate, nmeas, -1.0_wp, A, Size( A, Dim = 1 ), &
B, Size( B, Dim = 1 ), &
+1.0_wp, C, Size( C, Dim = 1 ) )
Call system_Clock(tclock2, clock_rate)
elapsed_time = Real(tclock2 - tclock1,wp) / Real(clock_rate,wp)
Write(*,'( a, t20, f8.3, t30, "Max error ", g20.12 )' ) &
'BLAS ', elapsed_time, Maxval( Abs( C_test - C ) )
End Program testspeed
ijb#ijb-Latitude-5410:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb#ijb-Latitude-5410:~/work/stack$ gfortran -fopenmp -Wall -Wextra -std=f2018 -O3 mm.f90 -lopenblas
ijb#ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=1
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
Using 1 threads
Matix sizes 5000 2500 2500 5000 5000 5000
Matmul 4.793
Orig loops 421.564 Max error 0.488853402203E-11
Sensible loops 20.742 Max error 0.488853402203E-11
BLAS 2.185 Max error 0.682121026330E-12
ijb#ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=2
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
Using 2 threads
Matix sizes 5000 2500 2500 5000 5000 5000
Matmul 4.968
Orig loops 324.319 Max error 0.466116034659E-11
Sensible loops 17.656 Max error 0.466116034659E-11
BLAS 1.161 Max error 0.682121026330E-12
ijb#ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=3
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
Using 3 threads
Matix sizes 5000 2500 2500 5000 5000 5000
Matmul 4.852
Orig loops 243.268 Max error 0.500222085975E-11
Sensible loops 15.802 Max error 0.500222085975E-11
BLAS 0.852 Max error 0.682121026330E-12
ijb#ijb-Latitude-5410:~/work/stack$ export OMP_NUM_THREADS=4
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
Using 4 threads
Matix sizes 5000 2500 2500 5000 5000 5000
Matmul 4.994
Orig loops 189.189 Max error 0.477484718431E-11
Sensible loops 14.245 Max error 0.477484718431E-11
BLAS 0.707 Max error 0.682121026330E-12
For BLAS I have used openblas - which is freely available. On Linux system a simple apt get or similar should be enough.
Please also note
If you have to write your own loops your inner most loop should go, if possible, over the first index of your array. This is because Fortran orders its arrays as column major. This is what I have done in the "sensible" loop ordering
Real( 8 ) is not portable, not guaranteed to be supported by your compiler, and not guaranteed to do what you expect, and shouldn't be used. Similar for Integer( 8 ). Please see what I have done for a better way that should work with all compilers.
Float is not a standard intrinsic - use Real as I have done
As benchmarks are meaningless if the results are incorrect you should always include a way to test the results. Here I use the Fortran intrinsic matmul to provide a reference version. Your original code does not initialise C, so the results can not be trusted - but as you don't check you get the correct values for C you can't know this.
I personally dislike !$omp parallel do intensely, I think it a mistake that such short cuts are in OpenMP. Instead separate them into !$omp parallel and !$omp do - it is very important to understand that thread creation and work sharing are different things, convoluting them in one line is not a good way to learn OpenMP.

Unclassifiable OpenMP directive in a Fortran program

I was trying to parallelize a code in Fortran using openMP, with this code:
program pigreco
!----------------------------------------!
use OMP_LIB
implicit none
!----------------------------------------!
integer :: i
integer, parameter :: N = 100000
integer, parameter :: NCPU = 4
real*8 :: t0, t1
real :: h, totale, x, f
!----------------------------------------!
print '(a,2x,i15)', ' Number of intervals: ', N
totale = 0.0
h = 1. / N
call OMP_SET_NUM_THREADS(NCPU)
write(*, '(a,i10)') 'Numero di processori totali: ', NCPU
t0 = OMP_GET_WTIME()
!----------------------------------------!
#ifdef PARALLEL
!
print '(a)', "Scelta la versione parallela."
!
!$OMP PARALLEL DO PRIVATE(x, f) REDUCTION(+:totale)
!
do i = 1, N
x = (i - 0.5) * h
f = (4 * h) / (1 + x**2)
totale = totale + f
enddo
!$OMP END PARALLEL DO
!
#endif
!
t1 = OMP_GET_WTIME()
!
PRINT '(a,2x,f30.25)', ' Computed PI =', totale
PRINT '(a,2x,f30.25)', ' Total computational time =', t1 - t0
!
end program pigreco
When I then try to compile with the line: gfortran prova.F90 -fopenmp -D PARALLEL it gives me an error that says "unclassifiable OpenMP directive at (1)".
The problem is that you defined PARALLEL with the preprocessor, so instead of reading OMP PARALLEL DO, the compiler reads OMP 1 DO, which of course doesn't make sense. Change #ifdef PARALLEL to #ifdef RUNPARALLEL and -DPARALLEL to -DRUNPARALLEL, then the compiler gives no error.
Alternatively, you can use the fact that when compiling with OpenMP support the macro variable _OPENMP is defined automatically, so you could use #ifdef _OPENMP, and no -D flag.

Is there a straightforward way to do concurrent calculations with a random number generator in Fortran?

I have some finite element code I have programmed in Fortran 95 that I have optimised so that I can now get well over 16Mil. elements working under 2GB of memory footprint.
The source function for my code is not smooth so I am using a (stratified) Monte-Carlo method to integrate, which requires a random number generator to select sample locations
I have tried compiling with gfortran-9 using -fopenmp -Ofast -ftree-parallelize-loops=4 but the loop with the random number generator won't go parallel. I tried do concurrent but obviously that didn't work because random_number isn't 'pure'. https://stackoverflow.com/a/32637737/2372254
I also tried blocking my loop but that didn't work either.
Here is the code I am talking about
do k=1,n_els ! total elements is n_els**2. This is block
do i=1+ (k-1)*n_els ,k*n_els
supp_vec = 0
integ_vec = 0.0_wp
! in this subroutine I call random_number
call do_element(ind, n_els, i, num_points_per_strat, &
strat_rows, strat_cols, supp_vec, integ_vec)
do j=1, 4
sc_vec(supp_vec(j) ) = integ_vec(j)
end do
! give some info about progress
if (mod( i , (n_els**2)/10) == 0) print*, i*10/((n_els**2)/10), "% done"
end do
end do
It seems I could write blocks to a file and call n different instances of the routine. I figure there must be a cleaner way to do that. Any tips on how to get that going faster?
I was considering writing a block-worth of points (depending on memory limits) to an array first and supplying that in the subroutine call. Before I try that I thought I would see if anybody had any advice about a better way. It would be good to keep the memory footprint down where possible.
As of version 7 and newer GFortran has a parallel random number generator. When implementing it, here's the OpenMP code I used to verify that the performance indeed scales with increasing numbers of threads (from https://gcc.gnu.org/ml/gcc-patches/2015-12/msg02110.html ):
! Benchmark generating random numbers
! Janne Blomqvist 2015
program randbench
#ifdef _OPENMP
use omp_lib
#endif
implicit none
integer, parameter :: dp=kind(0.d0) ! double precision
integer, parameter :: i64 = selected_int_kind(18) ! At least 64-bit integer
#ifdef _OPENMP
print *, "Using up to ", omp_get_max_threads(), " threads."
#endif
call genr4
call genr8
contains
subroutine genr4
integer, parameter :: n = int(1e7)
real, save :: r(n)
integer :: i
integer(i64) :: t1, t2, td
#ifdef _OPENMP
integer :: blocks, blocksize, l, h
#endif
Print *, "Generate default real random variables"
call system_clock (t1)
!$omp parallel do private(i)
do i = 1, n
call random_number(r(i))
end do
!$omp end parallel do
call system_clock (t2)
td = t2 - t1
print *, "Generating ", n, " default reals individually took ", td, " ticks."
call system_clock (t1)
#ifdef _OPENMP
blocks = omp_get_max_threads()
blocksize = n / blocks
!$omp parallel do private(l,h,i)
do i = 0, blocks - 1
l = i * blocksize + 1
h = l + blocksize - 1
!print *, "Low: ", l, " High: ", h
call random_number(r(l:h))
end do
#else
call random_number(r)
#endif
Call system_clock (t2)
print *, "Generating ", n, " default reals as an array took ", t2-t1, &
" ticks. => ind/arr = ", real(td, dp) / (t2-t1)
end subroutine genr4
subroutine genr8
integer, parameter :: n = int(1e7)
real(dp), save :: r(n)
integer :: i
integer(i64) :: t1, t2, td
#ifdef _OPENMP
integer :: blocks, blocksize, l, h
#endif
print *, "Generate double real random variables"
call system_clock (t1)
!$omp parallel do
do i = 1, n
call random_number(r(i))
end do
call system_clock (t2)
td = t2 - t1
print *, "Generating ", n, " double reals individually took ", td, " ticks."
call system_clock (t1)
#ifdef _OPENMP
blocks = omp_get_max_threads()
blocksize = n / blocks
!$omp parallel do private(l,h,i)
do i = 0, blocks - 1
l = i * blocksize + 1
h = l + blocksize - 1
!print *, "Low: ", l, " High: ", h
call random_number(r(l:h))
end do
#else
call random_number(r)
#endif
call system_clock (t2)
print *, "Generating ", n, " double reals as an array took ", t2-t1, &
" ticks. => ind/arr = ", real(td, dp) / (t2 -t1)
end subroutine genr8
end program