I have recently started to parallelize a serial code I've been developing and was curious if anyone had input on how to properly apply OpenMP for these loops.
F_vol = 0.0_bp
G_vol = 0.0_bp
H_vol = 0.0_bp
RHS_vol = 0.0_bp
!$OMP PARALLEL DO PRIVATE(e,i1,j1,F,G,H)
DO e=1,NE
DO i1=1,Np
DO j1=1,NPTS
CALL Flux(Q(j1,e,:),F,G,H)
F_Vol(i1,e,:) = F_Vol(i1,e,:) + Stuff(j1)*F(:)
G_Vol(i1,e,:) = G_vol(i1,e,:) + Stuff(j1)*G(:)
H_Vol(i1,e,:) = H_vol(i1,e,:) + Stuff(j1)*H(:)
END DO
END DO
END DO
!$OMP END PARALLEL DO
As a note, the arrays F, G, and H are of size 5 and are temporary arrays, and F_Vol, G_Vol, H_Vol are of dimension (NE,Np,5). The part I am unsure about is how to properly parallelize the arrays that I sum over j1=1,NPTS. Given that they are not dependent on each other but vary with i1 and e, I think using PRIVATE() is required so as to avoid overwriting. Lastly, I am focused on these loops because, according to gprof, a good portion of my computational expense is in this area of the code.
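For reference, here is one hedged restructuring sketch (my own illustration, assuming Flux depends only on Q(j1,e,:) and not on i1): hoisting the CALL out of the i1 loop evaluates the flux once per quadrature point instead of Np times, and parallelizing over e means each thread writes only its own slice of F_vol, G_vol and H_vol, so only the temporaries need to be PRIVATE.
!$OMP PARALLEL DO PRIVATE(e,i1,j1,F,G,H)
DO e=1,NE
   DO j1=1,NPTS
      ! the flux at quadrature point j1 of element e does not depend on i1,
      ! so compute it once and reuse it for every i1
      CALL Flux(Q(j1,e,:),F,G,H)
      DO i1=1,Np
         F_vol(i1,e,:) = F_vol(i1,e,:) + Stuff(j1)*F(:)
         G_vol(i1,e,:) = G_vol(i1,e,:) + Stuff(j1)*G(:)
         H_vol(i1,e,:) = H_vol(i1,e,:) + Stuff(j1)*H(:)
      END DO
   END DO
END DO
!$OMP END PARALLEL DO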
I've tried to parallelize a code that contains such a double do-loop. It's not efficient, for sure, but that's not the big problem right now.
The output tauv is NaN. That is the first problem.
The second problem is that the Intel compiler gives a fatal error when the number of threads is less than the maximum number of threads (8 on my machine).
How can I fix these problems?
!$omp parallel do private(i,j, ro11,ro21,ro12,ro22, &
!$omp&                    u11,u21,u12,u22, &
!$omp&                    v11,v21,v12,v22, &
!$omp&                    es11,es21,es12,es22, &
!$omp&                    p11,p21,p12,p22, &
!$omp&                    te11,te21,te12,te22, &
!$omp&                    emu11,emu21,emu12,emu22) &
!$omp&            shared(i1l, i2l, j1l, j2l, emumax, tauv, tauvij, ro, u, v, es)
do i=i1l+2,i2l-2,2
do j=j1l+2,j2l-2,2
if (i.le.niii.and.i.ge.0.and.j.ge.0.and.j.le.nj.or.&
i.le.ni.and.i.ge.niik.and.j.gt.njjv.and.j.le.nj.or.&
i.le.ni.and.i.ge.niik.and.j.ge.0.and.j.lt.njjn&
.or.i.gt.niii.and.i.lt.niik.and.j.gt.njj0+i-niii&
.or.i.gt.niii.and.i.lt.niik.and.j.lt.njj0-i+niii) then
ro11=ro(i-1,j-1)
ro21=ro(i+1,j-1)
ro12=ro(i-1,j+1)
ro22=ro(i+1,j+1)
u11=u(i-1,j-1)
u21=u(i+1,j-1)
u12=u(i-1,j+1)
u22=u(i+1,j+1)
v11=v(i-1,j-1)
v21=v(i+1,j-1)
v12=v(i-1,j+1)
v22=v(i+1,j+1)
es11=es(i-1,j-1)
es21=es(i+1,j-1)
es12=es(i-1,j+1)
es22=es(i+1,j+1)
p11=(es11-0.5*ro11*(u11*u11+v11*v11))*ga1
p21=(es21-0.5*ro21*(u21*u21+v21*v21))*ga1
p12=(es12-0.5*ro12*(u12*u12+v12*v12))*ga1
p22=(es22-0.5*ro22*(u22*u22+v22*v22))*ga1
te11=p11/ro11
te21=p21/ro21
te12=p12/ro12
te22=p22/ro22
emu11=te11**1.5*(1.0+s1)/(te11+s1)
emu21=te21**1.5*(1.0+s1)/(te21+s1)
emu12=te12**1.5*(1.0+s1)/(te12+s1)
emu22=te22**1.5*(1.0+s1)/(te22+s1)
emumax=emu11
if (emu21.gt.emumax) then
emumax=emu21
end if
if (emu12.gt.emumax) then
emumax=emu12
end if
if (emu22.gt.emumax) then
emumax=emu22
end if
tauvij=re*flkv*hx*hx/emumax
if (tauvij .le. tauv) then
tauv=tauvij
endif
endif
enddo
enddo
!$omp end parallel do
The thing is that it executes without error, but the OpenMP do-loop runs more slowly than the sequential one...
From your reproducible example:
1.) Your code is only using 1 thread (?) in OpenMP region:
! Set number of threads
nthreads = 1
call omp_set_num_threads(nthreads)
print *, 'The number of threads are used is ', omp_get_max_threads ( )
I would avoid the call to omp_set_num_threads(). Instead, specify the number of threads with the environment variable OMP_NUM_THREADS. For a Unix machine: export OMP_NUM_THREADS=<number of threads>
2.) In your "reproducible" example, the parallelized loop (line 312) is missing private/shared declarations? From what you wrote above, fix to:
!$omp parallel do default(private) shared(i1l, i2l, j1l, j2l, emumax, tauv, tauvij, ro, u, v, es)
With all of the above, the result I get from my machine (4c/4t) using GNU Fortran compiler is:
...
Executed time in SEQ code is 60.2720146
...
Executed time in OMP code is 27.1342430
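A further hedged sketch, beyond what the answer above suggests: the per-point temporaries (the ro/u/v/es/p/te/emu values, emumax and tauvij) would normally be private rather than shared, and the running minimum tauv can be handled with a reduction clause, which removes any race on it. Something along these lines (the loop body is unchanged from the question):
!$omp parallel do private(i, j, ro11, ro21, ro12, ro22, u11, u21, u12, u22, &
!$omp&                    v11, v21, v12, v22, es11, es21, es12, es22,       &
!$omp&                    p11, p21, p12, p22, te11, te21, te12, te22,       &
!$omp&                    emu11, emu21, emu12, emu22, emumax, tauvij)       &
!$omp&            shared(ro, u, v, es) reduction(min:tauv)
do i = i1l+2, i2l-2, 2
   do j = j1l+2, j2l-2, 2
      ! ... same body as in the question ...
   end do
end do
!$omp end parallel do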
I am converting f77 code to f90 code, and part of the code needs to sum over the elements of a 3D matrix. In f77 this was accomplished with 3 nested loops (over the outer, middle, and inner indices). I decided to use the f90 intrinsic SUM (applied 3 times) to accomplish this, and much to my surprise the answers differ. I am using the ifort compiler, with debugging and bounds checking turned on and optimization turned off.
Here is the f77-style code
r1 = 0.0
do k=1,nz
do j=1,ny
do i=1,nx
r1 = r1 + foo(i,j,k)
end do
end do
end do
and here is the f90 code
r = SUM(SUM(SUM(foo, DIM=3), DIM=2), DIM=1)
I have tried all sorts of variations, such as swapping the order of the loops for the f77 code, or creating temporary 2D matrices and 1D arrays to "reduce" the dimensions while using SUM, but the explicit f77 style loops always give different answers from the f90+ SUM function.
I'd appreciate any suggestions that help understand the discrepancy.
By the way this is using one serial processor.
Edited 12:13 pm to show complete example
! ifort -check bounds -extend-source 132 -g -traceback -debug inline-debug-info -mkl -o verify verify.f90
! ./verify
program verify
implicit none
integer :: nx,ny,nz
parameter(nx=131,ny=131,nz=131)
integer :: i,j,k
real :: foo(nx,ny,nz)
real :: r0,r1,r2
real :: s0,s1,s2
real :: r2Dfooxy(nx,ny),r1Dfoox(nx)
call random_seed
call random_number(foo)
r0 = 0.0
do k=1,nz
do j=1,ny
do i=1,nx
r0 = r0 + foo(i,j,k)
end do
end do
end do
r1 = 0.0
do i=1,nx
do j=1,ny
do k=1,nz
r1 = r1 + foo(i,j,k)
end do
end do
end do
r2 = 0.0
do j=1,ny
do i=1,nx
do k=1,nz
r2 = r2 + foo(i,j,k)
end do
end do
end do
!*************************
s0 = 0.0
s0 = SUM(SUM(SUM(foo, DIM=3), DIM=2), DIM=1)
s1 = 0.0
r2Dfooxy = SUM(foo, DIM = 3)
r1Dfoox = SUM(r2Dfooxy, DIM = 2)
s1 = SUM(r1Dfoox)
s2 = SUM(foo)
!*************************
print *,'nx,ny,nz = ',nx,ny,nz
print *,'size(foo) = ',size(foo)
write(*,'(A,4(ES15.8))') 'r0,r1,r2 = ',r0,r1,r2
write(*,'(A,3(ES15.8))') 'r0-r1,r0-r2,r1-r2 = ',r0-r1,r0-r2,r1-r2
write(*,'(A,4(ES15.8))') 's0,s1,s2 = ',s0,s1,s2
write(*,'(A,3(ES15.8))') 's0-s1,s0-s2,s1-s2 = ',s0-s1,s0-s2,s1-s2
write(*,'(A,3(ES15.8))') 'r0-s1,r1-s1,r2-s1 = ',r0-s1,r1-s1,r2-s1
stop
end
!**********************************************
sample output
nx,ny,nz = 131 131 131
size(foo) = 2248091
r0,r1,r2 = 1.12398225E+06 1.12399525E+06 1.12397238E+06
r0-r1,r0-r2,r1-r2 = -1.30000000E+01 9.87500000E+00 2.28750000E+01
s0,s1,s2 = 1.12397975E+06 1.12397975E+06 1.12398225E+06
s0-s1,s0-s2,s1-s2 = 0.00000000E+00-2.50000000E+00-2.50000000E+00
r0-s1,r1-s1,r2-s1 = 2.50000000E+00 1.55000000E+01-7.37500000E+00
First, welcome to Stack Overflow. Please take the tour! There is a reason we expect a Minimal, Complete, and Verifiable example: without one we can only look at your code and guess at what might be the case, which is not very helpful for the community.
I hope the following suggestions help you figure out what is going on.
Use the size() function to print what Fortran thinks the sizes of the dimensions are, as well as printing nx, ny, and nz. For all we know, the array is declared bigger than nx, ny, and nz, with those variables set according to the data set. Fortran does not necessarily initialize arrays to zero, depending on whether the array is static or allocatable.
You can also try specifying array extents in the sum function:
r = Sum(foo(1:nx,1:ny,1:nz))
If done like this, at least we know that the sum function is working on the exact same slice of foo that the loops loop over.
If this is the case, you will get the wrong answer even though there is nothing 'wrong' with the code. This is why it is particularly important to give that Minimal, Complete, and Verifiable example.
I can see the differences now. These are typical rounding errors from adding small numbers to a large sum. The processor is allowed to use any order of the summation it wants. There is no "right" order. You cannot really say that the original loops make the "correct" answer and the others do not.
What you can do is to use double precision. In extreme circumstances there are tricks like the Kahan summation but one rarely needs that.
Addition of a small number to a large sum is imprecise and especially so in single precision. You still have four significant digits in your result.
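In case compensated summation is ever needed, here is a minimal Kahan-summation sketch (my own illustration, using the same foo, nx, ny, nz as the program above; note that aggressive floating-point optimization can remove the compensation, so an option such as ifort's -fp-model precise may be required):
real :: s, c, y, t
s = 0.0
c = 0.0
do k = 1, nz
   do j = 1, ny
      do i = 1, nx
         y = foo(i,j,k) - c   ! re-inject the error from the previous addition
         t = s + y
         c = (t - s) - y      ! low-order bits lost in t = s + y
         s = t
      end do
   end do
end do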
One typically does not use the DIM= argument; it is intended for certain special circumstances.
If you want to sum all elements of foo, use just
s0 = SUM(foo)
That is enough.
What
s0 = SUM(SUM(SUM(foo, DIM=3), DIM=2), DIM=1)
does is make a temporary 2D array in which each element is the sum of the respective row in the z dimension, then a 1D array in which each element is the sum over the last dimension of that 2D array, and then finally the sum of that 1D array. If it is done well, the final result will be the same, but it will eat a lot of CPU cycles.
The sum intrinsic function returns a processor-dependent approximation to the sum of the elements of the array argument. This is not the same thing as sequentially adding all the elements.
It is simple to find an array x where
summation = x(1) + x(2) + x(3)
(performed strictly left to right) is not the best approximation for the sum treating the values as "mathematical reals" rather than floating point numbers.
As a concrete example to look at the nature of the approximation with ifort, we can look at the following program. We don't even need to enable optimizations to see the effect: the importance of the order of summation is apparent even with optimizations disabled (with -O0 or -debug).
implicit none
integer i
real x(50)
real total
x = [1.,(EPSILON(0.)/2, i=1, SIZE(x)-1)]
total = 0
do i=1, SIZE(x)
total = total+x(i)
print '(4F17.14)', total, SUM(x(:i)), SUM(DBLE(x(:i))), REAL(SUM(DBLE(x(:i))))
end do
end program
Adding up in strict order, we get 1., since adding anything smaller in magnitude than epsilon(0.) to 1. doesn't change the sum.
You can experiment with the size of the array and order of its elements, the scaling of the small numbers and the ifort floating point compilation options (such as -fp-model strict, -mieee-fp, -pc32). You can also try to find an example like the above using double precision instead of default real.
Is it possible to assign a variable once and only once at the start of a loop iteration via some kind of pattern in Fortran?
Something similar to the pseudocode below. What I have now is an IF statement that checks whether a flag is set; if it is, the variable is set and the flag is then unset. Since my program is parallel, I would like to avoid conditional branching in my innermost loop. From what I have seen this is possible in C++, but I wonder if I can achieve the same thing in Fortran somehow.
What I am looking for:
!$OMP DO PRIVATE(variable)
DO i = 0, N   ! N = 100000
<Set Variable to a fixed value once and only once at the start of the iteration>
<CODE>
END DO
!$OMP END DO
What I have
!$OMP DO PRIVATE(variable)
DO i = 1, N   ! N = 100000
IF (FLAG_IS_SET) THEN
<Set variable>
<UNSET_THE_FLAG>
END IF
<CODE>
END DO
!$OMP END DO
A simple solution is to manually work-share the iterations across the threads. Assuming that 100000 is a multiple of n_threads:
!$OMP PARALLEL PRIVATE(variable, i_thread, n_threads, chunk_size, i_start)
i_thread = omp_get_thread_num()
n_threads = omp_get_num_threads()
chunk_size = 100000 / n_threads
i_start = i_thread * chunk_size + 1
<Set variable>
DO i = i_start, i_start + chunk_size - 1
<CODE>
END DO
!$OMP END PARALLEL
It avoids the conditional branching in the innermost loop and it is probably close to what the compilers do with the static scheduling.
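An alternative hedged sketch (not part of the answer above) keeps the standard work-sharing DO and simply splits the PARALLEL and DO constructs, so the assignment runs exactly once per thread before the loop:
!$OMP PARALLEL PRIVATE(variable)
<Set variable>   ! executed once per thread, before the shared loop
!$OMP DO
DO i = 1, N
<CODE>
END DO
!$OMP END DO
!$OMP END PARALLEL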
It is unclear to me what you are trying to do.
If you want to do what you say you do ("assign a variable once and only once at the start of a loop iteration") that falls out completely naturally as
!$OMP DO PRIVATE(variable)
DO i = 0, N   ! N = 100000
variable = f(i) ! Assign value to the local variable at the start of each iteration
<CODE>
END DO
!$OMP END DO
Since variable is thread private, there is a copy in each thread and you assign each instance of the variable once.
But this seems so easy that maybe I don't understand what you're really trying to do!
I'm trying to pass 3D arrays to all other processes (in FORTRAN 77) using MPI_Bcast. v1 is a common-block array. I'm also not sure whether I need to broadcast the calculated values of the common array v1 to all other processes, or whether they will be updated in every process anyway because the array is common. The following is the relevant piece of code:
parameter (nprocz=48,nzro=1)
do i=i101,i102
dist = 0.015*float(i-iv0)
adamp = exp(-dist*dist)
do j = je0, je1-1
do k = ke0, ke1
v1(k,j,i) = v1(k,j,i)*adamp
end do
end do
end do
nmpi01=floor((iv0-ie0-nzro)/(nprocz-1))
if (mpirank .le. nprocz-2) then
i101=ie0+(mpirank*nmpi01)
i102=ie0+(mpirank+1)*nmpi01-1
else
i101=ie0+(mpirank*nmpi01)
i102=iv0-1
endif
MPI_Bcast(v1(:,:,i101:i102),(ke1-ke0+1)*(je1-je0)*(i102-i101+1)
& ,MPI_FLOAT,mpirank,MPI_COMM_WORLD,ierr01)
I get the error message:
PGFTN-S-0081-Matrix/vector v1 illegal as subprogram argument
The sizes of the arrays being passed in are correct. Any comment?
I corrected the code: I now loop over the ranks and compute all elements of rcount and displs in each rank:
integer :: myscount, myi101
do rank = 0, nprocz-1
nmpi01=floor((iv0-ie0-nzro)/(nprocz-1))
if (rank .le. nprocz-2) then
i101=ie0+(rank*nmpi01)
i102=ie0+(rank+1)*nmpi01-1
else
i101=ie0+(rank*nmpi01)
i102=iv0-1
endif
scount=(i102-i101+1)*(je1-je0)*(ke1-ke0+1)
rcount(rank+1)=scount
displs(rank+1)=rank*scount+1
if (rank .eq. mpirank) then
myscount = scount
myi101 = i101
end if
end do
scount = myscount
i101 = myi101
call mpi_allgatherv(...)
But I still get wrong results.
1) In my case, the results of each part are used in the next part, especially after mpi_allgatherv. Do I need to add an mpi_barrier after each mpi_allgatherv?
2) Should mpi_in_place be used? Consider that I have only one 3D array v1, in which each sub-array v1(1,1,i) is calculated by some process, and I want to put the calculated sub-array into the appropriate part of the same array.
3) I guess I should have displs(i) = sum(rcount(1:i-1))+1 for i >= 2, considering that displs(1)=1 always in Fortran 77. So I corrected it to this: before the loop displs(1)=1, inside the loop displs(rank+2)=rank*scount+1, and after the loop displs(nprocz+1)=0. Am I right?
As I recall, Fortran 77 was more restrictive about array subscripts than Fortran 90, and pgftn is a Fortran 77 compiler. I would try passing v1(1,1,i101) to mpi_bcast, not v1(:,:,i101:i102). (Or use pgf95 with the "-Mfixed" flag.)
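A hedged sketch of that suggestion (assuming the slab v1(:,:,i101:i102) occupies contiguous storage starting at v1(1,1,i101), and using MPI_REAL, the Fortran datatype, rather than the C name MPI_FLOAT):
      call MPI_BCAST(v1(1,1,i101),
     &               (ke1-ke0+1)*(je1-je0)*(i102-i101+1),
     &               MPI_REAL, mpirank, MPI_COMM_WORLD, ierr01)
Here mpirank is used as the root, i.e. each rank broadcasts the slab it computed, matching the original snippet.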
If each process needs to see v1, then you do need to communicate it using MPI. No variable is shared between MPI tasks, not even those in a common block. However, if every process is calculating a different part of v1, so every process needs a piece from every other process, you can't use mpi_bcast to do that; use mpi_allgather instead.
Also, as noted above, when you use MPI procedures, you should call them, because they are subroutines.
So reserve (as in C++'s std::vector::reserve) is quite useful when you have a rough idea of your size requirements. Does anyone know of a similar method to pre-allocate arrays in MATLAB?
I'm not really interested in hacky (but effective) methods like the following:
x = zeros(1000,1);
for i = 1:10000
if i > numel(x)
x = [x;zeros(size(x))];
end
x(i) = rand;
end
x(i+1:end) = [];
The "hacky" way is the only way to do it. However, you do not need to check i <= numel(x). The array will be expanded automatically (but without array doubling):
x = zeros(1000,1);
for i = 1:10000
x(i) = rand;
end
x(i+1:end) = [];
EDIT: To keep it simple while still retaining the array doubling, you can write a class, or simply a few helper functions (below).
EDIT2: The usage of helper functions will slow things down compared to the manual hack. In MATLAB 2010 it is still much faster than naive growth. In MATLAB 2011 the naive approach is actually faster, suggesting that this version has smarter allocation. Perhaps it is fast enough so that no hack is needed at all. Thanks to Andrew Janke for pointing this out.
function listtest(n)
if nargin < 1, n = 10000; end
l = new_list();
for i=1:n
l = list_append(l, i);
end
a = list_to_array(l);
end
function l = new_list()
l = [0 0];
end
function l = list_append(l, e)
if l(1)+1 == length(l)
l(length(l)*2) = 0;
end
l(1) = l(1)+1;
l(l(1)+1) = e;
end
function a = list_to_array(l)
a = l(2:1+l(1));
end
EDIT (from AndrewJanke)
Here's code to compare the speed of the implementations.
function manual_reserve_example(n)
x = zeros(1000,1);
for i = 1:n
if i > numel(x)
x = [x;zeros(size(x))];
end
x(i) = i;
end
x(i+1:end) = [];
end
function naive_growth(n)
x = 0;
for i = 1:n
x(i) = i;
end
end
function compare_them(n)
fprintf('Doing %d elements in Matlab R%s\n', n, version('-release'));
tic;
naive_growth(n);
fprintf('%30s %.6f sec\n', 'naive_growth', toc);
tic;
manual_reserve_example(n);
fprintf('%30s %.6f sec\n', 'manual_reserve', toc);
tic;
listtest(n);
fprintf('%30s %.6f sec\n', 'listtest', toc);
end
The cleanest solution to the example that you provided is to iterate backwards.
for i = 10000:-1:1
x(i) = rand;
end
This does not work in cases where the end size is actually unknown, but it has come in handy for me more often than I would have expected.
Otherwise I usually implement a "double on overflow" algorithm like you show in the original question.
The clean solution is to wrap a MATLAB class around a respectable vector resize algorithm, and then use that class. I am not aware of any reason such a class could not be built, but I've never actually sat down and tried to implement one. (I'm curious whether an example already exists on the File Exchange somewhere.)
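For illustration only, a minimal sketch of what such a wrapper could look like (GrowableVector and its methods are made-up names for this example, not an existing File Exchange class):
classdef GrowableVector < handle
    properties (Access = private)
        data = zeros(16,1);   % backing storage, doubled on overflow
        n = 0;                % number of elements actually in use
    end
    methods
        function append(obj, value)
            if obj.n == numel(obj.data)
                obj.data(2*numel(obj.data)) = 0;   % double the capacity
            end
            obj.n = obj.n + 1;
            obj.data(obj.n) = value;
        end
        function a = toArray(obj)
            a = obj.data(1:obj.n);   % return only the filled part
        end
    end
end
Usage would then be along the lines of v = GrowableVector(); v.append(rand); ...; x = v.toArray(); instead of the raw helper functions.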
There is a way to preallocate memory for a structure in MATLAB 7.6 (R2008a) using the STRUCT and REPMAT commands.
EXAMPLE 1: A structure with two fields
s.field1
s.field2
s = struct('field1',cell(1),'field2',cell(1));
EXAMPLE 2: A structure with a field with a subfield
s.field1.subfield
s = struct('field1',struct('subfield',cell(1)));
EXAMPLE 3: An array of structures
v(1).field1
...
v(100).field1
s = struct('field1',cell(1));
v = repmat(s,100,1);