Using OpenMP critical and ordered - Fortran

I'm quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.
The problem seems to be the reductions. Using OpenMP reductions works and gives the correct results, but it is not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for the test). Therefore I put the reductions inside a CRITICAL section instead, but without success. Interestingly, the problem only occurs for reals, not integers. I have thought about whether the order of the additions makes any difference, but it should not produce errors this big.
Just to check, I put everything inside the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results either. My understanding is that OpenMP flushes the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.
So my question is: why doesn't a critical section work in this case?
My code is below. All shared variables except np, tm, hm, gam are read-only.
EDIT: I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e. generate a pair i,j in the range of the loops; if they have already been "visited", generate new ones), and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct by coincidence. How embarrassing!
TL;DR: The discrepancies in the results were caused by the ordering of the floating point values. Using double precision instead helps.
!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP& dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
DO i=1,nd
   !$OMP CRITICAL (main)
   ! Second loop over the data:
   DO j=1,nd
      ! The lag:
      zdis = z(j) - z(i)
      IF(zdis >= 0.0) THEN
         izl =  INT( zdis/dzlag+0.5)
      ELSE
         izl = -INT(-zdis/dzlag+0.5)
      END IF
      ! ---- SNIP ----
      ! Loop over all variograms for this lag:
      DO cur_variogram=1,nvarg
         variogram_type = ivtype(cur_variogram)
         ! Get the head and tail values:
         indx = i+(ivhead(cur_variogram)-1)*maxdim
         vrh  = vr(indx)
         indx = j+(ivtail(cur_variogram)-1)*maxdim
         vrt  = vr(indx)
         IF(vrh < tmin .OR. vrh >= tmax .OR. vrt < tmin .OR. vrt >= tmax) CYCLE
         ! ----- PROBLEM AREA -------
         np(ixl,iyl,izl,1)  = np(ixl,iyl,izl,1)  + 1.  ! <-- This never fails
         tm(ixl,iyl,izl,1)  = tm(ixl,iyl,izl,1)  + vrt
         hm(ixl,iyl,izl,1)  = hm(ixl,iyl,izl,1)  + vrh
         gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
         ! ----- END OF PROBLEM AREA -----
         !CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
      END DO
   END DO
   !$OMP END CRITICAL (main)
END DO
!$OMP END DO
!$OMP END PARALLEL
Thanks very much in advance!

If you are using 32-bit floating-point numbers and arithmetic, the difference between 84.26539 and 84.26538, that is, a difference of 1 in the least significant digit, is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit floating-point number only has about 7 decimal digits to play with.
Ordinary floating-point arithmetic is not strictly associative. For real numbers (in the mathematical, not the Fortran, sense) (a+b)+c == a+(b+c), but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.
The non-determinism arises because, in using OpenMP, you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that two executions of the same OpenMP program will produce results that are identical to the last bit.
I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time, the compiler will have to create a parallelised executable whether it is eventually run in parallel or not.

High Performance Mark makes an interesting point about floating point and associativity. This can easily be tested (in C).
#include <stdio.h>
int main(void) {
    float a = -1.0E8f, b = 1.0E8f, c = 1.23456f;
    printf("sum %f\n", a+b+c);   //output 1.234560
    printf("sum %f\n", a+(b+c)); //output 0.000000
}
But I would like to point out that it is possible to preserve the order in OpenMP. I discussed this here: C++ OpenMP: Split for loop in even chunks static and join data at the end.
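That approach does not make a floating-point sum bit-identical to the sequential one, but it does make it reproducible from run to run for a fixed thread count. Here is a rough Fortran sketch of the idea (my own illustration, not the code from the linked post): each thread accumulates a partial sum into its own slot of a shared array, and the slots are combined in a fixed order afterwards.
! With SCHEDULE(STATIC) each thread always gets the same contiguous chunk,
! and the per-thread partial sums are combined in thread order, so the result
! is reproducible for a given thread count (though still not bit-identical
! to the purely sequential sum). Assumes at most 256 threads.
program reproducible_sum
   use omp_lib
   implicit none
   integer, parameter :: n = 1000000, maxthreads = 256
   real(8) :: x(n), partial(0:maxthreads-1), total
   integer :: i, tid, nthreads

   do i = 1, n
      x(i) = sin(real(i,8))          ! deterministic data with mixed signs
   end do
   partial = 0.0d0
   nthreads = 1

   !$omp parallel private(tid)
   tid = omp_get_thread_num()
   !$omp single
   nthreads = omp_get_num_threads()
   !$omp end single
   !$omp do schedule(static)
   do i = 1, n
      partial(tid) = partial(tid) + x(i)   ! each thread touches only its own slot
   end do
   !$omp end do
   !$omp end parallel

   total = 0.0d0
   do i = 0, nthreads - 1                  ! combine in a fixed order
      total = total + partial(i)
   end do
   print *, total
end program reproducible_sum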
Edit:
Actually, I confused commutativity and associativity. If you have an operator which is associative but not commutative, then it's possible to preserve the order with OpenMP as I did in the post above. However, IEEE floating point is commutative but NOT associative, so the only thing that can be done is to break IEEE and let it be associative.

Related

Parallel simulation gives different results after some time steps when compared with serial and additional parallel runs

I am trying to run a vortex simulation code in parallel using OpenMP. It is similar to a particle simulation where, at each time step, the position of a vortex at the next time step has to be computed from its velocity, which is determined by the positions of all the other vortices at the current time step. The vortices are deleted once they leave the domain. I compare the number of vortices at each time step between the parallel and serial versions of the code, and run each version multiple times.
For the serial versions, vortex counts match exactly at every time step. For the parallel case, all the runs match the serial case for a few tens of time steps, after which each parallel run shows a difference but remains within a 7-10% error bound of the serial case (as can be seen in the result link below). I know that this may be because of round-off errors in the parallel case owing to the difference in the order of computational steps due to distribution among the different threads, but should the error really be as high as 10%?
I have only used the reduction clause in a parallel do construct. The only parallel region in the whole code is within a function vblob(), which is inside a module that I call from a main code. The functions ixi() and fxi() called within vblob() are outside this module.
function vblob(blobs,xj,gj)
   complex(8), intent(in) :: blobs(:,:), xj
   complex(8) :: delxi, delxic, di, gvic, xi
   real(8), intent(in) :: gj
   real(8) :: vblob(2)
   integer :: p
   gvic = 0.0; delxi = 0.0; delxic = 0.0; di = 0.0; xi = 0.0
   !$omp parallel do private(xi,delxic,delxi,di) shared(xj) reduction(+:gvic)
   do p = 1, size(blobs,1)
      xi = ixi(blobs(p,1))
      delxic = xj-conjg(xi)
      delxi = xj-xi
      di = del*fxi(xi)
      gvic = gvic + real(blobs(p,2))*1/delxic
      if (abs(delxi) .gt. 1.E-4) then
         gvic = gvic + (-1)*real(blobs(p,2))*1/delxi
      end if
   end do
   !$omp end parallel do
   gvic = j*gvic*fxi(xj)/(2*pi)
   vblob(1) = real(gvic)
   vblob(2) = -imag(gvic)
end function vblob
If the way I have constructed the parallel code is wrong, then errors should show up from the first few time steps themselves, right?
(As can be seen in the result, the 'blobs' and 'sheets' are just types of vortex elements; the blue line is the total number of elements. P and S stand for parallel and serial respectively, and R stands for runs. The solid plot markers are the serial code and the hollow ones are the three runs of the parallel code.)
EDIT: When I change the numerical precision of my variables to real(4) instead, the divergence in results happens at an earlier time step than in the real(8) case above. So it's clearly a round-off error issue.
TLDR: Has anyone else seen such a result over a range of time steps, where the parallel code matches the serial code for the first few time steps and then diverges?
Your code essentially sums up a lot of terms in gvic. Floating-point arithmetic is not associative, that is, (a+b)+c is not the same as a+(b+c) due to rounding. Also, depending on the values and the signs on the terms, there may be a serious loss of precision in each operation. See here for a really mandatory read on the subject.
While the sequential loop computes (given no clever compiler optimisations):
gvic = (...((((g_1 + g_2) + g_3) + g_4) + g_5) + ...)
where g_i is the value added to gvic by iteration i, the parallel version computes:
gvic = t_0 + t_1 + t_2 + ... + t_(#threads-1)
where t_i is the accumulated private value of gvic in thread i (threads in OpenMP are numbered from 0, even in Fortran). The order in which the different values t_i are reduced is unspecified; the OpenMP implementation is free to choose whatever it deems fine. Even if all t_i are summed in order, the result will still differ from the one computed by the sequential loop. Unstable numerical algorithms are exceptionally prone to producing different results when parallelised.
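As a quick illustration (my own sketch, not your code), the following sums the same deterministic data once sequentially and once with reduction(+:...); the two results typically agree only to within rounding, and the parallel value can also change with the number of threads.
! Sketch: the same data summed sequentially and via an OpenMP reduction can
! differ in the last bits because the bracketing of the sum changes.
program sum_compare
   implicit none
   integer, parameter :: n = 1000000
   real(8) :: x(n), s_serial, s_parallel
   integer :: i

   do i = 1, n
      x(i) = sin(real(i,8)) * 1.0d3   ! deterministic data with mixed signs
   end do

   s_serial = 0.0d0
   do i = 1, n
      s_serial = s_serial + x(i)
   end do

   s_parallel = 0.0d0
   !$omp parallel do reduction(+:s_parallel)
   do i = 1, n
      s_parallel = s_parallel + x(i)
   end do
   !$omp end parallel do

   print *, 's_serial   =', s_serial
   print *, 's_parallel =', s_parallel
   print *, 'difference =', s_serial - s_parallel
end program sum_compare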
This is something you can hardly avoid completely, but instead learn to control or simply live with its consequences. In many cases, the numerical solution to a problem is an approximation anyway. You should focus on conserved or statistical properties. For example, an ergodic molecular dynamics simulation may produce a completely different phase trajectory in parallel, but values such as the total energy or the thermodynamic averages will be pretty close (unless there is some serious algorithmic error or really bad numerical instability).
A side note - you are actually lucky to enter this field now, when most CPUs use standard 32- and 64-bit floating-point arithmetic. Years ago, when x87 was a thing, floating-point operations were done with 80-bit internal precision and the end result would depend on how many times a value leaves and re-enters the FPU registers.

Why isn't this random number generator thread-safe?

I was using the rand() function to generate pseudo-random numbers between 0 and 1 for a simulation, but when I decided to make my C++ code run in parallel (via OpenMP) I noticed that rand() isn't thread-safe and is also not very uniform.
So I switched to using a (so-called) more uniform generator presented in many answers to other questions, which looks like this:
double rnd(const double & min, const double & max) {
    static thread_local mt19937* generator = nullptr;
    if (!generator) generator = new mt19937(clock() + omp_get_thread_num());
    uniform_real_distribution<double> distribution(min, max);
    return fabs(distribution(*generator));
}
But I saw many scientific errors in the original problem I was simulating, results that contradicted both the results obtained with rand() and common sense.
So I wrote a code that generates 500k random numbers with this function, calculates their average, does this 200 times, and plots the results.
double SUM=0;
for(r=0; r<=10; r+=0.05){
    #pragma omp parallel for ordered schedule(static)
    for(w=1; w<=500000; w++){
        double a;
        a=rnd(0,1);
        SUM=SUM+a;
    }
    SUM=SUM/w_max;
    ft<<r<<'\t'<<SUM<<'\n';
    SUM=0;
}
We know that if, instead of 500k, I could do this infinitely many times, the result should be a simple flat line at 0.5. But with 500k we will have fluctuations around 0.5.
When running the code with a single thread, the result is acceptable:
But here is the result with 2 threads:
3 threads:
4 threads:
I do not have my 8-threaded CPU right now, but the results were even worse there.
As you can see, they are not uniform and fluctuate strongly around their average.
So is this pseudo-random generator thread-unsafe too?
Or am I making a mistake somewhere?
There are three observations about your test output I would make:
It has much stronger variance than a good random source's average should provide. You observed this yourself by comparing to the single thread results.
The calculated average decreases with thread count and never reaches the original 0.5 (i.e. it's not just higher variance but also changed mean).
There is a temporal relation in the data, particularly visible in the 4 thread case.
All this is explained by the race condition present in your code: You assign to SUM from multiple threads. Incrementing a double is not an atomic operation (even on x86, where you'll probably get atomic reads and writes on registers). Two threads may read the current value (e.g. 10), increment it (e.g. both add 0.5) and then write the value back to memory. Now you have two threads writing a 10.5 instead of the correct 11.
The more threads try to write to SUM concurrently (without synchronization), the more of their changes are lost. This explains all observations:
How hard the threads race each other in each individual run determines how many results are lost, and this can vary from run to run.
The average is lower with more races (for example more threads) because more results are lost. You can never exceed the statistical 0.5 average because you only ever lose writes.
As the threads and scheduler "settle in", the variance is reduced. This is a similar reason to why you should "warm up" your tests when benchmarking.
Needless to say, this is undefined behavior. It just shows benign behavior on x86 CPUs, but this is not something the C++ standard guarantees you. For all you know, the individual bytes of a double might be written to by different threads at the same time resulting in complete garbage.
The proper solution would be adding the doubles thread-locally and then (with synchronization) adding them together in the end. OMP has reduction clauses for this specific purpose.
For integral types, you could use std::atomic<IntegralType>::fetch_add(). std::atomic<double> exists but (before C++20) the mentioned function (and others) are only available for integral types.
The problem is not in your RNG, but in your summation. There is simply a race condition on SUM. To fix this, use a reduction, e.g. change the pragma to:
#pragma omp parallel for ordered schedule(static) reduction(+:SUM)
Note that using thread_local with OpenMP is technically not defined behavior. It will probably work in practice, but the interaction between OpenMP and C++11 threading concepts is not well defined (see also this question). So the safe OpenMP alternative for you would be:
static mt19937* generator = nullptr;
#pragma omp threadprivate(generator)

Calling subroutine in parallel environment

I think my problem is related or even identical to the problem described here. But I don't understand what's actually happening.
I'm using OpenMP with the gfortran compiler and I have the following task to do: I have a density distribution F(X, Y) on a two-dimensional surface with x-coordinates X and y-coordinates Y. The matrix F has size Nx x Ny.
I now have a set of coordinates Xp(i) and Yp(i) and I need to interpolate the density F onto these points. This problem is made for parallelization.
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
   ! Some stuff to be done here
   Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
   ! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
Everything is shared except for i. The function interp2d is doing some simple linear interpolation.
That works fine with one thread but fails with multithreading. I traced the problem down to the hunt-subroutine taken from Numerical Recipes, which gets called by interp2d. The hunt-subroutine basically calculates the index ix such that X(ix) <= Xp(i) < X(ix+1). This is needed to get the starting point for the interpolation.
With multithreading it happens every now and then that one thread gets the correct index ix from hunt, and the thread that calls hunt next gets the exact same index, even though Xp(i) is not even close to that point.
I can prevent this by using the CRITICAL environment:
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
   ! Some stuff to be done here
   !$OMP CRITICAL
   Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
   !$OMP END CRITICAL
   ! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
But this decreases the efficiency. If I use, for example, three threads, I get a load average of 1.5 with the CRITICAL environment. Without it I get a load average of 2.75, but wrong results and sometimes even a SIGSEGV runtime error.
What exactly is happening here? It seems to me that all the threads are calling the same hunt-subroutine and if they do it at the same time there is a conflict. Does that make sense?
How can I prevent this?
Combining variable declaration and initialisation in Fortran 90+ has the side effect of giving the variable the SAVE attribute.
integer :: i = 0
is roughly equivalent to:
integer, save :: i
if (first_invocation) then
   i = 0
end if
SAVE'd variables retain their value between multiple invocations of the routine and are therefore often implemented as static variables. By the rules governing the implicit data sharing classes in OpenMP, such variables are shared unless listed in a threadprivate directive.
OpenMP mandates that compliant compilers should apply the above semantics even when the underlying language is Fortran 77.
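A minimal sketch of the two usual fixes (the routine and variable names here are hypothetical, not the actual Numerical Recipes code): either initialise the local variable in an executable statement so it gets a fresh copy on every call, or keep the saved state but mark it threadprivate.
! Hypothetical search routine illustrating the fix; xx is assumed ascending, n >= 2.
subroutine find_interval(xx, n, x, ix)
   implicit none
   integer, intent(in)  :: n
   real,    intent(in)  :: xx(n), x
   integer, intent(out) :: ix
   ! Thread-unsafe: "integer :: j = 1" would imply SAVE and be shared by all threads.
   ! Fix 1: declare, then initialise in an executable statement (fresh j per call).
   integer :: j
   ! Fix 2 (only if per-thread saved state is really wanted):
   !   integer, save :: j = 1
   !   !$omp threadprivate(j)
   j = 1
   do while (j + 1 < n)
      if (xx(j+1) > x) exit
      j = j + 1
   end do
   ix = j                        ! xx(ix) <= x < xx(ix+1) for x inside the table
end subroutine find_interval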

Replace do loop with array notation in Fortran

I would like to replace the following do loop with Fortran's intrinsic functions and array notation.
do i=2, n
   do j=2, n
      a=b(j)-b(j-1)
      c(i,j)=a*c(i-1,j)+d(i,j)
   end do
end do
However, as c(i,j) depends on c(i-1,j), none of the following attempts worked, because they do not update c(i,j):
!FORALL(i = 2:n , j = 2:n ) c(i,j)=c(i-1,j)*(b(j)-b(j-1))+d(i,j)
!FORALL(i = 2:n) c(i,2:n)=c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n)
!c(2:n,2:n)=RESHAPE( (/(c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n),i=2,n)/), (/n-1, n-1/))
!c(2:n,2:n)=RESHAPE((/(((b(j)-b(j-1)) *c(i-1,j)+d(i,j) ,j=2,n),i=2,n)/), (/n-1, n-1/))
!c(2:n,2:n)=spread(b(2:n)-b(1:n-1),ncopies = n-1,dim=1) * c(1:n-1,2:n) +d(2:n,2:n)
This is the best I can get, but it still has a do loop:
do i=2, n
   c(i,2:n)=c(i-1,2:n)*(b(2:n)-b(1:n-1))+d(i,2:n)
end do
Can all do loops be replaced by intrinsic functions and array notation? Or can this one be replaced somehow?
In my experience, nothing beats the traditional do loop. All the array intrinsics create memory and CPU overhead by copying data to temporary space (usually on the stack), reshaping it, and the like. If you're manipulating large arrays, you may encounter out-of-memory issues with the intrinsic functions.
Your best option is to stick to a 2-d loop that has the indexes correctly laid out:
do i=2, n
   e = c(1:n,i-1)
   do j=2, n
      a=b(j)-b(j-1)
      c(j,i)=a*e(j)+d(j,i)
   end do
end do
By swapping the indexes (and making sure your dimension declarations follow), you save on memory paging: the c(j,i) and d(j,i) references travel column-wise through memory, while c(j,i-1) would have cut across columns (and produced paging overhead), so we copy it into a temporary array e.
I think this will be the fastest....
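For reference, here is a self-contained version of that sketch (the array sizes and initial values are made up) that compiles on its own:
! Demo of the transposed-index version: the first index runs fastest, so
! c(j,i) and d(j,i) are traversed contiguously in Fortran's column-major storage.
program column_major_demo
   implicit none
   integer, parameter :: n = 1000
   real :: b(n), c(n,n), d(n,n), e(n)
   real :: a
   integer :: i, j

   call random_number(b)
   call random_number(c)
   call random_number(d)

   do i = 2, n
      e = c(1:n, i-1)            ! previous column, copied once per i
      do j = 2, n
         a = b(j) - b(j-1)
         c(j, i) = a*e(j) + d(j, i)
      end do
   end do

   print *, c(n, n)
end program column_major_demo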
Starting with
do i=2, n
   do j=2, n
      a=b(j)-b(j-1)
      c(i,j)=a*c(i-1,j)+d(i,j)
   end do
end do
we can quickly eliminate the do loops with some nifty use of SIZE, SPREAD and EOSHIFT:
res = SPREAD(b - EOSHIFT(b,-1),2,SIZE(c,2))*EOSHIFT(c,-1) + d
Aha, turns out that the error I was receiving (V1) was due to my using RESHAPE rather than SPREAD. I fixed this in the current version (V2) and it compiles & works with both ifort and gfortran.

Fortran + OpenMP slower than sequential

I have this sequential code in Fortran. My problem is that when I add OpenMP directives, the parallelized code is slower than the sequential one, and I don't see the error.
REAL, DIMENSION(:), ALLOCATABLE :: current, next
ALLOCATE ( current(TOTAL_Z), next(TOTAL_Z))
CALL CPU_TIME(t1)
!$OMP PARALLEL SHARED (current, next) PRIVATE (z)
DO t = 1, TOTAL_TIME
   !$OMP DO SCHEDULE(STATIC, 2)
   DO z = 2, (TOTAL_Z - 1)
      next(z) = current(z) + KAPPA*DELTA_T*((current(z - 1) - 2.0*current(z) + current(z + 1)) / DELTA_Z**2)
   END DO
   !$OMP END DO
   current = next
END DO
CALL CPU_TIME(t2)
!$OMP END PARALLEL
TOTAL_Z, TOTAL_TIME, KAPPA, DELTA_T, DELTA_Z are constants.
When I run the parallelized code, I see in htop that my 2 cores are working at 100%.
For the sequential code CPU_TIME reports 79 seconds, and for the parallelized code 132 seconds.
Thanks.
I've just been experiencing the same problem.
It seems that cpu_time() is not suitable for measuring the performance of multi-threaded code: cpu_time() adds up the time of all the threads, which is likely to increase with an increasing number of threads.
I've found this in another forum,
http://software.intel.com/en-us/forums/topic/281897
You should use the system_clock() or omp_get_wtime() functions to get a more accurate timing of your routine.
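A minimal sketch of wall-clock timing with omp_get_wtime() (the loop body here is just a stand-in, not your code):
program wtime_example
   use omp_lib
   implicit none
   real(8) :: t_start, t_end, s
   integer :: k

   t_start = omp_get_wtime()           ! wall-clock time, not summed per-thread CPU time
   s = 0.0d0
   !$omp parallel do reduction(+:s)
   do k = 1, 100000000
      s = s + 1.0d0/real(k,8)
   end do
   !$omp end parallel do
   t_end = omp_get_wtime()

   print *, 'result:       ', s
   print *, 'wall time (s):', t_end - t_start
end program wtime_example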
It is probably slow because the threads are contending for access to the shared variables. If you can change it to use a reduction it would likely be faster. But that might not be easy, since the calculation for "current" accesses multiple array elements.
Depending on the number of iterations, you might also be facing a problem with false sharing on the next array. Since the chunk size for the distribution of the DO loop is rather small, the cache line holding next(z), next(z+1), next(z+2), next(z+3), etc. might be thrashing between the L1/L2 caches of the CPU cores.
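For example, a sketch based on the loop in the question (the constants are made up, and the time loop is kept outside the parallel region here for simplicity): with the default static schedule each thread gets one large contiguous block of z, so cache lines of next are shared between threads only at the block boundaries.
program diffusion_schedule
   implicit none
   integer, parameter :: TOTAL_Z = 100000, TOTAL_TIME = 1000
   real, parameter :: KAPPA = 1.0e-3, DELTA_T = 1.0e-3, DELTA_Z = 1.0e-2
   real, dimension(:), allocatable :: current, next
   integer :: z, t

   allocate (current(TOTAL_Z), next(TOTAL_Z))
   current = 0.0
   current(TOTAL_Z/2) = 1.0            ! arbitrary initial condition
   next = current

   do t = 1, TOTAL_TIME
      !$omp parallel do schedule(static)
      do z = 2, TOTAL_Z - 1
         next(z) = current(z) + KAPPA*DELTA_T* &
                   ((current(z-1) - 2.0*current(z) + current(z+1)) / DELTA_Z**2)
      end do
      !$omp end parallel do
      current = next
   end do

   print *, sum(current)
end program diffusion_schedule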
Cheers,
-michael