OpenMP not improving runtime - Fortran

I inherited a piece of Fortran code and am tasked with parallelizing it for the 8-core machine we have. I have two versions of the code, and I am trying to use OpenMP compiler directives to speed it up. It works on one piece of code, but not the other, and I cannot figure out why; they're almost identical!
I ran each piece of code with and without the openMP tags, and the first one showed speed improvements, but not the second one. I hope I am explaining this clearly...
Code sample 1: (significant improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
  DO IN1=1,NN(1)
    SCATT(IN1,IN2) = DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2))
    UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
  ENDDO
ENDDO
!$OMP END PARALLEL DO
Code sample 2: (no improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
  DO IN1=1,NN(1)
    SCATTREL = DATA(2*((IN2-1)*NN(1)+IN1)-1)/(NN(1)*NN(2))
    SCATTIMG = DATA(2*((IN2-1)*NN(1)+IN1))/(NN(1)*NN(2))
    SCATT(IN1,IN2) = DCMPLX(SCATTREL, SCATTIMG)
    UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
  ENDDO
ENDDO
!$OMP END PARALLEL DO
I thought it might be an issue with memory overhead and such, and have tried various combinations of putting variables in shared() and private() clauses, but they either cause segmentation faults or make it even slower.
I also thought it might be that I'm not doing enough work in the loop to see an improvement, but since the smaller loop does show an improvement, that doesn't make sense to me.
Can anyone shed some light on what I can do to see a real speed boost in the second one?
Data on speed boost for code sample 1:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 2m 21.321s
With openMP tags: 2m 20.640s
Average runtime (profile for just this snippet)
Without openMP tags: 6.3s
With openMP tags: 4.75s
Data on speed boost for code sample 2:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 4m 46.659s
With openMP tags: 4m 49.200s
Average runtime (profile for just this snippet)
Without openMP tags: 15.14s
With openMP tags: 46.63s

The observation that the code runs slower in parallel than in serial tells me that the culprit is very likely false sharing.
The SCATT array is shared, and each thread accesses a slice of it for both reading and writing. There is no race condition in your code; however, the threads writing to the same array (albeit different slices) make things slower.
The reason is that each thread loads a portion of the array SCATT into cache, and whenever another thread writes to that portion of SCATT, the data previously stored in cache is invalidated. Although the data has not actually changed, since there is no race condition (the other thread updated a different slice of SCATT), the processor gets a signal that its cache line is invalid and thus reloads the data. This causes high data-transfer overhead.
The solution is to make each slice private to a given thread. In your case it is even simpler, as you do not need read access to SCATT at all. Just replace
SCATT(IN1,IN2) = DCMPLX(SCATTREL, SCATTIMG)
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
with
SCATT0 = DCMPLX(SCATTREL, SCATTIMG)
UADI(IN1,IN2) = SCATT0+1.0
SCATT(IN1,IN2) = SCATT0
where SCATT0 is a private variable.
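For clarity, a minimal sketch of how the full loop might then look (untested; SCATT0 is assumed to be declared DOUBLE COMPLEX, and note that SCATTREL and SCATTIMG are written by every thread and therefore must be private too, while the loop indices are automatically private in Fortran):
!$OMP PARALLEL DO PRIVATE(SCATTREL, SCATTIMG, SCATT0)
DO IN2=1,NN(2)
  DO IN1=1,NN(1)
    SCATTREL = DATA(2*((IN2-1)*NN(1)+IN1)-1)/(NN(1)*NN(2))
    SCATTIMG = DATA(2*((IN2-1)*NN(1)+IN1))/(NN(1)*NN(2))
    SCATT0 = DCMPLX(SCATTREL, SCATTIMG)
    UADI(IN1,IN2) = SCATT0+1.0
    SCATT(IN1,IN2) = SCATT0
  ENDDO
ENDDO
!$OMP END PARALLEL DO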
And why does this not happen in the first snippet? It certainly does; however, I suspect that the compiler may have optimized the problem away. When it calculated DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2)) it very likely kept the result in a register and used that value, instead of re-reading SCATT(IN1,IN2), in UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0.
Besides, if you want to speed the code up you should make the loops more efficient. The first rule of parallelization is: don't do it! Optimize the serial code first. So replace snippet 1 with the following (you could even throw in a WORKSHARE construct around the last line):
DATA = DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO private(temp)
DO IN2=1,NN(2)
  temp = (IN2-1)*NN(1)
  SCATT(:,IN2) = DATA(temp+1:temp+NN(1))
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0
You can do something similar with snippet 2 as well.
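For instance, a sketch only (untested, free-form source assumed so the long line is legal): scale DATA once, then build each column of SCATT from the interleaved real and imaginary parts with strided array sections, reusing the temp variable from the snippet above.
DATA = DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO private(temp)
DO IN2=1,NN(2)
  temp = 2*(IN2-1)*NN(1)
  SCATT(:,IN2) = DCMPLX(DATA(temp+1:temp+2*NN(1)-1:2), DATA(temp+2:temp+2*NN(1):2))
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0
DCMPLX is elemental in common compilers (gfortran, ifort), so it accepts the two array sections and returns a complex array of length NN(1).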

Related

Openmp fortran, task dependency on overlapping subarrays

TLDR: task parallel code fails to recognize task dependency on partially overlapping subarrays
I'm trying to parallelize some fortran code using task parallelism. I'm struggling to get the task dependency to work.
The following minimal example illustrates my problem:
program test
  integer, parameter :: n = 100
  double precision :: A(n,n), B(n,n)
  integer :: istart, istop
  !$omp parallel
  !$omp single
  istart = 20
  istop = 50
  !$omp task shared(A,B) firstprivate(istart,istop)&
  !$omp& depend( in: B(istart:istop,istart:istop) )&
  !$omp& depend( inout: A(istart:istop,istart:istop) )
  write(*,*) "starting task 1"
  call sleep(5)
  write(*,*) "stopping task 1"
  !$omp end task
  istart = 30
  istop = 60
  !$omp task shared(A,B) firstprivate(istart,istop)&
  !$omp& depend( in: B(istart:istop,istart:istop) )&
  !$omp& depend( inout: A(istart:istop,istart:istop) )
  write(*,*) "starting task 2"
  call sleep(5)
  write(*,*) "stopping task 2"
  !$omp end task
  !$omp end single nowait
  !$omp end parallel
end program
I compile this with GCC 9.4.0 and the -fopenmp flag.
The code has two tasks. Task 1 depends on A(20:50,20:50) and task 2 depends on A(30:60,30:60). These subarrays overlap, so I would expect task 2 to wait until task 1 has completed, but the output I get is:
starting task 1
starting task 2
stopping task 1
stopping task 2
If I comment out the lines
istart = 30
istop = 60
so that the subarrays are exactly the same instead of just overlapping, task 2 does wait on task 1.
Are task dependencies with overlapping subarrays not supported in openmp? Or am I defining the dependencies wrong somehow?
These subarrays overlap
This is forbidden by the OpenMP standard, as pointed out by @ThijsSteel.
This choice was made because of the resulting runtime overhead. Checking for overlapping array regions in dependences would be very expensive in pathological cases (especially for arrays with many dimensions). A typical pathological case is when many tasks write to the same array (on different, exclusive parts) and one task then operates on the whole array: this creates a lot of checks and dependences, whereas it could be optimized with a specific dependency pattern. An even more pathological case is a transposition of the sub-arrays (from n horizontal bands to n vertical bands): this quickly results in n² dependences, which is extremely inefficient when n is large. An optimization consists in aggregating the dependences so as to get 2*n of them, but this is expensive to do at runtime, and a better way is actually not to use dependences at all but task-group synchronization.
From the user's point of view, there are only a few options:
operate at a bigger grain, even though it means more synchronization (and so possibly poorer performance);
try to group tasks operating on distinct sub-arrays by checking that manually. You can cheat OpenMP with fake dependencies to achieve this (see the sketch just after this list), or add more task synchronizations (typically a taskwait). You can also make use of advanced task-dependence-type modifiers like depobj, inoutset or mutexinoutset, now available in recent versions of the OpenMP standard.
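A minimal sketch of the fake-dependency trick applied to the example above (untested): let both tasks declare an inout dependence on the whole array A, so the runtime sees the same storage location and runs them in order.
!$omp task shared(A,B) firstprivate(istart,istop) depend(inout: A)
write(*,*) "task 1 works on A(istart:istop,istart:istop)"
!$omp end task
istart = 30
istop = 60
!$omp task shared(A,B) firstprivate(istart,istop) depend(inout: A)
write(*,*) "task 2 works on A(istart:istop,istart:istop)"
!$omp end task
The downside is that this serializes every task touching A, including tasks whose sub-arrays are genuinely disjoint.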
Most of the time, an algorithm is composed of several waves of tasks, where each wave operates on non-overlapping sub-arrays. In this case, you can simply add a task synchronization at the end of each wave of the computation (or use recursive tasks).
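A sketch of that wave pattern (the process_block/postprocess_block workers, nblocks and iblk are hypothetical, introduced only to illustrate the structure): tasks within a wave touch disjoint blocks and need no depend clauses at all; a taskwait separates the waves.
!$omp parallel
!$omp single
do iblk = 1, nblocks
  !$omp task firstprivate(iblk) shared(A)
  call process_block(A, iblk)       ! wave 1: each task works on its own block
  !$omp end task
end do
!$omp taskwait                      ! all wave-1 tasks finish here
do iblk = 1, nblocks
  !$omp task firstprivate(iblk) shared(A)
  call postprocess_block(A, iblk)   ! wave 2: may overlap wave-1 blocks arbitrarily
  !$omp end task
end do
!$omp end single
!$omp end parallel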
This is a known limitation of OpenMP, and some researchers have worked on it (including me). I expect new specific dependency modifiers to be added in the future, and possibly user-defined partitioning-related features in the long run (some research runtimes like StarPU do that, for example, but not everyone agrees on adding this to OpenMP, as it is rather high-level and not simple to include).

Stepping into a pragma directive with a parallel for loop (C++, OpenMP, Nsight Eclipse IDE)

Using Nsight Eclipse Edition 10.2 to debug a plain C++ code using gdb 7.11.1.
The code uses a pragma call to OpenMP for forking a for-loop.
The following is a minimal working example,
where a simple array q is filled with values of another variable p:
#pragma omp parallel for schedule (static)
for(int p=pstart; p<pend; p++){
    const unsigned i = id[p];
    if(start <= i && i < end)
        q[i - start] = p;
}
In debug mode I would want to use the step-into function (classically F5) to follow how the array q gets filled in with p's. However, that steps over the for loop altogether and resumes where the parallel threads join again.
Is there a way to force stepping into a pragma directive/openMP loop?
Is there a way to force stepping into a pragma directive/openMP loop?
That will depend on the debugger, but it's also not entirely clear what it would mean. Since many threads execute the parallel loop, would you expect each of them to stop and then step together? How do you expect to show the different state of each thread? (each will have its own p and i). What happens if the thread control flow diverges?
There are debuggers which can do some of this (such as TotalView on Linux), but it's not trivial to do (and TotalView costs money [which is entirely fair and reasonable :-)]).
What you may need to do is set a breakpoint inside the loop, and then handle it being hit by N threads...
(Which doesn't answer your precise question, but does let you see what's going on in the loop, which is possibly what you really need to do!)

Program is not any faster with OpenMP

My goal is to parallelize a section in my Fortran program. The flow of the program is:
Read data from a file
make some computations
write the results to 2 different files
Here I want to parallelize the writing process since I’m writing into different files.
module foo
  use omp_lib
  implicit none
  type element
    integer, dimension(:), allocatable :: v1, v2
    real(kind=8), dimension(:,:), allocatable :: M
  end type element
contains
  subroutine test()
    implicit none
    type(element) :: e
    do
      e = read_data_from_file()
      call compute_data(e)
      !$OMP SECTIONS
      !$OMP SECTION
      !$ call write_to_file1(e)
      !$OMP SECTION
      !$ call write_to_file2(e)
      !$OMP END SECTIONS
    end do
  end subroutine test
  ...
end module foo
But this program isn't going any faster, so I think I'm missing something?
In general one can divide scientific computing codes into bandwidth-bound and compute-bound algorithms. The bandwidth-bound ones are those that perform only a few operations on the data they need, e.g. O(n) flops on O(n) data. Given hard-disk and network speeds, I/O is a bandwidth-bound operation as well, and is therefore not parallelizable, or only poorly so.
If you really want to gain performance from parallelization, split the code into bandwidth-bound and compute-bound parts and spend your time parallelizing the latter.
If you specify your problem more precisely, there are hundreds of experts eager to solve it. From the comment on the answer above I see that you are using binary output but still have bandwidth left to write faster. That means your disk speed is fine and you're not limited by parsing, but rather that your actual program is not producing data at a faster pace than this.
So optimize your code so that it catches up with your write speed, instead of trying to increase the write speed of an equally slow code.
Writing the 2 files sequentially at the maximum of your bandwidth is as fast as, and much easier than, writing them in parallel (at the same maximum speed).
If I am mistaken, and you are indeed limited by IO, maybe this other question/answer can help you: How to avoid programs in status D.

OpenMP race condition (Fortran 77 w/ COMMON block)

I am trying to parallelise some legacy Fortran code with OpenMP.
Checking for race conditions with Intel Inspector, I have come across a problem in the following code (simplified, tested example):
PROGRAM TEST
!$ use omp_lib
implicit none
DOUBLE PRECISION :: x,y,z
COMMON /firstcomm/ x,y,z
!$OMP THREADPRIVATE(/firstcomm/)
INTEGER :: i
!$ call omp_set_num_threads(3)
!$OMP PARALLEL DO
!$OMP+ COPYIN(/firstcomm/)
!$OMP+ PRIVATE(i)
do i=1,3000
z = 3.D0
y = z+log10(z)
x=y+z
enddo
!$OMP END PARALLEL DO
END PROGRAM TEST
Intel Inspector detects a race condition between the following lines:
!$OMP PARALLEL DO (read)
z = 3.D0 (write)
The Inspector "Disassembly" view offers the following about the two lines, respectively (I do not understand much about these, apart from the fact that the memory addresses in both lines seem to be different):
0x3286 callq 0x2a30 <memcpy>
0x3338 movq %r14, 0x10(%r12)
As in my main application, the problem occurs for one (/some) variable in the common block, but not for others that are treated in what appears to be the same way.
Can anyone spot my mistake, or is this race condition a false positive?
I am aware that the use of COMMON blocks, in general, is discouraged, but I am not able to change this for the current project.
Technically speaking, your example code is incorrect, since you are using COPYIN to initialise the threadprivate copies with data from an uninitialised COMMON block. But that is not the reason for the data race: adding a DATA statement or simply assigning to x, y, and z before the parallel region does not change the outcome.
This is either a (very old) bug in the Intel Fortran Compiler, or Intel is interpreting the text of the OpenMP standard (section 2.15.4.1 of the current version) strangely:
The copy is done, as if by assignment, after the team is formed and prior to the start of execution of the associated structured block.
Intel implements this text by inserting a memcpy at the beginning of the outlined procedure. In other words:
!$OMP PARALLEL DO COPYIN(/firstcomm/)
do i = 1, 3000
...
end do
!$OMP END PARALLEL DO
becomes (in a mixture of Fortran and pseudo-code):
par_region0:
   my_firstcomm = get_threadprivate_copy(/firstcomm/)
   if (my_firstcomm != firstcomm) then
      memcpy(my_firstcomm, firstcomm, size of firstcomm)
   end if
   // Actual implementation of the DO worksharing construct
   call determine_iterations(1, 3000, low_it, high_it)
   do i = low_it, high_it
      ...
      ... my_firstcomm used here instead of firstcomm
      ...
   end do
   call openmp_barrier
end par_region0

MAIN:
   // Prepare a parallel region with 3 threads
   // and fire the outlined code in the worker threads
   call start_parallel_region(3, par_region0)
   // Fire the outlined code in the master thread
   call par_region0
   call end_parallel_region
The outlined procedure first finds the address of the threadprivate copy of the common block, then compares that address to the address of the common block itself. If both addresses match, then the code is being executed in the master thread and no copy is needed, otherwise memcpy is called to make a bitwise copy of the master's data into the threadprivate block.
Now, one would expect that there should be a barrier at the end of the initialisation part and right before the start of the loop, and although Intel employees claim that there is one, there is none (tested with ifort 11.0, 14.0, and 16.0). Even more, the Intel Fortran Compiler does not honour the list of variables in the COPYIN clause and copies the entire common block if any variable contained in it is listed in the clause, i.e. COPYIN(x) is treated the same as COPYIN(/firstcomm/).
Whether those are bugs or features of Intel Fortran Compiler, only Intel could tell. It could also be that I'm misreading the assembly output. If anyone could find the missing barrier, please let me know. One possible workaround would be to split the combined directive and insert an explicit barrier before the worksharing construct:
!$OMP PARALLEL COPYIN(/firstcomm/) PRIVATE(I)
!$OMP BARRIER
!$OMP DO
do i = 1, 3000
   z = 3.D0
   y = z + log10(z)
   x = y + z
end do
!$OMP END DO
!$OMP END PARALLEL
With that change, the data race will shift into the initialisation of the internal dispatch table within the log10 call, which is probably a false positive.
GCC implements COPYIN differently. It creates a shared copy of the master thread's threadprivate data, which it then passes to the worker threads for use in the copy process.

Fortran + OpenMP slower than sequential

I have this sequential code in Fortran. My problem is that when I add OpenMP directives, the parallelized code is slower than the sequential one, and I don't see the error.
REAL, DIMENSION(:), ALLOCATABLE :: current, next
ALLOCATE ( current(TOTAL_Z), next(TOTAL_Z) )
CALL CPU_TIME(t1)
!$OMP PARALLEL SHARED (current, next) PRIVATE (z)
DO t = 1, TOTAL_TIME
  !$OMP DO SCHEDULE(STATIC, 2)
  DO z = 2, (TOTAL_Z - 1)
    next(z) = current(z) + KAPPA*DELTA_T*((current(z - 1) - 2.0*current(z) + current(z + 1)) / DELTA_Z**2)
  END DO
  !$OMP END DO
  current = next
END DO
CALL CPU_TIME(t2)
!$OMP END PARALLEL
TOTAL_Z, TOTAL_TIME, KAPPA, DELTA_T, DELTA_Z are constants.
When I run the parallel code, I see in htop that my 2 cores are working at 100%.
In the sequential code, CPU_TIME reports 79 s; in the parallel code, 132 s.
Thanks
I've just been experiencing the same problem.
It seems that cpu_time() is not suitable for measuring the performance of multi-threaded code: cpu_time() adds up the CPU time of all the threads, which is likely to increase with an increasing number of threads.
I've found this in another forum,
http://software.intel.com/en-us/forums/topic/281897
You should use system_clock() or omp_get_wtime() functions to get a more accurate timing of your routine.
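For example, a minimal self-contained sketch of such wall-clock timing (omp_get_wtime() measures elapsed time rather than summed CPU time; the sleep call is just a stand-in for real work and is a GNU extension, as used elsewhere in this thread):
program timing_example
  use omp_lib
  implicit none
  double precision :: tstart, tend
  tstart = omp_get_wtime()
  !$omp parallel
  call sleep(1)                     ! stand-in for the real parallel work
  !$omp end parallel
  tend = omp_get_wtime()
  print *, 'wall time (s):', tend - tstart
end program timing_example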
It is probably slow because the threads are contending for access to the shared variables. If you could change it to use a reduction it would likely be faster, but that might not be easy, since the calculation for "current" accesses multiple array elements.
Depending on the number of iterations, you might also be facing a false-sharing problem on the next array. Since the chunk size for the distribution of the DO loop is rather small, the cache line holding next(z), next(z+1), next(z+2), next(z+3), etc. might be thrashing between the L1/L2 caches of the CPUs.
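A sketch of that suggestion applied to the loop from the question (untested): dropping the chunk size, i.e. plain STATIC, gives each thread one large contiguous range of z, so neighbouring threads rarely write to the same cache line of next.
!$OMP DO SCHEDULE(STATIC)
DO z = 2, (TOTAL_Z - 1)
  next(z) = current(z) + KAPPA*DELTA_T*((current(z - 1) - 2.0*current(z) + current(z + 1)) / DELTA_Z**2)
END DO
!$OMP END DO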
Cheers,
-michael