I was trying to parallelize a code, but it only made the performance worse. I wrote a Fortran program that runs several Monte Carlo integrations and then finds their mean.
implicit none
integer, parameter :: n = 100
integer, parameter :: m = 1000000
real, parameter :: pi = 3.141592654
real :: MC, integ, x, y
integer :: tid, OMP_GET_THREAD_NUM, i, j, init, inside

read *, init
call OMP_SET_NUM_THREADS(init)
call random_seed()

!$OMP PARALLEL DO PRIVATE(J,X,Y,INSIDE,MC)
!$OMP& REDUCTION(+:INTEG)
do i = 1, n
   inside = 0
   do j = 1, m
      call random_number(x)
      call random_number(y)
      x = x*pi
      y = y*2.0
      if (y .le. x*sin(x)) then
         inside = inside + 1
      endif
   enddo
   MC = inside*2*pi/m
   integ = integ + MC/n
enddo
!$OMP END PARALLEL DO

print *, integ
end
As I increase the number of threads, the run time increases drastically. I have looked for solutions to such problems, and in most cases shared variables turn out to be the problem, but I cannot see how that is affecting my case.
I am running it on a 16-core processor using the Intel Fortran compiler.
EDIT: The program after adding implicit none, declaring all variables and adding the private clause
You should not use RANDOM_NUMBER for high performance computing, and definitely not in parallel threads. There are NO guarantees about the quality of the standard random number generator, nor about its thread safety. See Can Random Number Generator of Fortran 90 be trusted for Monte Carlo Integration?
Some compilers will use a fast algorithm that cannot be called in parallel. Some compilers will have a slow method that is callable from parallel code. Some will be both fast and allowed in parallel. Some will generate poor quality random sequences, some better ones.
You should use a parallel PRNG library. There are many. See here for recommendations for Intel: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/283349 I use a library based on http://www.cmiss.org/openCMISS/wiki/RandomNumberGenerationWithOpenMP in my own slightly improved version https://bitbucket.org/LadaF/elmm/src/e732cb9bee3352877d09ae7f6b6722157a819f2c/src/simplevtk.f90?at=master&fileviewer=file-view-default but be careful: I don't care about the quality of the sequence in my applications, only about speed.
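To show the general shape of the thread-local approach (this is not the MKL or openCMISS library linked above; the xorshift64 generator below is made up purely for illustration and its statistical quality is not vouched for), a minimal sketch:

! Minimal sketch: each thread carries its own PRNG state, seeded from its
! thread number, so no intrinsic/shared generator is touched inside the loop.
! The xorshift64 generator here is illustrative only, not a quality PRNG.
program mc_thread_local_rng
   use omp_lib, only: omp_get_thread_num
   use iso_fortran_env, only: int64
   implicit none
   integer, parameter :: n = 100, m = 1000000
   real, parameter :: pi = 3.141592654
   integer(int64) :: state
   real :: x, y, MC, integ
   integer :: i, j, inside

   integ = 0.0
!$OMP PARALLEL PRIVATE(state, j, x, y, inside, MC) REDUCTION(+:integ)
   state = 88172645463325252_int64 + 7919_int64*omp_get_thread_num()
!$OMP DO
   do i = 1, n
      inside = 0
      do j = 1, m
         x = pi * rnd(state)
         y = 2.0 * rnd(state)
         if (y <= x*sin(x)) inside = inside + 1
      end do
      MC = inside*2*pi/m
      integ = integ + MC/n
   end do
!$OMP END DO
!$OMP END PARALLEL
   print *, integ

contains

   ! One xorshift64 step, mapped to a uniform real in [0,1).
   real function rnd(s)
      integer(int64), intent(inout) :: s
      s = ieor(s, ishft(s, 13))
      s = ieor(s, ishft(s, -7))
      s = ieor(s, ishft(s, 17))
      ! ishft is a logical shift, so the top 53 bits give a nonnegative value.
      rnd = real(real(ishft(s, -11), 8) / 9007199254740992.0_8)
   end function rnd

end program mc_thread_local_rng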
Regarding the old version:
You have a race condition there.
With
inside=inside+1
more threads can be competing to read and write the variable. You will have to synchronize the access somehow. If you make it a reduction, you will have problems with
integ=integ+MC/n
and if you make it private, then inside=inside+1 will only count locally.
MC also appears to be in a race condition, because more threads will be writing to it. It is not at all clear what MC does and why it is there, because you are not using the value anywhere. Are you sure the code you show is complete? If not, please see How to make a Minimal, Complete, and Verifiable example.
See With OpenMP parallelized nested loops run slow and many other examples of how a race condition can make a program slow.
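To make the effect of such a race concrete, here is a tiny sketch (not taken from the question) contrasting an unsynchronized shared counter with a reduction:

! An unsynchronized shared counter loses increments; a reduction does not.
program race_demo
   implicit none
   integer :: shared_count, reduced_count, i

   shared_count = 0
!$OMP PARALLEL DO
   do i = 1, 1000000
      shared_count = shared_count + 1   ! unsynchronized read-modify-write: racy
   end do
!$OMP END PARALLEL DO

   reduced_count = 0
!$OMP PARALLEL DO REDUCTION(+:reduced_count)
   do i = 1, 1000000
      reduced_count = reduced_count + 1 ! each thread counts privately, summed at the end
   end do
!$OMP END PARALLEL DO

   print *, 'racy counter:     ', shared_count   ! typically less than 1000000
   print *, 'reduction counter:', reduced_count  ! always 1000000
end program race_demo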
I have a question about parallel computing. I don't know much about parallel computing; I just came up with an idea and I would like to discuss its practicality.
I am working on a Fortran code, and I have many do loops. For example:
do i = 1, 1000
   ! some calculations
end do
The key point is that inside the loop I do calculations such that the result of the current iteration does not influence the next one (i.e. the calculations inside the loop are independent; I can obtain the result for i=100 without having the result for i=99).
What I would like, instead of waiting for a single core to go over the whole loop, is to distribute this task (the execution of the do loop) among the cores and make it faster.
For such a scenario, can I use parallel computing to increase the speed? I know there are some options in Intel Fortran for optimization and parallel computing, but is selecting these options enough? Or do I need any additional code or subroutines to enable parallel computing?
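For independent iterations like these, the usual OpenMP approach is a single directive around the loop plus compiling with the OpenMP flag (-qopenmp for ifort, -fopenmp for gfortran). A minimal sketch, with a made-up loop body standing in for the real calculations:

! Minimal sketch: distributing an independent do loop over threads with OpenMP.
program independent_loop
   implicit none
   integer, parameter :: n = 1000
   real :: result(n)
   integer :: i

!$OMP PARALLEL DO
   do i = 1, n
      ! some calculations that depend only on i
      result(i) = sqrt(real(i)) + sin(real(i))
   end do
!$OMP END PARALLEL DO

   print *, sum(result)
end program independent_loop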
I have built a C++ code without thinking that I would later need to multithread it. I have now multithreaded the 3 main for loops with OpenMP. Here are the performance comparisons (as measured with time from bash):
Single thread
real 5m50.008s
user 5m49.072s
sys 0m0.877s
Multi thread (24 threads)
real 1m22.572s
user 28m28.206s
sys 0m4.170s
The use of 24 threads has reduced the real time by a factor of 4.24. Of course, I did not expect the code to be 24 times faster; I did not really know what to expect, actually.
- Is there a rule of thumb that would allow one to predict how much faster a given code will run with n threads compared to a single thread?
- Are there general tips for improving the performance of multithreaded code?
I'm sure you know of the obvious things, like the cost of barriers. But it's hard to draw a line between what is trivial and what could be helpful to someone. Here are a few lessons learned from use; if I think of more, I'll add them:
Always try to use thread-private variables for as long as possible; consider this even for reductions, handing back only a small number of collective results.
Prefer parallelizing long sections of code and long parallel regions (#pragma omp parallel ... #pragma omp for) over parallelizing loops separately (#pragma omp parallel for); see the sketch after this list.
Don't parallelize short loops. In a 2-dimensional iteration it often suffices to parallelize the outer loop. If you do parallelize the whole thing using collapse, be aware that OpenMP will linearize it, introducing a fused index variable, and accessing the individual indices separately then incurs overhead.
Use thread-private heaps. Avoid sharing pools and collections if possible, even if different members of the collection would be accessed independently by different threads.
Profile your code and see how much time is spent on busy waiting and where that may be occurring.
Learn the consequences of using different schedule strategies. Try what's better, don't assume.
If you use critical sections, name them. All unnamed CSs have to wait for each other.
If your code uses random numbers, make it reproducible: define thread-local RNGs, seed everything in a controllable manner, impose order on reductions. Benchmark deterministically, not statistically.
Browse similar questions on Stack Overflow, e.g., the wonderful answers here.
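As an illustration of the second tip above (one long parallel region enclosing several worksharing loops, so the thread team is created once instead of once per loop), here is a minimal sketch; it uses Fortran OpenMP directives rather than the #pragma form, but the idea is the same:

! One parallel region with two worksharing loops; the implicit barrier at each
! END DO keeps the second loop from reading a(:) before it is fully written.
program long_parallel_region
   implicit none
   integer, parameter :: n = 1000000
   real :: a(n), b(n)
   integer :: i

!$OMP PARALLEL
!$OMP DO
   do i = 1, n
      a(i) = real(i)
   end do
!$OMP END DO
!$OMP DO
   do i = 1, n
      b(i) = 2.0*a(i) + 1.0
   end do
!$OMP END DO
!$OMP END PARALLEL

   print *, a(n), b(n)
end program long_parallel_region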
I have a question about testing an MPI program. I wrote the FW algorithm with Open MPI. The program works fine and gives correct results, but the problem is that it takes more time than my sequential program (I have only tried to test it on one computer). Does someone have an idea why that happens? Thanks
It is a common misconception that a parallel implementation of a program will always be quicker than its sequential version.
The trouble with parallelizing a program is that it introduces a fairly large overhead from the use of multiple threads, which a sequential program running on a single thread does not suffer from. Not only do we have to set these threads up initially, there is also communication taking place that wasn't necessary in the sequential program.
For relatively small problems, you will find that a sequential solution will almost always outperform the parallel program. As the size of your problem scales up, the cost of managing multiple processes gradually becomes negligible with respect to the computational cost of the problem itself. As a result, your parallel version will begin to outperform your sequential program.
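A toy model makes the crossover visible (the numbers below are made up, not measurements): if the cost of setting up processes and communicating is roughly constant, the achievable speedup is t_serial / (t_serial/p + overhead), which is poor for small problems and approaches p for large ones.

! Toy model (made-up numbers) of why small problems do not benefit from parallelism.
program speedup_model
   implicit none
   real :: t_serial, overhead, speedup
   integer :: p

   p = 4            ! number of processes (assumed)
   overhead = 0.5   ! seconds of fixed startup/communication cost (assumed)

   t_serial = 1.0   ! small problem: 1 s of sequential work
   speedup  = t_serial / (t_serial/p + overhead)
   print *, 'small problem speedup: ', speedup   ! about 1.3x

   t_serial = 100.0 ! large problem: 100 s of sequential work
   speedup  = t_serial / (t_serial/p + overhead)
   print *, 'large problem speedup: ', speedup   ! about 3.9x, close to the ideal 4x
end program speedup_model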
I am having an issue with the simple code below. I am trying to use OpenMP with GFortran. The results of the code below for x should be the same with AND without the !$OMP statements, since the parallel code and the serial code should output the same result.
program test
   implicit none
   !INCLUDE 'omp_lib.h'
   integer :: i, j
   real(8) :: x, t1, t2

   x = 0.0d0
!$OMP PARALLEL DO PRIVATE(i,j) SHARED(X)
   do i = 1, 3
      write(*,*) i
      !pause
      do j = 1, 10000000
!$OMP ATOMIC
         X = X + 2.d0*cos(i*j*1.0d0)
      end do
   end do
!$OMP END PARALLEL DO
   write(*,*) x
end program test
But strangely I am getting the following results for x:
Parallel: -3.17822355415XXXXX
Serial:   -3.1782235541569084
where XXXXX are some random digits. Every time I run the serial code, I get the same result (-3.1782235541569084). How can I fix this? Is this problem due to some OpenMP working-precision option?
Floating-point arithmetic is not strictly associative. In f-p arithmetic neither a+(b+c)==(a+b)+c nor a*(b*c)==(a*b)*c is always true, as they both are in real arithmetic. This is well-known, and extensively explained in answers to other questions here on SO and at other reputable places on the web. I won't elaborate further on that point here.
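A two-line illustration of that non-associativity (double precision; the values are chosen only to make the effect obvious):

! (a+b)+c and a+(b+c) differ in the last bit for these values.
program fp_assoc
   implicit none
   real(8) :: a, b, c
   a = 1.0d0
   b = 1.0d-16
   c = 1.0d-16
   print *, (a + b) + c   ! prints 1.0000000000000000
   print *, a + (b + c)   ! prints 1.0000000000000002
end program fp_assoc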
As you have written your program, the order of operations by which the final value of X is calculated is non-deterministic; that is, it may (and probably does) vary from execution to execution. The atomic directive only permits one thread at a time to update X, but it doesn't impose any ordering constraints on the threads reaching the directive.
Given the nature of the calculation in your program I believe that the difference you see between serial and parallel executions may be entirely explained by this non-determinism.
Before you think about 'fixing' this you should first be certain that it is a problem. What makes you think that the serial code's answer is the one true answer? If you were to run the loops backwards (still serially) and get a different answer (quite likely), which answer is the one you are looking for? In a lot of scientific computing, which is probably the core domain for OpenMP, the data available and the numerical methods used simply don't support assertions of the accuracy of program results beyond a handful of significant figures.
If you still think that this is a problem that needs to be fixed, the easiest approach is to simply take out the OpenMP directives.
To add to what High Performance Mark said, another source of discrepancy is that the compiler might have emitted x87 FPU instructions to do the math. x87 uses 80-bit internal precision and an optimised serial code would only use register arithmetic before it actually writes the final value to the memory location of X. In the parallel case, since X is a shared variable, at each iteration the memory location is being updated. This means that the 80-bit x87 FPU register is flushed to a 64-bit memory location and then read back, and some bits of precision are thus lost on each iteration, which then adds up to the observed discrepancy.
This effect is not present if a modern 64-bit CPU is used together with a compiler that emits SIMD instructions, e.g. SSE2+ or AVX. Those work with 64-bit internal precision only, and then keeping the value in a register does not give better precision than flushing it to memory and reloading it on each iteration. In this case the difference comes from the non-associativity explained by High Performance Mark.
Those effects are pretty much expected and usually accounted for. They are well studied and understood, and if your CFD algorithm breaks down when run in parallel, then the algorithm is highly numerically unstable and I would in no way trust the results it gives, even in the serial case.
By the way, a better way to implement your loop would be to use reduction:
!$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:X)
do i = 1, 3
   write(*,*) i
   !pause
   do j = 1, 10000000
      X = X + 2.d0*cos(i*j*1.0d0)
   end do
end do
This would allow the compiler to generate register-optimised code for each thread and then the loss of precision would only occur at the very end when the threads sum their local partial values in order to obtain the final value of X.
I used the ORDERED clause with your code and it works. But with that clause the performance is the same as running the code in serial.
I have a C++ code containing many for loops parallelized with OpenMP on an 8-thread computer.
But the execution is faster with a single thread than with 8 parallel threads. I was told that if the load of the for loops increases, parallelization will become efficient.
By load I mean, for example, the maximum number of iterations of a loop. The thing is, I don't have the chance to compare the single-threaded and 8-thread parallel code for a huge amount of data.
Should I use the parallel code anyway? Is it true that parallelization efficiency will increase with the load of the for loops?
The canonical use case for OpenMP is distributing the iterations of a high-iteration-count loop among a team of threads, on the condition that the loop iterations have no direct or indirect dependencies.
You can spot what I mean by direct dependencies by considering the question: does the order of loop iteration execution affect the results? If, for example, iteration N+1 uses the results of iteration N, you have such a dependency; running the loop iterations in reverse order would change the output of the routine.
By indirect dependencies I mean mainly data races, in which threads have to coordinate their access to shared data; in particular, they have to ensure that writes to shared variables happen in the correct sequence.
In many cases you can redesign a loop with dependencies to remove those dependencies.
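A small made-up example of such a redesign: an accumulated induction variable creates a loop-carried dependency, but the same value can often be computed directly from the loop index, after which the iterations are independent.

! Removing a loop-carried dependency by computing t from the loop index.
program remove_dependency
   implicit none
   integer, parameter :: n = 1000000
   real(8), parameter :: dt = 1.0d-3
   real(8) :: t, x(n)
   integer :: i

   ! Dependent form: t carries a value from one iteration to the next,
   ! so the iterations cannot safely be reordered or distributed.
   t = 0.0d0
   do i = 1, n
      t = t + dt
      x(i) = sin(t)
   end do

   ! Independent form: t is computed from i alone, so the loop parallelizes.
!$OMP PARALLEL DO PRIVATE(t)
   do i = 1, n
      t = i*dt
      x(i) = sin(t)
   end do
!$OMP END PARALLEL DO

   print *, x(n)
end program remove_dependency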
IF you have a high iteration count loop which has no such dependencies THEN you have a candidate for good speed-up with OpenMP. Here are the buts:
There is some parallel overhead to the computation at the start and end of each such loop; if the iteration count isn't high enough, this overhead may outweigh, partially or wholly, the speedup of running the iterations in parallel. The only way to determine whether this is affecting your code is to test and measure.
There can be dependencies between loop iterations more subtle than I have already outlined. Depending on your system architecture and the computations inside the loop, you might (without realising it) program your threads to fight over access to cache, to I/O resources, or to any other resource. In the worst cases, increasing the number of threads can actually decrease the execution rate.
You have to make sure that each OpenMP thread is backed by hardware, not by the pseudo-hardware that hyperthreading represents. Use one core per OpenMP thread; hyperthreading is snake oil in this domain.
I expect there are other buts to put in here, perhaps someone else will help out.
Now, turning to your questions:
Should I use parallel code anyway? Test and measure.
Is it true that parallelization efficiency will increase with the load of the for loops? Approximately, but for your code on your hardware, test and measure.
Finally, you can't become a serious parallel computationalist without measuring run times under various combinations of circumstances and learning what the measurements you make are telling you. If you can't compare sequential and parallel execution for huge amounts of data, you'll have to measure them for modest amounts of data and understand the lessons you learn before making predictions about behaviour when dealing with huge amounts of data.
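A minimal measurement sketch of the kind meant here (in Fortran rather than C++, with a made-up workload; compile with the OpenMP flag so omp_get_wtime is available):

! Time the same made-up loop serially and in parallel with omp_get_wtime.
program measure
   use omp_lib, only: omp_get_wtime
   implicit none
   integer, parameter :: n = 50000000
   real(8) :: t0, t_serial, t_parallel, s
   integer :: i

   ! Serial run.
   s = 0.0d0
   t0 = omp_get_wtime()
   do i = 1, n
      s = s + sin(real(i, 8))
   end do
   t_serial = omp_get_wtime() - t0
   print *, 'serial:   ', t_serial, ' s, result ', s

   ! Parallel run of the same work.
   s = 0.0d0
   t0 = omp_get_wtime()
!$OMP PARALLEL DO REDUCTION(+:s)
   do i = 1, n
      s = s + sin(real(i, 8))
   end do
!$OMP END PARALLEL DO
   t_parallel = omp_get_wtime() - t0
   print *, 'parallel: ', t_parallel, ' s, result ', s
   print *, 'speedup:  ', t_serial / t_parallel
end program measure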