Is it possible to speed-up my Fortran code by parallelization? - fortran

I have a question about parallel computing. I don't much know about parallel computing, I just came up with an idea and I would like to discuss its practicality.
I am working on a Fortran code, and I have many do loops. For example:
do i=1,1000
*some calculations
end do
The key point is inside the loop, I do some calculations such that the result of the current loop does not influence the next loop (i.e. calculations inside the loop are independent, I can obtain the result for i=100 without having the result from i=99).
What I would like to have is instead of waiting single core to go over the whole loop, distribute this task (execution of do loop) between the cores and make it faster.
For such a scenario can I use parallel computing to increase the speed? I know there are some options in intel Fortran for optimization and parallel computing, but selecting these options is enough? Or do I need any additional code or subroutine to enable parallel computing?

Related

Performance issue while parallelizing using OpenMP

I was trying to parallelize a code but it only deteriorated the performance. I wrote a Fortran code which runs several Monte Carlo integrations and then finds their mean.
implicit none
integer, parameter :: n=100
integer, parameter :: m=1000000
real, parameter :: pi=3.141592654
real MC,integ,x,y
integer tid,OMP_GET_THREAD_NUM,i,j,init,inside
read*,init
call OMP_SET_NUM_THREADS(init)
call random_seed()
!$OMP PARALLEL DO PRIVATE(J,X,Y,INSIDE,MC)
!$OMP& REDUCTION(+:INTEG)
do i=1,n
inside=0
do j=1,m
call random_number(x)
call random_number(y)
x=x*pi
y=y*2.0
if(y.le.x*sin(x))then
inside=inside+1
endif
enddo
MC=inside*2*pi/m
integ=integ+MC/n
enddo
!$OMP END PARALLEL DO
print*, integ
end
As I increase the number of threads, run-time increases drastically. I have looked for solutions for such problems and in most cases shared memory elements happen to be the problem but I cannot see how it is affecting my case.
I am running it on a 16 core processor using Intel Fortran compiler.
EDIT: The program after adding implicit none, declaring all variables and adding the private clause
You should not use RANDOM_NUMBER for high performance computing and definitely not in parallel threads. There NO guarantees about the quality of the random number generator and about thread safety of the standard random number generator. See Can Random Number Generator of Fortran 90 be trusted for Monte Carlo Integration?
Some compilers will use a fast algorithm that cannot be called in parallel. Some compilers will ave slow method but callable from parallel. Some will be both fast and allowed from parallel. Some will generate poor quality random sequences, some better.
You should use some parallel PRNG library. There are many. See here for recommendations for Intel https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/283349 I use library based on http://www.cmiss.org/openCMISS/wiki/RandomNumberGenerationWithOpenMP in my own slightly improved version https://bitbucket.org/LadaF/elmm/src/e732cb9bee3352877d09ae7f6b6722157a819f2c/src/simplevtk.f90?at=master&fileviewer=file-view-default but be careful, I don't care about the quality of the sequence in my applications, only about speed.
To the old version:
You have a race condition there.
With
inside=inside+1
more threads can be competing for writing and reading the variable. You will have to somehow synchronize the access. If you make it reduction you will have problems with
integ=integ+MC/n
if you make it private, then inside=inside+1 will only count locally.
MC also appears to be in a race condition, because more threads will be writing in it. It is not clear at all what MC does and why is it there, because you are not using the value anywhere. Are you sure the code you show is complete? If not, please see How to make a Minimal, Complete, and Verifiable example.
See this With OpenMP parallelized nested loops run slow an many other examples how a race condition can make program slow.

General tips to improve multithreading (in C++)

I have build a C++ code without thinking that I would later have the need to multithread it. I have now multithreaded the 3 main for loops with openMP. Here are the performance comparisons (as measured with time from bash)
Single thread
real 5m50.008s
user 5m49.072s
sys 0m0.877s
Multi thread (24 threads)
real 1m22.572s
user 28m28.206s
sys 0m4.170s
The use of 24 cores have reduced the real time by a factor of 4.24. Of course, I did not expect the code to be 24 times faster. I did not really know what to expect actually.
- Is there a rule of thumb that would allow one to make prediction about how much faster will a given code run with n threads in comparison to a single thread?
- Are there general tips in order to improve the performance of multithreaded processes?
I'm sure you know of the obvious like the cost of barriers. But it's hard to draw a line between what is trivial and what could be helpful to someone. Here are a few lessons learned from use, if I think of more I'll add them:
Always try to use thread private variables as long as possible, consider that even for reductions, providing only a small number of collective results.
Prefer parallel runs of long sections of code and long parallel sections (#pragma omp parallel ... #pragma omp for), instead of parallelizing loops separately (#pragma omp parallel for).
Don't parallelize short loops. In a 2-dimensional iteration it often suffices to parallelize the outer loop. If you do parallelize the whole thing using collapse, be aware that OpenMP will linearize it introducing a fused variable and accessing the indices separately incurs overhead.
Use thread private heaps. Avoid sharing pools and collections if possible, even though different members of the collection would be accessed independently by different threads.
Profile your code and see how much time is spent on busy waiting and where that may be occurring.
Learn the consequences of using different schedule strategies. Try what's better, don't assume.
If you use critical sections, name them. All unnamed CSs have to wait for each other.
If your code uses random numbers, make it reproducible: define thread-local RNGs, seed everything in a controllable manner, impose order on reductions. Benchmark deterministically, not statistically.
Browse similar questions on Stack Overflow, e.g., the wonderful answers here.

Recursive parallel function not using all cores

I recently implemented a recursive negamax algorithm, which I parallelized using OpenMP.
The interesting part is this:
#pragma omp parallel for
for (int i = 0; i < (int) pos.size(); i++)
{
int val = -negamax(pos[i].first, -player, depth - 1).first;
#pragma omp critical
if (val >= best)
{
best = val;
move = pos[i].second;
}
}
On my Intel Core i7 (4 physical cores and hyper threading), I observed something very strange: while running the algorithm, it was not using all 8 available threads (logical cores), but only 4.
Can anyone explain why is it so? I understand the reasons the algorithm doesn't scale well, but why doesn't it use all the available cores?
EDIT: I changed thread to core as it better express my question.
First, check whether you have enough iteration count, pos.size(). Obviously this should be a sufficient number.
Recursive parallelism is an interesting pattern, but it may not work very well with OpenMP, unless you're using OpenMP 3.0's task, Cilk, or TBB. There are several things that need to be considered:
(1) In order to use a recursive parallelism, you mostly need to explicitly call omp_set_nested(1). AFAIK, most implementations of OpenMP do not recursively spawn parallel for, because it may end up creating thousands physical threads, just exploding your operating system.
Until OpenMP 3.0's task, a OpenMP has a sort of 1-to-1 mapping of logical parallel task to a physical task. So, it won't work well in such recursive parallelism. Try out, but don't be surprised if even thousands threads are created!
(2) If you really want to use recursive parallelism with a traditional OpenMP, you need to implement code that controls the number of active threads:
if (get_total_thread_num() > TOO_MANY_THREADS) {
// Do not use OpenMP
...
} else {
#pragma omp parallel for
...
}
(3) You may consider OpenMP 3.0's task. In your code, there could be huge number of concurrent tasks due to a recursion. To be efficiently working on a parallel machine, there must be an efficient mapping algorithm these logical concurrent tasks to physical threads (or logical processor, core). A raw recursive parallelism in OpenMP will create actual physical threads. OpenMP 3.0's task does not.
You may refer to my previous answer related to a recursive parallelism: C OpenMP parallel quickSort.
(4) Intel's Cilk Plus and TBB support full nested and recursive parallelism. In my small test program, the performance was far better than OpenMP 3.0. But, that was 3 years ago. You should check the latest OpenMP's implementation.
I have not a detailed knowledge of negamax and minimax. But, my gut says that using recursive pattern and a lock are unlikely to give a speedup. A simple Google search gives me: http://supertech.csail.mit.edu/papers/dimacs94.pdf
"But negamax is not a efficient serial search algorithm, and thus, it
makes little sense to parallelize it."
Optimal parallelism level has some additional considerations except as much threads as available. For example, operation systems used to schedule all threads of a single process to a single processor to optimize cache performance (unless the programmer changed it explicitly).
I guess OpenMP makes similar considerations when executing such code and you cannot always assume the maximum thread number is executed/
Whaddya mean all 8 available threads ? A CPU like that can probably run 100s of threads ! You may believe that 4 cores with hyper-threading equates to 8 threads, but your OpenMP installation probably doesn't.
Check:
Has the environment variable OMP_NUM_THREADS been created and set ? If it is set to 4 there's your answer, your OpenMP environment is configured to start only 4 threads, at most.
If that environment variable hasn't been set, investigate the use, and impact, of the OpenMP routines omp_get_num_threads() and omp_set_num_threads(). If the environment variable has been set then omp_set_num_threads() will override it at run time.
Whether 8 hyper-threads outperform 4 real threads.
Whether oversubscribing, eg setting OMP_NUM_THREADS to 16, does anything other than ruin performance.

Parallelization efficiency of openMP

I have a C++ code containig many for-loops parallelized with openMP on a 8-thread computer.
But the speed of execution with single thread is faster than parallel 8 thread. I was told that if the load of the for-loops increases parallelization will become efficient.
Here with load I mean for example maximum number of iterations for a loop. The thing is I dont have a chance to compare single and 8-thread parallel code for a huge amount of data.
Should I use parallel code anyway? Is it true that parallelization efficiency will increase with load of for-loops?
The canonical use case for OpenMP is the distribution among a team of threads of the iterations of a high iteration count loop with the condition that the loop iterations have no direct or indirect dependencies.
You can spot what I mean by direct dependencies by considering the question Does the order of loop iteration execution affect the results ?. If, for example, iteration N+1 uses the results of iteration N you have such a dependency, running the loop iterations in reverse order will change the output of the routine.
By indirect dependencies I mean mainly data races, in which threads have to coordinate their access to shared data, in particular they have to ensure that writes to shared variables happen in the correct sequence.
In many cases you can redesign a loop-with-dependencies to remove those dependencies.
IF you have a high iteration count loop which has no such dependencies THEN you have a candidate for good speed-up with OpenMP. Here are the buts:
There is some parallel overhead to the computation at the start and end of each such loop, if the loop count isn't high enough this overhead may outweigh, partially or wholly, the speedup of running the iterations in parallel. The only way to determine if this is affecting your code is to test and measure.
There can be dependencies between loop iterations more subtle than I have already outlined. Depending on your system architecture and the computations inside the loop you might (without realising it) program your threads to fight over access to cache or to I/O resources, or to any other resource. In the worst cases this can lead to increasing the number of threads leading to decreasing execution rate.
You have to make sure that each OpenMP thread is backed up by hardware, not by the pseudo-hardware that hyperthreading represents. One core per OpenMP thread, hyperthreading is snake oil in this domain.
I expect there are other buts to put in here, perhaps someone else will help out.
Now, turning to your questions:
Should I use parallel code anyway? Test and measure.
Is it true that parallelization efficiency will increase with load of for-loops? Approximately, but for your code on your hardware, test and measure.
Finally, you can't become a serious parallel computationalist without measuring run times under various combinations of circumstances and learning what the measurements you make are telling you. If you can't compare sequential and parallel execution for huge amounts of data, you'll have to measure them for modest amounts of data and understand the lessons you learn before making predictions about behaviour when dealing with huge amounts of data.

Complex loop in a C++ program portable to OpenMP and MPI?

I have a C++ number crunching program. The structure is:
a) data input, data preparation
b) "big" loop, uses global and local data (lots of different variables in both cases)
c) postprocess results and write data
The most intensive part is "b", which is basically a loop. I need to speedup the program in a cluster. 25 blades, 4 cores each. I wonder whether I could use here OpenMP and MPI, or if you can point me to tutorials, not general cases, but complex and "big" for loops.
Thanks
Actually, you should use both.
Use MPI to distribute tasks between blades and OpenMP to fully utilize each blade. Take some time to understand how memory and sharing works on each case.
You cannot devide your task between blade using OpenMP. Try to devide you loop on several part and distribute capacity on them.
For example if you want composition of 2 vectors with N size. N/2 will be on one node and another part on another.
But transmition costs between blades is palpable. Thus if your task is not actually great. May be would be better if you distribute it into 4 cores.