OpenMP Working Precision - Fortran

I am having an issue with the simple code below. I am trying to use OpenMP with GFortran. The results of the code below for x should be the same with AND without the !$OMP statements, since the parallel and serial code should produce the same result.
program test
implicit none
!INCLUDE 'omp_lib.h'
integer i,j
Real(8) :: x,t1,t2
x=0.0d0
!$OMP PARALLEL DO PRIVATE(i,j) shared(X)
Do i=1,3
  Write(*,*) I
  !pause
  Do j=1,10000000
    !$OMP ATOMIC
    X=X+2.d0*Cos(i*j*1.0d0)
  end do
end do
!$OMP END PARALLEL DO
write(*,*) x
end program test
But strangely I am getting the following results for x:
Parallel: -3.17822355415XXXXX
Serial:   -3.1782235541569084
where XXXXX is some random digits. Every time I run the serial code, I get the same result (-3.1782235541569084). How can I fix it? Is this problem due to some OpenMP working precision option?

Floating-point arithmetic is not strictly associative. In f-p arithmetic neither a+(b+c)==(a+b)+c nor a*(b*c)==(a*b)*c is always true, as they both are in real arithmetic. This is well-known, and extensively explained in answers to other questions here on SO and at other reputable places on the web. I won't elaborate further on that point here.
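As a small illustration (my own minimal sketch, not part of the original question), the following program adds three ordinary double-precision values with two different groupings; the groupings are equal in exact real arithmetic but differ in the last bits of the floating-point result:

program assoc_demo
  implicit none
  Real(8) :: a, b, c
  a = 0.1d0
  b = 0.2d0
  c = 0.3d0
  ! Both lines compute a+b+c; only the grouping of the additions differs
  write(*,'(A,F20.17)') '(a+b)+c = ', (a + b) + c
  write(*,'(A,F20.17)') 'a+(b+c) = ', a + (b + c)
end program assoc_demo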
As you have written your program, the order of operations by which the final value of X is calculated is non-deterministic; that is, it may (and probably does) vary from execution to execution. The atomic directive only permits one thread at a time to update X, but it doesn't impose any ordering constraints on the threads reaching the directive.
Given the nature of the calculation in your program I believe that the difference you see between serial and parallel executions may be entirely explained by this non-determinism.
Before you think about 'fixing' this you should first be certain that it is a problem. What makes you think that the serial code's answer is the one true answer? If you were to run the loops backwards (still serially) and got a different answer (quite likely), which answer is the one you are looking for? In a lot of scientific computing, which is probably the core domain for OpenMP, the data available and the numerical methods used simply don't support assertions of the accuracy of program results beyond a handful of significant figures.
If you still think that this is a problem that needs to be fixed, the easiest approach is to simply take out the OpenMP directives.

To add to what High Performance Mark said, another source of discrepancy is that the compiler might have emitted x87 FPU instructions to do the math. x87 uses 80-bit internal precision, and optimised serial code would use only register arithmetic before it actually writes the final value to the memory location of X. In the parallel case, since X is a shared variable, the memory location is updated at each iteration. This means that the 80-bit x87 FPU register is flushed to a 64-bit memory location and then read back, and some bits of precision are thus lost on each iteration, which adds up to the observed discrepancy.
This effect is not present if a modern 64-bit CPU is used together with a compiler that emits SIMD instructions, e.g. SSE2+ or AVX. Those work with 64-bit internal precision, so keeping the value in a register does not give better precision than flushing it to memory and reloading it on each iteration. In that case the difference comes from the non-associativity explained by High Performance Mark.
Those effects are pretty much expected and usually accounted for. They are well studied and understood, and if your CFD algorithm breaks down when run in parallel, then the algorithm is highly numerically unstable and I would in no way trust the results it gives, even in the serial case.
By the way, a better way to implement your loop would be to use reduction:
!$OMP PARALLEL DO PRIVATE(j) REDUCTION(+:X)
Do i=1,3
  Write(*,*) I
  !pause
  Do j=1,10000000
    X=X+2.d0*Cos(i*j*1.0d0)
  end do
end do
This would allow the compiler to generate register-optimised code for each thread and then the loss of precision would only occur at the very end when the threads sum their local partial values in order to obtain the final value of X.
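For illustration only, here is a rough sketch (mine, not from the original answer) of what the reduction effectively does behind the scenes: each thread accumulates into its own private partial sum (a hypothetical helper variable xlocal, which would need to be declared Real(8)), and the partial sums are combined once at the end:

!$OMP PARALLEL PRIVATE(i, j, xlocal)
  xlocal = 0.0d0                       ! per-thread partial sum
  !$OMP DO
  Do i=1,3
    Do j=1,10000000
      xlocal = xlocal + 2.d0*Cos(i*j*1.0d0)
    end do
  end do
  !$OMP END DO
  ! each thread adds its partial sum to the shared X exactly once
  !$OMP ATOMIC
  X = X + xlocal
!$OMP END PARALLEL

The loss of precision and the ordering non-determinism are then confined to the handful of additions at the very end, which is exactly what the REDUCTION clause arranges for you.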

I used the ORDERED clause with your code and it works. But with this clause the code effectively runs the same as the serial version.

Related

Is it possible to speed-up my Fortran code by parallelization?

I have a question about parallel computing. I don't know much about parallel computing; I just came up with an idea and I would like to discuss its practicality.
I am working on a Fortran code, and I have many do loops. For example:
do i=1,1000
*some calculations
end do
The key point is that inside the loop I do calculations such that the result of the current iteration does not influence the next one (i.e. the calculations inside the loop are independent; I can obtain the result for i=100 without having the result for i=99).
What I would like is, instead of waiting for a single core to go over the whole loop, to distribute this task (the execution of the do loop) between the cores and make it faster.
For such a scenario, can I use parallel computing to increase the speed? I know there are some options in Intel Fortran for optimization and parallel computing, but is selecting these options enough? Or do I need any additional code or subroutines to enable parallel computing?
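For what it's worth, here is a minimal, self-contained sketch of what such a loop can look like with OpenMP; the sqrt is just a placeholder for "*some calculations", and you would compile with -fopenmp for gfortran or -qopenmp for Intel Fortran:

program parallel_loop_sketch
  use omp_lib
  implicit none
  integer :: i
  real(8) :: result(1000)

  ! The iterations are independent, so they can be divided among threads;
  ! the loop index i is private to each thread.
  !$OMP PARALLEL DO PRIVATE(i) SHARED(result)
  do i = 1, 1000
    result(i) = sqrt(real(i,8))   ! placeholder for the real calculations
  end do
  !$OMP END PARALLEL DO

  write(*,*) 'threads available:', omp_get_max_threads()
  write(*,*) result(1), result(1000)
end program parallel_loop_sketch

Compiler auto-parallelization options exist as well (e.g. Intel's -parallel), but an explicit directive like the one above usually gives you more control over what actually gets parallelized.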

Performance issue while parallelizing using OpenMP

I was trying to parallelize a code but it only deteriorated the performance. I wrote a Fortran code which runs several Monte Carlo integrations and then finds their mean.
implicit none
integer, parameter :: n=100
integer, parameter :: m=1000000
real, parameter :: pi=3.141592654
real MC,integ,x,y
integer tid,OMP_GET_THREAD_NUM,i,j,init,inside
read*,init
call OMP_SET_NUM_THREADS(init)
call random_seed()
!$OMP PARALLEL DO PRIVATE(J,X,Y,INSIDE,MC)
!$OMP& REDUCTION(+:INTEG)
do i=1,n
inside=0
do j=1,m
call random_number(x)
call random_number(y)
x=x*pi
y=y*2.0
if(y.le.x*sin(x))then
inside=inside+1
endif
enddo
MC=inside*2*pi/m
integ=integ+MC/n
enddo
!$OMP END PARALLEL DO
print*, integ
end
As I increase the number of threads, run-time increases drastically. I have looked for solutions for such problems and in most cases shared memory elements happen to be the problem but I cannot see how it is affecting my case.
I am running it on a 16 core processor using Intel Fortran compiler.
EDIT: The program after adding implicit none, declaring all variables and adding the private clause
You should not use RANDOM_NUMBER for high-performance computing and definitely not in parallel threads. There are NO guarantees about the quality of the standard random number generator nor about its thread safety. See Can Random Number Generator of Fortran 90 be trusted for Monte Carlo Integration?
Some compilers will use a fast algorithm that cannot be called in parallel. Some compilers will have a slow method that is callable from parallel code. Some will be both fast and allowed in parallel. Some will generate poor-quality random sequences, some better ones.
You should use some parallel PRNG library. There are many. See here for recommendations for Intel: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/283349 I use a library based on http://www.cmiss.org/openCMISS/wiki/RandomNumberGenerationWithOpenMP in my own slightly improved version https://bitbucket.org/LadaF/elmm/src/e732cb9bee3352877d09ae7f6b6722157a819f2c/src/simplevtk.f90?at=master&fileviewer=file-view-default but be careful: I don't care about the quality of the sequence in my applications, only about speed.
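To make the idea of thread-private generator state concrete, here is a toy sketch of my own (it is not from any of the libraries linked above, and its simple Park-Miller style generator is for illustration only, not for production Monte Carlo work): each thread owns its own seed, so no synchronization is needed when drawing numbers.

program thread_private_rng
  use omp_lib
  implicit none
  integer(8) :: seed
  integer    :: tid, i
  real(8)    :: r, total

  total = 0.0d0
  ! Every thread gets its own seed and therefore its own stream of numbers.
  !$OMP PARALLEL PRIVATE(seed, tid, i, r) REDUCTION(+:total)
  tid  = omp_get_thread_num()
  seed = 1_8 + 7919_8 * tid                     ! distinct nonzero seed per thread
  !$OMP DO
  do i = 1, 1000000
    seed  = mod(48271_8 * seed, 2147483647_8)   ! toy LCG step, no locking needed
    r     = real(seed, 8) / 2147483647.0d0      ! uniform draw in (0,1)
    total = total + r
  end do
  !$OMP END DO
  !$OMP END PARALLEL

  write(*,*) 'mean of the draws (should be close to 0.5):', total / 1.0d6
end program thread_private_rng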
Regarding the old version:
You have a race condition there.
With
inside=inside+1
more threads can be competing to read and write the variable. You will have to synchronize the access somehow. If you make it a reduction, you will have problems with
integ=integ+MC/n
If you make it private, then inside=inside+1 will only count locally.
MC also appears to be in a race condition, because more threads will be writing to it. It is not at all clear what MC does and why it is there, because you are not using the value anywhere. Are you sure the code you show is complete? If not, please see How to make a Minimal, Complete, and Verifiable example.
See With OpenMP parallelized nested loops run slow and many other examples of how a race condition can make a program slow.
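Putting the pieces together: in the edited program above, INSIDE and MC are PRIVATE (each thread keeps its own copies, re-initialised in every outer iteration) and INTEG is combined through REDUCTION, which is the right shape for this loop (note that integ should also be explicitly initialised to zero before the loop, since the reduction combines the threads' partial sums with its incoming value). A compressed, annotated sketch of that structure, which deliberately ignores the RANDOM_NUMBER quality and thread-safety issues discussed above:

!$OMP PARALLEL DO PRIVATE(j, x, y, inside, MC) REDUCTION(+:integ)
do i = 1, n
  inside = 0                        ! private: reset by each thread for its own i
  do j = 1, m
    call random_number(x)           ! see the caveats about RANDOM_NUMBER above
    call random_number(y)
    x = x*pi
    y = y*2.0
    if (y <= x*sin(x)) inside = inside + 1
  end do
  MC    = inside*2*pi/m             ! private scratch value, no race
  integ = integ + MC/n              ! safely combined by the reduction
end do
!$OMP END PARALLEL DO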

Where is integer addition and subtraction event count from intel Vtune?

I am using intel VTune to profile my program.
The CPU I am using is Ivy Bridge.
All the hardware instruction events can be found here:
https://software.intel.com/en-us/node/589933
FP_COMP_OPS_EXE.X87
Number of FP Computational Uops Executed this cycle. The number of FADD, FSUB, FCOM, FMULs, integer MULs and IMULs, FDIVs, FPREMs, FSQRTs, integer DIVs, and IDIVs. This event does not distinguish an FADD used in the middle of a transcendental flow from a s
FP_COMP_OPS_EXE.X87 seems to include integer multiplication and integer division; however, there is no integer addition or integer subtraction listed there. I cannot find those two kinds of instructions on the above website either.
Can anyone tell me what is the event that counts integer addition and integer subtraction instructions?
I'm reading a lot into your question, but here goes:
It's possible that if your code is computationally bound you could find ways to infer the significance of integer adds and subs without measuring them directly. For example, UOPS_RETIRED.ALL - FP_COMP_OPS_EXE.ALL would give you a very rough estimate of adds and subs, assuming that you've already done something to establish that your code is compute bound.
Have you? If not, it might help to start with VTune's basic analysis and then eliminate memory, cache and front end bottlenecks. If you've already done this, you have a few more options:
Cross-reference UOPS_DISPATCHED_PORT with an Ivy Bridge block diagram, or even better, a list of which specific types of arithmetic can execute on which ports (which I can't find).
Modify your program source, compiler flags or assembly, rerun a coarser-grained profile like basic analysis, and see whether you see an impact at the level of a measure like INST_RETIRED.ANY / CPU_CLK_UNHALTED.
Sorry there doesn't seem to be a more direct answer.

Fix arithmetic error in distributed version

I am inverting a matrix via a Cholesky factorization, in a distributed environment, as it was discussed here. My code works fine, but in order to test that my distributed project produces correct results, I had to compare it with the serial version. The results are not exactly the same!
For example, the last five cells of the result matrix are:
serial gives:
-250207683.634793 -1353198687.861288 2816966067.598196 -144344843844.616425 323890119928.788757
distributed gives:
-250207683.634692 -1353198687.861386 2816966067.598891 -144344843844.617096 323890119928.788757
I had posted on the Intel forum about this, but the answer I got was about getting the same results across all executions of the distributed version, which I already had. They seem (in another thread) to be unable to answer this:
How can I get the same results between serial and distributed execution? Is this possible? This would fix the arithmetic error.
I have tried setting mkl_cbwr_set(MKL_CBWR_AVX) and using mkl_malloc() in order to align memory, but nothing changed. I get the same results only when I spawn a single process for the distributed version (which makes it almost serial)!
The distributed routines I am calling: pdpotrf() and pdpotri().
The serial routines I am calling: dpotrf() and dpotri().
Your differences seem to appear at about the 12th s.f. Since floating-point arithmetic is not truly associative (that is, f-p arithmetic does not guarantee that a+(b+c) == (a+b)+c), and since parallel execution does not, generally, give a deterministic order of the application of operations, these small differences are typical of parallelised numerical codes when compared to their serial equivalents. Indeed you may observe the same order of difference when running on a different number of processors, 4 vs 8, say.
Unfortunately the easy way to get deterministic results is to stick to serial execution. To get deterministic results from parallel execution requires a major effort to be very specific about the order of execution of operations right down to the last + or * which almost certainly rules out the use of most numeric libraries and leads you to painstaking manual coding of large numeric routines.
In most cases that I've encountered the accuracy of the input data, often derived from sensors, does not warrant worrying about the 12th or later s.f. I don't know what your numbers represent but for many scientists and engineers equality to the 4th or 5th sf is enough equality for all practical purposes. It's a different matter for mathematicians ...
As the other answer mentions, getting exactly the same results between serial and distributed execution is not guaranteed. One common technique with HPC/distributed workloads is to validate the solution. There are a number of techniques, from calculating percent error to more complex validation schemes, like the one used by HPL. Here is a simple C++ function that calculates the relative error. As #HighPerformanceMark notes in his post, the analysis of this sort of numerical error is incredibly complex; this is a very simple method, and there is a lot of information available online about the topic.
#include <iostream>
#include <cmath>
// relative error of x with respect to the reference value a
double calc_error(double a,double x)
{
return std::abs(x-a)/std::abs(a);
}
int main(void)
{
double sans[]={-250207683.634793,-1353198687.861288,2816966067.598196,-144344843844.616425, 323890119928.788757};
double pans[]={-250207683.634692, -1353198687.861386, 2816966067.598891, -144344843844.617096, 323890119928.788757};
double err[5];
std::cout<<"Serial Answer,Distributed Answer, Error"<<std::endl;
for (int it=0; it<5; it++) {
err[it]=calc_error(sans[it], pans[it]);
std::cout<<sans[it]<<","<<pans[it]<<","<<err[it]<<"\n";
}
return 0;
}
Which produces this output:
Serial Answer,Distributed Answer, Error
-2.50208e+08,-2.50208e+08,4.03665e-13
-1.3532e+09,-1.3532e+09,7.24136e-14
2.81697e+09,2.81697e+09,2.46631e-13
-1.44345e+11,-1.44345e+11,4.65127e-15
3.2389e+11,3.2389e+11,0
As you can see, the order of magnitude of the error is in every case 10^-13 or less, and in one case zero. Depending on the problem you are trying to solve, an error of this order of magnitude could be considered acceptable. Hopefully this helps to illustrate one way of validating a distributed solution against a serial one, or at least gives one way to show how far apart the parallel and serial results are.
When validating answers for big problems and parallel algorithms it can also be valuable to perform several runs of the parallel algorithm, saving the results of each run. You can then look to see if the result and/or error varies with the parallel algorithm run or if it settles over time.
Showing that a parallel algorithm produces error within acceptable thresholds over 1000 runs (just an example; the more data the better for this sort of thing) for various problem sizes is one way to assess the validity of a result.
In the past, when I have performed benchmark testing, I have noticed wildly varying behavior for the first several runs before the servers have "warmed up". At the time I never bothered to check whether the error in the result stabilized over time the same way performance did, but it would be interesting to see.

Performance of C++ Operators

Is there any sort of performance difference between the arithmetic operators in C++, or do they all run equally fast? E.g. is "++" faster than "+=1"? What about "+=10000"? Does it make a significant difference if the numbers are floats instead of integers? Does "*" take appreciably longer than "+"?
I tried performing 1 billion each of "++", "+=1", and "+=10000". The strange thing is that the number of clock cycles (according to time.h) is actually counterintuitive. One might expect that if any of them are the fastest, it is "++", followed by "+=1", then "+=10000", but the data shows a slight trend in the opposite direction. The difference is more pronounced on 10 billion operations. This is all for integers.
I am dabbling in scientific computing, so I wanted to test the performance of the operators; for example, whether any of them run in time that is linear in their inputs.
About your edit: the language says nothing about the architecture it's running on. Your question is platform-dependent.
That said, typically all fundamental data-type operations have a one-to-one correspondence to assembly.
x86 for example has an instruction which increments a value by 1, which i++ or i += 1 would translate into. Addition and multiplication also have single instructions.
Hardware-wise, it's fairly obvious that adding or multiplying numbers takes time at least linear in the number of bits in the numbers. Because the hardware operates on a constant number of bits, it's O(1).
Floats have their own processing unit, usually, which also has single instructions for operations.
Does it matter?
Why not write code that does what you need it to do? If you want to add one, use ++. If you want to add a large number, add a large number. If you need floats, use floats. If you need to multiply two numbers, then multiply them.
The compiler will figure out the best way to do what you want, so instead of trying to be tricky, do what you need and let it do the hard work.
After you've written your working code, and you decide it's too slow, profile it and find out why. You'll find it's not silly things like multiplying versus adding, but rather going about the entire (sub-)problem in the wrong way.
Practically, all of the operators you listed will be done in a single CPU instruction anyway, on desktop platforms.
No, no, yes*, yes*, respectively.
* but do you really care?
EDIT: to give some kind of idea with a modern processor, you may be able to do 200 integer additions in the time it takes to make one memory access, and only 50 integer multiplications. If you think about it, you're still going to be bound by the memory accesses most of the time.
What you are asking is: what basic operations get transformed into which assembly instructions, and what is the performance of those instructions on my specific architecture? And this is also your answer: the code they get translated to depends on your compiler and its knowledge of your architecture, and their performance depends on your architecture.
Mind you: in C++, operators can be overloaded for user-defined types. They can behave differently from built-in types, and the implementation of the overload can be non-trivial (not just one instruction).
Edit: A hint for testing. Most compilers support outputting the generated assembly code. The option for gcc is -S. If you use some other compiler have a look at their documentation.
The best answer is to time it with your compiler.
Look up the optimization manuals for your CPU. That's the only place you're going to find answers.
Get your compiler to output the generated assembly. Download the manuals for your CPU. Look up the instructions used by the compiler in the manual, and you know how they perform.
Of course, this presumes that you already know the basics of how a pipelined, superscalar out-of-order CPU operates, what branch prediction, instruction and data cache and everything else means. Do your homework.
Performance is a ridiculously complicated subject. Depending on context, floating-point code may be as fast as (or faster than) integer code, or it may be four times slower. Usually branches carry almost no penalty, but in special cases, they can be crippling. Sometimes, recomputing data is more efficient than caching it, and sometimes not.
Understand your programming language. Understand your compiler. Understand your CPU. And then examine exactly what the compiler is doing in your case, by profiling/timing, and when necessary by examining the individual instructions. (When timing your code, be aware of all the caveats and gotchas that can invalidate your benchmarks: make sure optimizations are enabled, but also that the code you're trying to measure isn't optimized away. Take the cache into account: if the data is already in the CPU cache, it'll run much faster; if it has to be read from physical memory to begin with, it'll take extra time. Both can invalidate your measurements if you're not careful. Keep in mind exactly what you want to measure.)
For your specific examples, why should ++i be faster than i += 1? They do the exact same thing. Sometimes it may make a difference whether you're adding a constant or a variable, but here you're adding the constant one in both cases.
And in general, these instructions take a fixed, constant time regardless of their operands. Adding one to something takes just as long as adding -2000 or 1772051912. The same goes for multiplication and division.
But if you care about performance, you need to understand how the entire technology stack works, not just rely on a few simple rules of thumb like "integer is faster than floating point, and ++ is faster than +=". (Apart from anything else, such simple rules of thumb are almost never true, at least not in every case.)
Here is a twist on your evaluations: try Loop Unrolling. Loop unrolling is where you repeat the same statements in a loop to reduce the number of iterations in the loop.
Most modern processors hate branch instructions. The processors have a queue of pre-fetched instructions, which speeds up processing. They really hate branch instructions, because the processor has to clear out the queue and reload it after a branch. This takes more time than just processing sequential instructions.
When coding for processing time, try to minimize the number of branches, which can occur in loop constructs and decision constructs.
It depends on the architecture. The built-in operators for integer arithmetic translate directly to assembly (as I understand it); ++, +=1, and +=10000 are probably equally fast. Multiplication would depend on the platform, and overloaded operators would depend on you.
Donald Knuth : "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
Unless you are writing extremely time-critical software, you should probably worry about other things.
Short answer: you should turn optimizations on before measuring.
The long answer: If you turned optimizations on, you're performing the operations on integers, and still you get different times for ++i; and i += 1;, then it's probably time to get a better compiler -- the two statements have exactly the same semantics and a competent compiler should translate them into the same instruction sequence.
"Does it make a significant difference if the numbers are floats instead of integers?"
-It depends on what kind of processor you are running on. Integer operations are faster on current x86 compatible CPUs.
About i++ and i+=1: there shouldn't be a difference with any good compiler, while you may expect i+=10000 to be slightly slower on x86 CPUs.
"Does "*" take appreciably longer than "+"?"
-Typically yes.
Note that you may run into all sorts of bottlenecks, in which case the speed difference between the operations doesn't show up. Eg. memory bandwidth, CPU pipeline stall due to data dependencies, etc...
The performance problems caused by C++ operators do not come from the operators themselves or from their implementation. They come from the syntax, from hidden code being run without you knowing it.
The best example is implementing quicksort on an object that has operator[] implemented but internally uses a linked list. Now instead of O(n log n) [1] you will get O(n^2 log n).
The problem with performance is that you cannot know exactly what your code will eventually turn into.
[1] I know that quicksort is actually O(n^2) in the worst case, but it rarely gets there; the average case gives you O(n log n).