Why is Fortran slower than Octave? - fortran

Normally, Fortran is leaps and bounds faster than Octave. However, I've noticed that when performing similar matrix manipulations with Fortran's "spread" function and Octave's "repmat" function, Octave runs about twice as fast as my compiled Fortran program. Can anyone explain why that is? Is there something I need to do to increase the Fortran version's performance?
First, here's my simple Fortran program:
program test_program
    double precision, parameter, dimension(1000,500) :: A = reshape([ ... ], [1000,500])
    logical, dimension(:,:,:), allocatable :: blockL
    integer, dimension(2) :: Adim

    Adim = shape(A)
    blockL = spread(A,3,Adim(1)) == spread(transpose(A),1,Adim(1))
end program test_program
Now here's my corresponding program, written in Octave:
A = [ ... ]; % This is the same "A" that was used in Fortran
Adim1 = size(A,1);
blockL = repmat(A,[1 1 Adim1])==repmat(permute(A,[3 2 1]),[Adim1 1 1]);
Once compiled, the Fortran program takes about fifteen seconds to run. The Octave program takes about eight. Shouldn't a compiled program always be faster than an interpreted one? Any ideas on what I may be doing wrong, or how I could speed up my Fortran program?
I'm using the gfortran compiler on a machine that is running Lubuntu 14.04. The following shows exactly how I'm compiling it, when I type my command at the Linux console:
gfortran test_program.f08 -o test_program
I have Octave installed on the same machine, so both programs are using the same resources and hardware when being compared.
Thanks so much for your time and attention. I appreciate any guidance that anyone is able, or willing, to provide.

As many have pointed out in the comments, an interpreted language like Octave is only slow in proportion to the number of interpreted lines, or command calls, in your program. When the work happens inside intrinsic functions, Octave can be relatively fast.
As for building a Fortran program that is as fast as the Octave version, I ultimately turned to coarrays.
Fortran's coarray feature is an excellent way for programmers to exploit the benefits of multi-processor computers. Octave's implementation relies on optimized libraries such as BLAS, which can also exploit the parallel nature of processors through their SIMD features. Although I'm not completely sure that coarrays use that exact mechanism at the processor level (they probably do), they do allow programmers to include parallelization in their programs.
By using coarrays, I was able to write a program that is as fast as the fastest Octave version of my program; maybe even faster.
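Here is a minimal sketch of the kind of image-based work splitting I mean. This is illustrative only, not my actual program: the matrix is filled with random placeholder values, and the compile commands in the comments are just examples.
program coarray_compare
    ! Illustrative sketch: each image builds its own slab of the third
    ! dimension of blockL, so the element-wise comparison is divided
    ! across the images. Compile with coarray support, for example:
    !   gfortran -fcoarray=single sketch.f90       (one image, for testing)
    !   caf sketch.f90 && cafrun -n 4 ./a.out      (OpenCoarrays, 4 images)
    implicit none
    integer, parameter :: n = 1000, m = 500
    double precision :: A(n,m)
    logical, allocatable :: blockL(:,:,:)
    integer :: k, lo, hi, chunk

    call random_number(A)   ! placeholder for the real matrix

    ! Split the n "pages" of blockL evenly across the images.
    chunk = (n + num_images() - 1) / num_images()
    lo = (this_image() - 1)*chunk + 1
    hi = min(this_image()*chunk, n)

    allocate(blockL(n, m, lo:hi))
    do k = lo, hi
        ! blockL(i,j,k) = (A(i,j) == A(k,j)), the same relation the
        ! spread/transpose expression computes.
        blockL(:,:,k) = ( A == spread(A(k,:), 1, n) )
    end do

    sync all
end program coarray_compare
Each image only ever touches its own slab of blockL, so no data needs to move between images before the final sync all.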
One commenter suggested compiling my original Fortran program using the "-O3" optimization option with gfortran. In my case, doing so resulted in no speed increase.

Related

IEEE_UNDERFLOW_FLAG IEEE_DENORMAL in Fortran 77

I am new to Fortran and coding in general so I apologize if my terminology is not correct.
I am using a Linux machine with the gfortran compiler.
I am doing research this summer which involves me getting a program written in about 1980 working again. It is written in Fortran 77. I have all the code as well as some documentation about it.
In its current form, I am receiving an "IEEE_UNDERFLOW_FLAG IEEE_DENORMAL" error. My first thought is that this code was meant to be developed under a different environment/architecture.
The documentation states “This program was designed to run on the HARRIS computer system. It also can be run on VAX system if the single precision variables are changed into double precision variables both in the main code and the subroutine package.”
I have tried changing the single precision variables to double precision variables, but I may have done that wrong. If this is the correct thing to do any insight would be great.
I have also tried compiling the code with -std=legacy and -m32. I receive the same error from these as well.
Any advice to get me going in the right direction would be greatly appreciated.
"IEEE_UNDERFLOW_FLAG IEEE_DENORMAL is signalling" is not that uncommon. It is NOT an error message.
The meaning is that there are denormal numbers generated when running the code.
It may be a hint about numerical problems in your code, but it is not an error per se. Probably it means that your program finished successfully.
Fortran, in its latest edition, requires that all floating-point exceptions that are signalling be reported when a STOP statement is executed (see gfortran IEEE exception inexact). Incidentally, that also means that your program is not being compiled as Fortran 77 but as Fortran 2003 or higher.
Note that even if you request the Fortran 95 standard by -std=f95 the note is still displayed, but it can be controlled by the -ffpe-summary=list flag.
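For example, a command along these lines (the file name is only a placeholder) silences the summary entirely:
gfortran -std=f95 -ffpe-summary=none prog.f90 -o prog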
The linked answer also says that a way to avoid these warnings is to not finish the program by a STOP statement, but by running till the END PROGRAM. If you have something like
STOP
END
or
STOP
END PROGRAM
in your code, just remove the STOP; it is useless, if not outright harmful.
Using double precision may or may not get rid of the report. If there are numerical problems in the algorithms, they will still be there with doubles; they may just become less apparent, or they might not, it depends. You don't have to re-write your code for that, just use -fdefault-real-8 or -freal-4-real-8 or similar. Read more about these options in your gfortran manual. You could even try quadruple precision, but double should normally be sufficient for all reasonable algorithms.
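For instance, something like the following (the file name is hypothetical) recompiles the old source with default reals promoted to 8 bytes:
gfortran -std=legacy -fdefault-real-8 -fdefault-double-8 old_program.f -o old_program
Here -fdefault-double-8 keeps DOUBLE PRECISION at 8 bytes, since -fdefault-real-8 on its own would promote it further.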

How to speed up this C++ program with eigen library against matlab?

I want to use C++ for big linear algebra computations. As a starting step, I created these comparison programs in C++ and MATLAB, and I am giving the astonishing execution times here. Can you suggest a way to beat MATLAB, or at least get comparable performance? I know that C++ uses highly vectorized methods for computations. So for large scientific programming involving linear algebra, should one always go for MATLAB instead of C++? I personally thought that MATLAB doesn't give good performance for large computations, and that C++ is therefore preferred in such cases. However, my program's results go contrary to this belief.
C++ program compiled with gcc:
#include <iostream>
#include <Eigen/Dense>  // Eigen library (note: forward slash, not backslash)
using namespace Eigen;
using namespace std;

int main()
{
    MatrixXd A;
    A.setRandom(1000, 1000);   // 1000 x 1000 matrix of random entries
    MatrixXd B;
    B.setRandom(1000, 1000);
    MatrixXd C;
    C = A * B;                 // dense matrix-matrix product
}
Execution time: 24.141 s
Here is the MATLAB program:
function [ ] = Trial( )
    clear all;
    close all;
    clc;
    tic;
    A = rand([1000,1000]);
    B = rand([1000,1000]);
    C = A*B;
    toc
end
Elapsed time is 0.073883 seconds.
It is extremely hard to beat MATLAB, even with all optimizations turned on. To get the most out of Eigen you need to compile with parallel support (-fopenmp in gcc), and turn optimizations on (-O3). Even in this case, MATLAB will be slightly faster, mainly because it is using the Intel MKL proprietary library to get the most out of Intel chips, so unless you buy it I don't think you will be able to beat it. I am currently using Eigen for a project and wasn't able to beat MATLAB (at least not for dense matrix multiplication).
For example, for A*B where A and B are 1000 x 1000 complex matrices, the best average time I can get is:
MATLAB: 0.32 seconds
Eigen: 0.44 seconds
For 2000 x 2000,
MATLAB: 2.80 seconds
Eigen: 3.45 seconds
System: MacbookPro 2013, OS X.
PS: you should make absolutely sure that you turn optimizations on (-O3) and also compile with parallel support (-fopenmp). This is most likely the reason you're getting such a huge difference in running time. So you should compile your program as:
g++ -O3 -fopenmp <other compiling flags/parameters> main.cpp
To get the best of Eigen, compile with optimizations ON (e.g. -O3 compiler flag), with OpenMP enabled (e.g., -fopenmp), and disable hyper-threading or specify to openmp the true number of physical cores (e.g., export OMP_NUM_THREADS=4 if you have 8 hyper-threaded "cores", but 4 physical cores).
Finally, you might also consider using the devel branch and enable AVX (e.g., -mavx) and FMA if your CPU does support FMA (e.g., -mfma).
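For instance, a full invocation might look like the following (the file name main.cpp is assumed, and the AVX/FMA flags apply only if your CPU supports those instruction sets):
g++ -O3 -fopenmp -mavx -mfma main.cpp -o main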
Actually, MATLAB (unless you buy the expensive Parallel Computing Toolbox) hardly uses multi-threading itself. Multi-threading is only used in the libraries called by MATLAB, which are probably more efficient than what you're using now.
You can check this link to understand (and check) which libraries your MATLAB uses: http://undocumentedmatlab.com/blog/math-libraries-version-info-upgrade
It's also possible to use those libraries in your own C++ program (they might have hidden the headers or something, but at least you still have the .dll files, since MATLAB needs them to run).

How to determine what gfortran is vectorizing

I am trying to write a massively parallel Monte Carlo code, part of which will be exported to a Xeon Phi coprocessor. To ensure that I am using the coprocessor efficiently, I would like to see which parts of my code the compiler, currently gfortran, is able to vectorize. I understand I can do this with the ifort option -vec-report. However, I won't have access to the coprocessor for about a month and therefore am stuck with gfortran for the time being. Still, I would like to start optimizing now if possible. Unfortunately, I cannot seem to find the command-line flag for gfortran that tells me which parts of the code are being vectorized. Is there one? If so, what is it?
thanks
You can try whether -fopt-info suits your needs.
You can get more output by using -fopt-info-all, which includes information on both successful and missed optimizations.
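For example (the source file name here is just a placeholder):
gfortran -O3 -fopt-info-vec-optimized mc_kernel.f90 -o mc_kernel
Using -fopt-info-vec-missed instead reports the loops the vectorizer had to give up on.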
The vectorizer can be instructed to be verbose and report what it does:
-ftree-vectorizer-verbose=n
where larger integer n means more verbose report.
For more see http://gcc.gnu.org/projects/tree-ssa/vectorization.html
(It took me 1 minute to google it).

Why does trivial loop in python run so much slower than the same in C++? And how to optimize that? [duplicate]

This question already has answers here:
Why are Python Programs often slower than the Equivalent Program Written in C or C++?
(11 answers)
Closed 9 years ago.
Simply running a near-empty for loop in Python and in C++ (as follows), the speeds are very different; the Python version is more than a hundred times slower.
a = 0
for i in xrange(large_const):
    a += 1

int a = 0;
for (int i = 0; i < large_const; i++)
    a += 1;
Plus, what can I do to optimize the speed of python?
(Addition: I made a bad example in the first version of this question. I don't really mean that a = 1, so that a C/C++ compiler could optimize it away; I mean that the loop itself consumes a lot of resources (maybe I should have used a += 1 as the example). And what I mean by "how to optimize" is: if the loop body is as simple as a += 1, how could it run at a speed similar to C/C++? In my practice I use NumPy, so I can't use PyPy any more (for now). Are there general methods for making loops far faster, such as using generators when building lists?)
A smart C compiler can probably optimize your loop away by recognizing that the final value of a can be computed at compile time. Python can't do that because, when iterating over xrange, it needs to call __next__ on the xrange object until it raises StopIteration. Python can't know whether __next__ will have side effects until it calls it, so there is no way to optimize the loop away. The take-away message from this paragraph is that it is MUCH HARDER to optimize a Python "compiler" than a C compiler, because Python is such a dynamic language and requires the compiler to know how an object will behave in certain circumstances. In C, that's much easier because C knows exactly what type every object is ahead of time.
Of course, compiler aside, Python needs to do a lot more work. In C, you're working with base types using operations supported by hardware instructions. In Python, the interpreter is interpreting the byte-code one instruction at a time in software. Clearly that is going to take longer than machine-level instructions. And the data model (e.g. calling __next__ over and over again) can also lead to a lot of function calls which C doesn't need to make. Of course, Python does this to make it much more flexible than a compiled language.
The typical way to speed up Python code is to use libraries or intrinsic functions which provide a high-level interface to low-level compiled code. scipy and numpy are excellent examples of this kind of library. Other things you can look into are using pypy, which includes a JIT compiler -- you probably won't reach native speeds, but it'll probably beat CPython (the most common implementation) -- or writing extensions in C/Fortran using the CPython API, cython, or f2py for performance-critical sections of code.
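As a rough illustration of the f2py route (the module and file names below are made up, and the loop is deliberately trivial), you move the hot loop into a small Fortran file:
subroutine accumulate(n, total)
    ! Sum 1.0 n times in compiled code instead of a Python-level loop.
    implicit none
    integer, intent(in) :: n
    double precision, intent(out) :: total
    integer :: i
    total = 0.0d0
    do i = 1, n
        total = total + 1.0d0
    end do
end subroutine accumulate
Building it with f2py -c -m fastloop accumulate.f90 produces an importable module, and from Python the call total = fastloop.accumulate(large_const) runs the whole loop at compiled speed.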
Simply because Python is a more high level language and has to do more different things on every iteration (like acquiring locks, resolving variables etc.)
“How to optimise” is a very vague question. There is no “general” way to optimise any Python program (everything possible has already been done by the developers of Python). Your particular example can be optimised this way:
a = 1
That's what any C compiler will do, by the way.
If your program works with numeric data, then using numpy and its vectorised routines often gives you a great performance boost, as it does everything in pure C (using C loops, not Python ones) and doesn't have to take interpreter lock and all this stuff.
Python is (usually) an interpreted language, meaning that the script has to be read line-by-line at runtime and its instructions compiled into usable bytecode at that point.
C is (usually) a compiled language, so by the time you're running it you're working with pure machine code.
Python will never be as fast as C, for that reason.
Edit: In fact, Python compiles your script into bytecode at run time (that's why you get those .pyc files), not into C code.
As you go more abstract, the speed goes down. The fastest code is assembly code written directly by hand.
Read this question Why are Python Programs often slower than the Equivalent Program Written in C or C++?

Fortran: differences between generated code compiled using two different compilers

I have to work on a Fortran program which used to be compiled using Microsoft Compaq Visual Fortran 6.6. I would prefer to work with gfortran, but I have run into lots of problems.
The main problem is that the generated binaries behave differently. My program takes an input file and then has to generate an output file. But sometimes the binary compiled by gfortran crashes before the end, or gives different numerical results.
This is a program written by researchers which uses a lot of floating-point numbers.
So my question is: what are the differences between these two compilers which could lead to this kind of problem?
edit:
My program computes the values of some parameters and there are numerous iterations. At the beginning, everything goes well. After several iterations, some NaN values appear (only when compiled by gfortran).
edit:
Thank you everybody for your answers.
So I used the Intel compiler, which helped me by giving some useful error messages.
The origin of my problems is that some variables are not initialized properly. It looks like, when compiling with Compaq Visual Fortran, these variables are automatically set to 0, whereas with gfortran (and Intel) they take random values, which explains the numerical differences that add up over the following iterations.
So now the solution is to gain a better understanding of the program in order to correct these missing initializations.
There can be several reasons for such behaviour.
What I would do is:
Switch off any optimization
Switch on all debug options. If you have access to e.g. the Intel compiler, use ifort -CB -CU -debug -traceback. If you have to stick to gfortran, use valgrind (and the gfortran checking flags sketched after this answer); valgrind's output is somewhat less human-readable, but it's often better than nothing.
Make sure there are no implicit typed variables, use implicit none in all the modules and all the code blocks.
Use consistent float types. I personally always use real*8 as the only float type in my codes. If you are using external libraries, you might need to change call signatures for some routines (e.g., BLAS has different routine names for single and double precision variables).
If you are lucky, it's just some variable not getting initialized properly, and you'll catch it with one of these techniques. Otherwise, as M.S.B. was suggesting, a deeper understanding of what the program really does is necessary. And, yes, it might be necessary to check the algorithm manually starting from the point where you say "some NaN values appear".
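As a concrete sketch of the gfortran side of this (the flags are standard gfortran options; the file name is made up), the following turns off optimization, enables run-time checks, and fills uninitialized reals with signalling NaNs so the program traps at the first use of an uninitialized value:
gfortran -O0 -g -fbacktrace -fcheck=all -finit-real=snan -ffpe-trap=invalid,zero,overflow -Wall suspect_code.f -o suspect_code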
Different compilers can emit different instructions for the same source code. If a numerical calculation is on the boundary of working, one set of instructions might work, and another not. Most compilers have options to use more conservative floating point arithmetic, versus optimizations for speed -- I suggest checking the compiler options that you are using for the available options. More fundamentally this problem -- particularly that the compilers agree for several iterations but then diverge -- may be a sign that the numerical approach of the program is borderline. A simplistic solution is to increase the precision of the calculations, e.g., from single to double. Perhaps also tweak parameters, such as a step size or similar parameter. Better would be to gain a deeper understanding of the algorithm and possibly make a more fundamental change.
I don't know about the crash, but some differences in the results of numerical code on an Intel machine can be due to one compiler using 80-bit extended-precision floating point and the other 64-bit doubles, if not for declared variables then perhaps for temporary values. Moreover, floating-point computation is sensitive to the order in which elementary operations are performed, and different compilers may generate different sequences of operations.
Differences in different type implementations, differences in various non-Standard vendor extensions, could be a lot of things.
Here are just some of the language features that differ (look at gfortran and Intel). Programs written to the Fortran standard work the same on every compiler, but a lot of people don't know which language features are standard and which are vendor extensions, and so use the extensions; when the code is compiled with a different compiler, troubles arise.
If you post the code somewhere I could take a quick look at it; otherwise, like this, 'tis hard to say for certain.