Are FORTRAN 77 programs faster than Fortran 90 ones?

Today I was reading code from some very popular numerical libraries written in FORTRAN 77, such as QUADPACK (last updated in 1987), and I wondered whether there is any reason not to rewrite those libraries in Fortran 90, apart from the sheer amount of work it would take, given the great improvements Fortran 90 brought to the language: free-form source, better control structures so GO TO can be forgotten, vectorization, interfaces, and so on.
Is it because FORTRAN 77 compilers produce more optimized code, or perhaps because F77 is better suited to parallel execution? Note that I'm not even talking about Fortran 2003, for example, which is only 8 years old; I'm talking about Fortran 90, which I assume is widespread enough by now, with mature compiler support. I don't have contact with the industry, though.
Edit: janneb is right: LAPACK is actually written in Fortran 90.

Well, as "pst" mentioned, the "sheer amount of work it would take" is a pretty major reason.
Some further minor points:
LAPACK IS Fortran 90 these days, in the sense that the latest version no longer compiles with an F77 compiler. That being said, it's far from a rewrite; only a few things were changed.
Most of the F90 features you mention make it easier and faster to write robust programs; they don't necessarily make the resulting programs any faster.
It wasn't that long ago that free F90 compilers were not available (plenty of people used g77!), so for a widely used library like LAPACK, not using F90 features was likely a conscious decision.
An F77 compiler does not, generally, produce faster code than an F90 compiler, if for no other reason than that an F77-only compiler is likely obsolete and cannot optimize for the latest CPUs. A modern Fortran compiler might generate faster code from F77 than from an equivalent F90 program that makes extensive use of features like pointers, but that is highly dependent on the program in question (e.g. pointers and fancier data structures may allow better algorithms, letting the F90 program produce results faster even though it executes at a lower average utilization of the CPU's arithmetic units).
Vectorization, if by this you mean the F90+ array syntax, is mostly a programmer-convenience issue rather than a source of faster code. A competent compiler will vectorize the equivalent DO loop just as well.
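For instance (a sketch of my own; -fopt-info-vec is a standard gfortran flag that reports which loops were vectorized), both forms below are reported as vectorized when compiled with gfortran -O3 -fopt-info-vec:

subroutine add_one(a, n)
  ! Both statements do the same work; a competent compiler vectorizes
  ! the explicit DO loop just as readily as the array syntax.
  integer, intent(in) :: n
  real, intent(inout) :: a(n)
  integer :: i
  a = a + 1.0        ! F90 array syntax
  do i = 1, n        ! equivalent F77-style DO loop
     a(i) = a(i) + 1.0
  end do
end subroutine add_one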

There is no reason to put in a ton of work to (maybe) improve something that already works well. Note that even though Fortran 90 introduced many new features, it did not change the existing language: the Fortran 90 standard is backwards compatible with FORTRAN 77, and a modern compiler will optimize FORTRAN 77 code just as well. Some features introduced in Fortran 90 (e.g. pointers) can actually hinder efficiency and should be avoided if one cares about execution time, e.g. in HPC.
So it would not really make a difference. To a modern compiler, well-written FORTRAN 77 is just as optimizable; it is like a cake without icing. With Fortran 90 and later, the icing looks better than it tastes: it is a convenience for the programmer, but it does not necessarily improve the program's efficiency.
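To illustrate the pointer point, here is a minimal sketch (the names are invented for illustration): POINTER dummy arguments are allowed to alias each other, so the compiler must generate conservative code, whereas ordinary dummy arrays in standard Fortran must not overlap, which leaves the optimizer free:

module ptr_demo
contains
  subroutine scale_ptr(x, y)
    ! x and y may point at overlapping storage, so the compiler
    ! must allow for aliasing here
    real, pointer :: x(:), y(:)
    x = 2.0 * y
  end subroutine scale_ptr

  subroutine scale_arr(x, y)
    ! ordinary dummy arguments must not alias, so the compiler
    ! is free to vectorize and reorder aggressively
    real, intent(out) :: x(:)
    real, intent(in)  :: y(:)
    x = 2.0 * y
  end subroutine scale_arr
end module ptr_demo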

There is one major reason why Fortran 77 programs might be faster:
Allocatable arrays (Fortran 90) are much slower than arrays whose sizes are declared at compile time.
The two are not placed in the same region of memory: that is the stack-vs-heap distinction in Fortran memory management.
See here for instance
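Concretely, the two styles look like this (a minimal sketch; names and sizes are arbitrary):

program alloc_demo
  integer, parameter :: n = 1000
  real :: a(n)               ! size fixed at compile time (typically stack)
  real, allocatable :: b(:)  ! size chosen at run time (heap)
  allocate(b(n))             ! the run-time allocation happens here
  a = 1.0
  b = 2.0
  print *, sum(a), sum(b)
  deallocate(b)
end program alloc_demo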
This being said, Fortran 90+ has fast 'all in block' array operations. I'm a big fan of GCC Fortran (gfortran), and without any compilation options, for an array a of size N,
a = a + 1
is 4 times faster than
do i = 1, N
   a(i) = a(i) + 1
end do
This benchmark is from me, on my machine, with gfortran and no optimization options, and without any claim that it is professional benchmarking.
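For anyone who wants to reproduce that kind of comparison, a minimal timing harness could look like this (a sketch of mine, not the original benchmark; the size is arbitrary):

program bench
  integer, parameter :: n = 10000000
  real, allocatable :: a(:)
  integer :: i, t0, t1, rate
  allocate(a(n))
  a = 0.0
  call system_clock(t0, rate)
  a = a + 1.0                  ! whole-array ('all in block') operation
  call system_clock(t1)
  print *, 'array syntax:', real(t1 - t0) / rate, 's'
  call system_clock(t0)
  do i = 1, n                  ! explicit DO loop
     a(i) = a(i) + 1.0
  end do
  call system_clock(t1)
  print *, 'do loop:     ', real(t1 - t0) / rate, 's'
end program bench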
The allocation "problem" (which is not a problem but a feature) is not architecture, compiler or anything related but to the Fortran Standard.

Whether an F77 program will be slower or faster than the equivalent F90 program rewritten with newer syntax is debatable, but let's set that aside for now.
What should be taken into account, however, is that nobody really cares about speed alone. People care about speed of execution only where it is relevant and commercially worthwhile (I'm sure there is a better term for this, but nothing comes to mind right now ... cost effective, maybe).
Since those two (rather popular) libraries are still in F77, the general opinion is evidently that the costs of rewriting them outweigh the benefits gained, both in terms of speed of execution and in terms of the cost effectiveness of the whole process.

Related

Why is my Visual Studio 2013 slow at compiling C++ code?

In a C++ project, even if I change only minor code, a lot of time is spent on compiling. I also found that while compiling, cl.exe loads a huge amount of memory, growing at several MB per second; does anyone have an idea about this situation?
Anyway, the code's result is correct, but a lot of time is wasted on every compile. I hope someone can help.
That's actually not a lot of memory.
Modern C++ compilers assume that developers have decent PCs. Unlike JIT compilers (as in Java JVMs), normal C++ compilers have a very strong bias towards better code generation at the expense of compile-time efficiency.
This is also driven by the fact that PCs are generally a lot cheaper than developers: 8-16 GB of memory is not excessively expensive.

Transpiling to C vs C++: range of CPU instructions

I am considering the question of transpiling a language (home-grown DSL) to C vs to C++.
I haven't done any 'native' programming for over 15 years, so I want to check my assumptions.
Am I right in assuming that transpiling to the newest C++ version (17) would enable the native compiler to use a much wider range of 'modern' Intel/AMD CPU instructions, resulting in a more efficient executable (beyond the multi-threading / memory-model part of C++, which by itself already seems a good enough reason to go for C++)?
Put another way, isn't a large part of the 'more recent' CPU instructions never generated by a C compiler, simply because it has too little information about programmer intent, due to the simpler syntax of C? I know I could access all CPU instructions with assembler, but that is precisely what I don't want to do. Ideally, I would want the generated code to remain as platform-independent as possible.
All of your assumptions about the relationship between programming language and "modern CPU instructions" are incorrect.
Let's consider the GNU Compiler Collection.
The choice of language here doesn't much matter, as the language front-ends all end up generating the same intermediate form called GIMPLE. The optimizing passes then work on that.
The range of CPU instructions which can be emitted is controlled by the -march option (-mtune only adjusts tuning for a given CPU without changing the instruction set). For x86, GCC is capable of emitting modern AVX-512 instructions when optimizing some very plain-looking C code. Automatic loop vectorisation is a powerful thing. Try it out: implement memcpy and look at the generated assembly.
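And since all the front-ends share GIMPLE, the experiment is not limited to C. As a hedged sketch (the routine is my own; the flags are standard GCC ones), the plain copy loop below gets turned into wide vector moves when compiled with gfortran -O3 -march=skylake-avx512 -S:

subroutine copy_block(dst, src, n)
  ! A plain, un-clever copy loop; GCC's auto-vectorizer emits
  ! vector loads/stores for it when the target ISA allows.
  integer, intent(in)  :: n
  real(8), intent(out) :: dst(n)
  real(8), intent(in)  :: src(n)
  integer :: i
  do i = 1, n
     dst(i) = src(i)
  end do
end subroutine copy_block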
My advice: generate clean, un-clever C code, and crank up the optimization level. Just like you would do if writing code by hand.
You might also consider implementing your language directly as a front-end to GCC or LLVM, without transpiling to C or C++. LLVM was designed for this purpose: it is intended to make implementing new languages easy while still taking advantage of modern optimization approaches.

Compiler memory consumption with template libraries (boost + Eigen)

I am writing a template algorithm that makes use of boost::accumulators and the Eigen linear algebra library.
While compiling, the Visual Studio compiler's (cl.exe) memory consumption peaks at over 2.5 GB of RAM, and my PC (Windows 7 32-bit with 3 GB virtual address space) becomes unresponsive for quite a long time (~1 minute). The binary files (.obj) are 10-20 MB for these compilation units.
My questions (not directed towards these specific libraries):
Is this normal behavior for code that heavily uses templates?
Is there something that can be done to reduce the memory demands and compile time?
If there is no good solution to the problem, why isn't this addressed by the people who design the programming language? The more people understand C++, the more likely they are to use templates, generating hard-to-compile code and bloated binaries.
If there is no good solution to the problem, why isn't this addressed by the people that design the programming language?
Because there is no good solution, full stop.
The problem you are talking about has nothing to do with C++. It is an artefact inherited from C: the old "translation unit" model. Fixing this problem would require redoing the compilation model. The C++ committee has been trying for years to make this happen without breaking every single line of existing C++ out there (which is the bigger consideration), but it's not a trivial problem, and fixing it would require vast changes.
Also, Clang performs much better here, and newer, variadic-template-equipped versions of GCC can do well too.
Yes, compiling templates is memory-consuming. Some implementations suck more than others. In my personal experience, out of GCC, MSVC and Clang, the last is the best at managing its memory use.
You can split your huge source files into several smaller ones. That would even out the load over several compile steps.
The people who designed the programming language cared very little about the implementation, to give compiler writers enough freedom to excel and compete. Or suck.

Explanation of CUDA C and C++

Can anyone give me a good explanation as to the nature of CUDA C and C++? As I understand it, CUDA is supposed to be C with NVIDIA's GPU libraries. As of right now CUDA C supports some C++ features but not others.
What is NVIDIA's plan? Are they going to build upon C and add their own libraries (e.g. Thrust vs. STL) that parallel those of C++? Are they eventually going to support all of C++? Is it bad to use C++ headers in a .cu file?
CUDA C is a programming language with C syntax. Conceptually it is quite different from C.
The problem it is trying to solve is coding multiple (similar) instruction streams for multiple processors.
CUDA offers more than Single Instruction Multiple Data (SIMD) vector processing, but it needs data streams to greatly outnumber instruction streams, or there is much less benefit.
CUDA gives some mechanisms to do that, and hides some of the complexity.
CUDA is not optimised for multiple diverse instruction streams like a multi-core x86.
CUDA is not limited to a single instruction stream like x86 vector instructions, or limited to specific data types like x86 vector instructions.
CUDA supports 'loops' which can be executed in parallel. This is its most critical feature. The CUDA system will partition the execution of 'loops' and run the 'loop' body simultaneously across an array of identical processors, while providing some of the illusion of a normal sequential loop (specifically, CUDA manages the loop "index"). The developer needs to be aware of the GPU machine structure to write 'loops' effectively, but almost all of the management is handled by the CUDA run-time. The effect is that hundreds (or even thousands) of 'loop' iterations complete in the same time as one.
CUDA supports what look like if branches. Only processors running code which matches the if test can be active, so a subset of processors is active for each 'branch' of the test. As an example, an if ... else if ... else ... has three branches. Each processor executes only one branch and is 're-synched', ready to move on with the rest of the processors, when the if is complete. It may be that some branch conditions are not matched by any processor, so there is no need to execute that branch (for that example, three branches is the worst case); then only one or two branches execute sequentially, completing the whole if more quickly.
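To make the 'loop' and if ideas concrete, here is a rough sketch in CUDA Fortran (NVIDIA's Fortran flavour of CUDA, chosen to match the other examples on this page; all names are invented). Each GPU thread executes one iteration of what would otherwise be a sequential loop:

module kernels
contains
  attributes(global) subroutine inc(a, n)
    real :: a(*)
    integer, value :: n
    integer :: i
    ! CUDA manages the 'loop index': each thread computes its own i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    ! an if 'branch': threads failing the test simply sit out
    if (i <= n) a(i) = a(i) + 1.0
  end subroutine inc
end module kernels

program main
  use cudafor
  use kernels
  integer, parameter :: n = 100000
  real, device :: a_d(n)     ! array living in GPU memory
  real :: a(n)
  a = 0.0
  a_d = a                    ! host-to-device copy
  ! launch ~n 'iterations' at once: (n+255)/256 blocks of 256 threads
  call inc<<<(n + 255) / 256, 256>>>(a_d, n)
  a = a_d                    ! device-to-host copy
  print *, a(1), a(n)        ! both print 1.0
end program main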
There is no 'magic'. The programmer must be aware that the code will be run on a CUDA device, and write code consciously for it.
CUDA does not take old C/C++ code and auto-magically run the computation across an array of processors. CUDA can compile and run ordinary C and much of C++ sequentially, but there is very little (nothing?) to be gained by that, because it will run sequentially, and more slowly than on a modern CPU. This means the code in some libraries is not (yet) a good match for CUDA's capabilities. A CUDA program could operate on multi-kilobyte bit-vectors simultaneously, but CUDA isn't able to auto-magically convert existing sequential C/C++ library code into something which would do that.
CUDA does provide a relatively straightforward way to write code, using familiar C/C++ syntax, adds a few extra concepts, and generates code which will run across an array of processors. It has the potential to give much more than a 10x speedup vs. e.g. multi-core x86.
Edit - Plans: I do not work for NVIDIA
For the very best performance CUDA wants information at compile time.
So template mechanisms are the most useful, because they give the developer a way to say things at compile time which the CUDA compiler can use. As a simple example, if a matrix is defined (instantiated) at compile time to be 2D and 4 x 8, then the CUDA compiler can work with that to organise the program across the processors. If that size is dynamic and changes while the program is running, it is much harder for the compiler or run-time system to do an efficient job.
EDIT:
CUDA has class and function templates. I apologise if people read this as saying it does not; I agree I was not clear. I believe, though, that the CUDA GPU-side implementation of templates is not complete with respect to C++.
User harrism has commented that my answer is misleading. harrism works for NVIDIA, so I will wait for advice. Hopefully this is already clearer.
The hardest thing to do efficiently across multiple processors is dynamic branching down many alternate paths, because that effectively serialises the code; in the worst case only one processor can execute at a time, which wastes the benefit of a GPU. So virtual functions seem to be very hard to do well.
There are some very smart whole-program-analysis tools which can deduce much more type information than the developer might expect. Existing tools might deduce enough to eliminate virtual functions, and hence move the analysis of branching to compile time. There are also techniques for instrumenting program execution that feed directly back into recompilation, which might reach better branching decisions.
AFAIK (modulo feedback) the CUDA compiler is not yet state-of-the-art in these areas.
(IMHO it is worth a few days for anyone interested, with a CUDA- or OpenCL-capable system, to investigate them and do some experiments. I also think, for people interested in these areas, it is well worth the effort to experiment with Haskell and have a look at Data Parallel Haskell.)
CUDA is a platform (architecture, programming model, assembly virtual machine, compilation tools, etc.), not just a single programming language. CUDA C is just one of a number of language systems built on this platform (CUDA C, CUDA C++, CUDA Fortran, and PyCUDA are others).
CUDA C++
Currently CUDA C++ supports the subset of C++ described in Appendix D ("C/C++ Language Support") of the CUDA C Programming Guide.
To name a few:
Classes
__device__ member functions (including constructors and destructors)
Inheritance / derived classes
virtual functions
class and function templates
operators and overloading
functor classes
Edit: As of CUDA 7.0, CUDA C++ includes support for most language features of the C++11 standard in __device__ code (code that runs on the GPU), including auto, lambda expressions, range-based for loops, initializer lists, static assert, and more.
Examples and specific limitations are also detailed in the same appendix linked above. As a very mature example of C++ usage with CUDA, I recommend checking out Thrust.
Future Plans
(Disclosure: I work for NVIDIA.)
I can't be explicit about future releases and timing, but I can illustrate the trend: almost every release of CUDA has added additional language features to bring CUDA C++ support to its current (in my opinion, very useful) state. We plan to continue this trend of improving support for C++, but naturally we prioritize features that are useful and performant on a massively parallel computational architecture (GPU).
What many don't realize is that CUDA is actually two new programming languages, both derived from C++. One is for writing code that runs on GPUs and is a subset of C++. Its function is similar to HLSL (DirectX) or Cg (OpenGL), but with more features and compatibility with C++. Various GPGPU/SIMT/performance-related concerns apply to it that I need not mention here. The other is the so-called "Runtime API," which is hardly an "API" in the traditional sense. The Runtime API is used to write code that runs on the host CPU. It is a superset of C++ and makes it much easier to link to and launch GPU code. It requires the NVCC pre-compiler, which then calls the platform's C++ compiler. By contrast, the Driver API (and OpenCL) is a pure, standard C library, and is much more verbose to use (while offering few additional features).
Creating a new host-side programming language was a bold move on NVIDIA's part. It makes getting started with CUDA easier and writing code more elegant. However, the truly brilliant part was not marketing it as a new language.
Sometimes you hear that CUDA is C and C++, but I don't think it is, for the simple reason that that is impossible. To cite from their programming guide:
For the host code, nvcc supports whatever part of the C++ ISO/IEC 14882:2003 specification the host C++ compiler supports.
For the device code, nvcc supports the features illustrated in Section D.1 with some restrictions described in Section D.2; it does not support run-time type information (RTTI), exception handling, and the C++ Standard Library.
As far as I can see, it only refers to C++, and only supports C where it happens to lie in the intersection of C and C++. So think of it as C++ with extensions for the device part, rather than C. That will save you a lot of headaches if you are used to C.
What is NVIDIA's plan?
I believe the general trend is that CUDA and OpenCL are regarded as too-low-level techniques for many applications. Right now, NVIDIA is investing heavily in OpenACC, which could roughly be described as OpenMP for GPUs. It follows a declarative approach and tackles the problem of GPU parallelization at a much higher level. So that is my totally subjective impression of NVIDIA's plan.

Compiling FORTRAN IV or convert to FORTRAN 77?

I've got some code I need to run, written in FORTRAN IV. What are my options for running this code? Is there an application out there that can compile and run FORTRAN IV code on a PC? Alternatively, is there a utility to convert FORTRAN IV code to FORTRAN 77 or later? I have little experience in Fortran, and in programming in general.
Thanks for your help.
Intel's Fortran compiler supports Fortran IV. If you don't want to go that way, there are some conversion utilities mentioned in this question, but none of them sounds very promising.
Very few features have been deleted from Fortran, and a few more have been marked as obsolescent in the more recent language standards. But compilers tend to support most or all of these features, because some customers don't want to recode working legacy programs just because they use "bad" features. Sometimes one has to set compiler options to enable some of them. So I'd just pick a compiler and try it on the existing code; there are many to choose from. Maybe get a trial version to see whether it works before paying your money, or use a free one.
Another possible problem is that your code base might use non-standard features. In pre-Fortran-90 days there was less concern with language standards, and some vendors added extra features for user convenience and to differentiate their product. If present, such features might cause greater problems and require recoding.
Probably the most important difference in Fortran IV is that DO loops test the loop variable at the end of the iteration, so the loop body always executes at least once. This was a big issue when the behavior was changed to the current method.
n = 5
do 99 i = n, 4
   write(*,*) i
99 continue
This will print "5" in Fortran IV, and nothing in Fortran 77. Code that relies on this behavior can be hard to port.
I'll just add that if the code does use non-standard features that your compiler doesn't handle, the maintainers at fortranwiki.org keep a nice list of explanations of, and workarounds for, many such constructs on their Modernizing Old Fortran page.
The obvious answer is to run it as-is on MVS 3.8J under the Hercules emulator. MVS 3.8J and IBM FORTRAN G and FORTRAN H are in the public domain.
You will be able to compile and run the code as if you had a real mainframe! The IBM FORTRAN compilers support (indeed, defined) the full FORTRAN IV language, along with IBM extensions.