Getting started with a new code in an unfamiliar language - Fortran

I'm starting a new project in a language I'm less familiar with (FORTRAN) and am in the 'discovery' phase. Normally, reading through and figuring out code is a fairly simple task; however, this code is rather large and not so structured. Are there any methods/tips/tricks/tools for mapping out 50k lines of rather dense code?

Is it Fortran IV (unlikely), 77, or 90/95? The language changed a lot with these revisions. Some of the gotchas that High-Performance Mark listed were common in Fortran IV but uncommon in 77, while others were still common in 77. The suggestion to check the code with maximum compiler warnings is excellent -- even use two compilers.
I'd start by diagramming the subroutine structure, and getting a high-level view of what they do. Then get a deeper understanding of the areas that need to be changed.
There are tools for analyzing and even improving Fortran code, e.g., http://www.polyhedron.com/pf-plusfort0html or http://www.crescentbaysoftware.com/vast_77to90.html.

When I coded Fortran (F77) thirty (yikes!) years ago, we had limited facilities for automatically flowcharting an unknown codebase. It was ugly, and limited to the real estate that a plotter bed could supply. As @Simon mentions, you can (and could back then, with some versions) also use a debugger.
Now, interactive exploration is easier. Additionally, you can experiment with IDEs. I have not personally tried it, as Fortran is no longer my key development language, but Photran is an Eclipse plug-in for Fortran, and appears to be under active development (last release was this month).

The debugger is your friend - if you have one for Fortran.
Before you go too much further, I would suggest familiarising yourself with the basic syntax of the language, plus any foibles like assumptions about variable types from their names, positions of declarations, etc. If you don't get that stuff then you are likely to get very lost, even with a helpful debugger.
Remember as well that structure is sometimes language dependent. The code you are looking at may be badly structured for the languages you are used to, but may be very well structured for Fortran, which has its own set of peculiarities. I think I am just saying: have an open mind to start with; otherwise you'll be carrying around the unnecessary predisposition that the code you are looking at is bad. It may be, but it may just be something you are not used to.
Best of luck. I rather liked Fortran when I programmed in it for a living about 20 years ago, and it is still the language of choice for some applications because of computation speeds on some platforms. Still quite a lot of it in academia.

I always find the starting point of execution, or where other code (that I'm not working on) calls the code I'm examining. Then I just start reading through it from there, following method calls as necessary to figure out what's going on.

Take heart. One of Fortran's virtues is that it is very simple. Unless you find a code which has been programmed to take advantage of 'clever' tricks. I suggest that the first thing you do is to run your program through a compiler with syntax-checking and standards-compliance turned up to the max. Old (pre-Fortran 90) FORTRAN is notorious for the clever tricks that people used to get round the language's limitations. Some of the gotchas for programmers more familiar with modern languages:
-- common blocks; and other mechanisms for global state; especially bad are common blocks which are used to rename and redefine variables;
-- equivalences (horrid, but you might trip over them);
-- fixed-format source form;
-- use of CONTINUE statement, and the practice of having multiple loops ending at the same CONTINUE statement;
-- implicit declaration of variables (to sort these out, insert the line IMPLICIT NONE immediately after the PROGRAM, MODULE, SUBROUTINE or FUNCTION statement everywhere they occur);
-- multiple entry points into sub-programs;
-- and a few others I'm so familiar with I can't recall them.
If these mean nothing to you, they soon will. And finally, you might want to look at Understand for Fortran. It costs, but it's very useful.
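To make the first two of those concrete, here is a minimal, hypothetical FORTRAN 77 fragment (all names invented for illustration): the COMMON block gives the same storage different names in two routines, and the EQUIVALENCE makes a REAL and an INTEGER share memory.

      SUBROUTINE SETUP
C     X, Y AND N LIVE IN SHARED GLOBAL STORAGE
      COMMON /STATE/ X, Y, N
      REAL X, Y
      INTEGER N
      X = 1.0
      Y = 2.0
      N = 3
      END

      SUBROUTINE SOLVE
C     THE SAME STORAGE, SILENTLY RENAMED: A IS X, B IS Y, M IS N
      COMMON /STATE/ A, B, M
      REAL A, B
      INTEGER M
C     EQUIVALENCE: R AND I OCCUPY THE SAME MEMORY LOCATION
      REAL R
      INTEGER I
      EQUIVALENCE (R, I)
      R = A + B
      END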
Regards
Mark

Are you running on Linux or OpenSolaris? If so, the Sun Studio Fortran compiler is one of the best. And the Sun Studio IDE understands Fortran and comes with a debugger. http://developers.sun.com/sunstudio

I'm sleepy so I'll be short :)
Start by grepping out (with grep or whatever tool you use) the PROGRAM statements, SUBROUTINE statements, FUNCTION statements and the like. Maybe modules and such, if F90 is used. Draw a kind of diagram based on that, from which you'll see what calls what (which subroutines and functions each unit uses).
After you've got a general view of the situation, move on to the data. Fortran requires everything to be declared at the start of program/subroutine ... so the first lines should give you the declarations. After you've put those in the diagram you just made, you should have a very clear picture of the situation.
Now, the next step depends really on what you want to do with it.

@ldigas: "Fortran requires everything to be declared at the start of program/subroutine ... "
No, unless IMPLICIT NONE appears at the start of a routine, Fortran uses implicit typing. So variable names starting with A-H and O-Z are typed REAL, and I-N are INTEGER. (Which makes GOD a REAL variable, and INT, luckily, an INTEGER, implicitly.) Remember, Fortran (actually back then it was FORTRAN) was designed by scientists and mathematicians, for whom i, j, k, l, m, and n were ALWAYS integers.
With IMPLICIT NONE, you are forced to explicitly type all variables, as you would in C, C++, or Java. So you could have INTEGER GOD, and REAL INT.
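A minimal sketch of both behaviors, as two separate toy programs (hypothetical names; either should compile with any Fortran 77 or later compiler):

C     WITH IMPLICIT TYPING, NO DECLARATIONS NEEDED:
C     GOD IS REAL, INT IS INTEGER
      PROGRAM IMPLIC
      GOD = 3.14159
      INT = 42
      PRINT *, GOD, INT
      END

C     WITH IMPLICIT NONE EVERY VARIABLE MUST BE DECLARED,
C     SO THE JOKE INVERTS
      PROGRAM EXPLIC
      IMPLICIT NONE
      INTEGER GOD
      REAL INT
      GOD = 42
      INT = 3.14159
      PRINT *, GOD, INT
      END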

Related

Compiling via C/C++ and compiler bugs/limits

I'm looking at the possibility of implementing a high-level (Lisp-like) language by compiling via C (or possibly C++ if exceptions turn out to be useful enough); this is a strategy a number of projects have previously used. Of course, this would generate C code unlike, and perhaps in some dimensions exceeding the complexity of, anything you would write by hand.
Modern C compilers are highly reliable in normal usage, but it's hard to know what bugs might be lurking in edge cases under unusual stresses, particularly if you go over some "well no programmer would ever write an X bigger than Y" hidden limit.
It occurs to me the coincidence of these facts might lead to unhappiness.
Are there any known cases, or is there a good way to find cases, of generated code tripping over edge case bugs/limits in reasonably recent versions of major compilers (e.g. GCC, Microsoft C++, Clang)?
This may not quite be the answer you were looking for, but quite a few years ago I worked on a project where parts of the system were written in a higher-level language in which it was easy to build state machines in processes. This language produced C code that was then compiled. The compiler we were using for this project was gcc (version around 2.95 - don't quote me on that, but pre-3.0 for sure). We did run into a couple of code generation bugs, but from my memory that had more to do with using a not-so-popular processor [revealing which processor may reveal something I shouldn't about the project, so I'd rather not say what it was, even if it was a very long time ago].
A colleague close to me was investigating one of those code generation bugs, which was in a function of around 200k lines, the whole function being one big switch statement, with each case in the switch statement being around 50-1000 lines (with several layers of sub-switch statements inside it).
From my memory of it, the code was crashing because it produced an invalid operation or stored something in a register already occupied by something else, so once you hit the right bit of code, it would fail due to an illegal memory access - and it had nothing to do with the sheer size of the code, because my colleague managed to get it down to about 30 lines of code eventually (after a lot of "let's cut this out and see if it still goes wrong"), and after a few days we had a new version of the compiler with a fix. Nice to know that the many thousands of dollars you pay for the compiler service contract are worth having, at least sometimes...
My point here is that modern compilers tolerate a lot of large code. There are also minimum limits that a compliant compiler must support. For example, I believe (from memory, again) that the compiler needs to support 127 levels of nested statements (that is, a combination of 127 if, switch, while and do-while) within a function. And, from a discussion somewhere (which is where the "the compiler should support 127 levels of nested statements" comes from), we found that MSVC and GCC both support a whole lot more (enough that we gave up on finding the actual limit...)
Short answer:
You have no choice if you need the performance of a compiler rather than the ease of life of an interpreter (or pre-compiler + interpreter). You will be compiling into some lower level language, and C is the assembly language of today, with C++ being about as available and as apt for the task as C. There is no reason why you should fear this route. It is actually, in some sense, a very common route.
Long answer:
Generated code is in no way unusual. Even relatively modest uses of generated code result in C source that is "unlike what any programmer would ever write", in terms of quantity (trivial code patterns repeated with small variations millions of times) or quality (combinations of language features that a human would never use but that are still, say, legal C++). There are also quite a few compilers that compile into C or C++, some famous and well known to the people who wrote the C and C++ language standards.
The most common C and C++ compilers cope well with generated code. Yes, they are never perfect.
Compilers may have various simple limitations, such as maximum code line length; these tend to be documented and easy to comply with in your generator once you run into them.
Compilers may have defects.
Some kinds of defects are actually less of a concern for generated code than for hand-written code. Your code generator actually gives you a degree of freedom to deal with many situations in a systematic way as soon as you start understanding the pattern and caring about the problem.
Defects that actually result in the compiler not compiling valid code correctly are quickly discovered, given enough users of the compiler. They are treated by compiler vendors as particularly high-priority defects. Even if the compiler is essentially dead or serving a niche market, so that no fix is available, plenty of information tends to be available on the Internet, including people's existing experience of working around the defect (different compiler, different code construction, different compiler switches... the solution varies a lot and may feel awkward, but nobody gives up on their job just because some software is buggy, right?). So there is usually a choice of searchable solutions.
It's often good to strive for portability across compilers, but also to know and track the limits of portability. If you have not tested a particular C or C++ compiler very well, do not claim that it would work as part of your toolset.
Your question implies a choice between C and C++. Well, there are more shades of grey here. C++ is a very rich language. You can use almost all C++ features for good purposes within your generator, but in some cases you should ask yourself whether a particular major feature could become a liability, costing you more than it brings you. For example, different compilers use different strategies for template instantiation. Implicit instantiation can lead to unforeseen complexity with portable generated code. If templates do help you design the generator tremendously, don't hesitate to use them; but if you only have a marginal use case for them, remember that you have a better reason than most people to restrict the language you generate into.
There are all sorts of implementation-defined limits in C. Some are well defined and visible to the programmer (think numeric limits), others are not. In my copy of the draft standard, section 5.2.4.1 details the lower bounds on these limits:
5.2.4 Environmental limits
Both the translation and execution environments constrain the implementation of
language translators and libraries. The following summarizes the language-related
environmental limits on a conforming implementation; the library-related limits are
discussed in clause 7.
5.2.4.1 Translation limits
The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits: 18)
— 127 nesting levels of blocks
— 63 nesting levels of conditional inclusion
— 12 pointer, array, and function declarators (in any combinations) modifying an arithmetic, structure, union, or void type in a declaration
[...]
18) Implementations should avoid imposing fixed translation limits whenever possible.
I can't say for sure if your translator is likely to hit any of these or if the C compiler(s) you're targeting will have a problem even if you do, but I think you'll probably be fine.
As for bugs:
Clang - http://llvm.org/bugs/buglist.cgi?bug_status=all&product=clang
GCC - http://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=all&product=gcc
Visual Studio - https://connect.microsoft.com/VisualStudio/feedback
It may sound obvious but the only way to really know is by testing. If you can't do all the effort yourself, at least make your product cross-platform so that people can easily test for you! If people like your project, they are usually willing to submit bug reports or even patches for free :)

How much time would it take to write a C++ compiler using flex/yacc?

How much time would it take to write a C++ compiler using lex/yacc?
Where can I get started with it?
There are many constructs in C++ that cannot be parsed by a bison/yacc parser (for example, distinguishing between a declaration and a function call in some circumstances). Additionally, sometimes the interpretation of tokens requires input from the parser, particularly in C++0x. The handling of the character sequence >>, for example, is crucially dependent on parsing context: it can close two nested template argument lists or be a right-shift operator.
Those two tools are very poor choices for parsing C++ and you would have to put in a lot of special cases that escaped the basic framework those tools rely on in order to correctly parse C++. It would take you a long time, and even then your parser would likely have weird bugs.
yacc and bison are LALR(1) parser generators, which are not sophisticated enough to handle C++ effectively. As other people have pointed out, most C++ compilers now use a recursive descent parser, and several other answers have pointed at good solutions for writing your own.
C++ templates are no good for handling strings, even constant ones (though this may be fixed in C++0x, I haven't researched carefully), but if they were, you could pretty easily write a recursive descent parser in the C++ template language. I find that rather amusing.
It sounds like you're pretty new to parsing/compiler creation. If that's the case, I'd highly recommend not starting with C++. It's a monster of a language.
Either invent a trivial toy language of your own, or do something modeled on something much smaller and simpler. I saw a Lua parser where the grammar definition was about a page long. That'd be a much more reasonable starting point.
It will probably take you years, and you'll probably switch to some other parser generator in the process.
Parsing C++ is notoriously error-prone. The grammar is not fully LR-parsable, as many parts are context-sensitive. You won't be able to get it working right in flex/yacc, or at least it'll be really awkward to implement. There are only two front-ends I know of that get it right. Your best bet is to use one of these and focus on writing the back-end. That's where the interesting stuff is anyway :-).
Existing C++ Front Ends:
The EDG front-end is used by most of the commercial vendors (Intel, Portland Group, etc.) in their compilers. It costs money, but it's very thorough. People pay big bucks for it because they don't want to deal with the pain of writing their own C++ parser.
GCC's C++ front-end is thorough enough for production code, but you'd have to figure out how to integrate this into your project. I believe it's fairly involved to separate it from GCC. This would also be GPL, but I'm not sure whether that's a problem for you. You can use the GCC front-end in your project via gcc_xml, but this will only give you XML for classes, functions, namespaces, and typedefs. It won't give you a syntax tree for the code.
Another possibility is to use clang, but their C++ support is currently spotty. It'll be nice to see them get all the bugs out, but if you look at their C++ status page you'll notice there are more than a few test cases that still break. Take heed -- clang is a big project. If it's taking these guys years to implement a C++ front-end, it's going to take you longer.
Others have mentioned ANTLR, and there is a C++ grammar available for it, but I'm skeptical. I haven't heard of an ANTLR front end being used in any major compilers, though I do believe it's used in the NetBeans IDE. It might be suitable for an IDE, but I'm skeptical that you'd be able to use it on production code.
A long time, and lex and yacc won't help
If you have the skills to write a compiler for such a large language, you will not need the small amount of help that lex and yacc give you. In fact, while lex is OK, it may take longer to use yacc, as it's not really quite powerful enough for C or C++, and you can end up spending far more time getting it to work right than it would take to just write a recursive descent parser.
I believe lex and yacc are best used for simple grammars, or when it is worth the extra effort to have a nicely readable grammar file, perhaps because the grammar is experimental and subject to change.
For that matter, the entire parser is possibly not the major part of your job, depending on exactly what goals you have for the code generator.
As others have already said, yacc is a poor choice for implementing a C++ parser. One can do it; the original GCC did so, before the GCC team got disgusted with how hard it was to maintain and extend. (Flex might be OK as a lexer.)
Some say recursive descent parsers are best, because Bjarne Stroustrup said so. Our experience is that GLR parsing is the right answer for this, and our GLR-based C++ front end is a nice proof, as is the Elsa front end. Our front end has been used in anger on millions of lines of C++ (including the Microsoft and GCC dialects) to carry out program analyses and massive source code transformations.
But what is not emphasized enough is that parsing is just a very small portion of what it takes to build a compiler, especially for C++. You need to also build symbol tables ("what does this identifier mean in this context?") and to do that you need to encode essentially most of several hundred pages of the C++ standard. We believe that the foundation on which we build compiler-like tools, DMS, is extremely good for doing this, and it took us over a man-year to get just this part right.
But then you have the rest of the compiler to consider:
Preprocessor
AST construction
Semantic analysis and type checking
Control, Data flow, and pointer analysis
Basic code generation
Optimizations
Register allocation
Final code generation
Debugging support
I keep saying this: building a parser (the BNF part) for a language is like climbing the foothills of the Himalayas. Building a full compiler is like climbing Everest. Pretty much any clod can do the former (although C++ is right at the edge). Only the really serious do the latter, and only when extremely well prepared.
Expect building a C++ compiler to take you years.
(The SD C++ front end handles lexing, parsing, AST generation, symbol tables, some type checking, and regeneration of compilable source text from the AST, including the original comments, for the major C++ dialects. It has been developed over a period of some 6 years).
EDIT: May, 2015. The original answer was written in 2010; we now have 11 years invested, taking us up through C++14. The point is that it is an endless, big effort to build one of these.
Firstly, the "flex" tag on SO is about Adobe's product, not the lexer generator. Secondly, Bjarne Stroustrup is on record as saying he wished he had implemented Cfront (the first C++ compiler) using recursive descent rather than a table driven tool. And thirdly, to answer your question directly - lots. If you feel you need to write one, take a look at ANTLR - not my favourite tool, but there are already C++ parsers for it.
This is a non-trivial problem, and would take quite a lot of time to do correctly. For one thing, the grammar of C++ cannot be fully parsed by an LALR parser such as yacc. You can do subsets of the language, but getting the entire language specification right is tricky.
You're not the first person to think that this is fun. Here's a nice blog-style article on the topic:
Parsing C++
Here's an important quote from the article:
"After lots of investigation, I
decided that writing a
parser/analysis-tool for C++ is
sufficiently difficult that it's
beyond what I want to do as a hobby."
The problem with that article is that it's a bit old, and several of the links are broken. Here are some links to some other resources on the topic of writing C++ parsers:
ANTLR Grammars (contain several grammars for C++)
A YACC-able C++ 2.1 Grammar and the resulting ambiguities
Parsing and Processing C++ Code (Wikipedia)
Lex and yacc will not be enough; you also need a linker, an assembler, and a C preprocessor.
It depends on how you do it.
How many pre-made components do you plan to use?
You need to get the description of the syntax and its tokens from somewhere.
For example, if you use LLVM, you can proceed faster. It already provides a lot of the tools: assembler, linker, optimiser...
You can get a C preprocessor from the Boost project.
You need to create a test suite to test your compiler automatically.
It can take a year if you work on it each day, or much less if you have talent and motivation.
Unless you have already written several other compilers, C++ is not a language you even want to start writing a compiler from scratch for; the language has a lot of places where the meaning requires a lot of context before the situation can be disambiguated.
Even if you have lots of experience writing compilers you are looking at several years for a team of developers. This is just to parse the code correctly into an intermediate format. Writing the backend to generate code is yet another specialized task (though you could steal the gcc backend).
If you do a google for "C++ grammars" there are a couple around to get you started.
C++ LEX Tokens: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxLexer.l
C++ YACC Grammar: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxTester.y
A few years - if you can get a research grant to re-write a new lex/yacc :-)
People keep chasing their tails on this a lot - starting with Stroustrup, who always fancied being a language "designer" rather than an actual compiler writer (remember that his C++ was a mere codegen for ages, and would still be one if it wasn't for gcc and other folks).
The core issue is that real research on parser generators pretty much ceased to exist ever since CPUs became fast enough to handle functional languages and brute-force recursive descent. Recursive descent is the last resort when you don't know what to do - it does an exhaustive search till it nabs one "rule" that fires. Once you are content with that, you kind of lose interest in researching how to do it efficiently.
What you'd essentially need is a reasonable middle ground - like LALR(2) with fixed, limited backtracking (plus a static checker to yell if the "designer" splurges into a nondeterministic tree), and also limited and partitioned symbol table feedback (modern parsers need to be concurrency-friendly).
Sounds like a research grant proposal, doesn't it :-) Now if we'd find someone to actually fund it, that would be something :-))
A C++ compiler is very complicated. To implement enough of C++ to be compatible with most C++ code out there would take several developers a couple of years full time. clang is a compiler project being funded by Apple to develop a new compiler for C, C++, and Objective-C, with several full-time developers, and the C++ support is still very far from being complete after a couple of years of development.
Recursive descent is a good choice for parsing C++. GCC and clang use it.
The Elsa parser (and my ellcc compiler) use the Elkhound GLR parser generator.
In either case, writing a C++ compiler is a BIG job.
Well, what do you mean by write a compiler?
I doubt any one guy has made a true C++ compiler that took it all the way down to assembly code, but I have used lex and yacc to make a C compiler, and I have done it without them too.
Using both, you can make a compiler that leaves out the semantics in a couple of days, but figuring out how to use them can easily take weeks or months. Figuring out how to make a compiler at all will take weeks or months no matter what, but the figure I remember is that once you know how it works, it took a few days with lex and yacc and a few weeks without, though the second attempt had better results and fewer bugs, so really it's questionable whether they are worth using at all.
The 'semantics' is the actual code production. That can be very simple code that's just enough to work and might not take long at all, or you could spend your whole life doing optimization on it.
With C++ the big issue is templates, but there are so many little issues and rules that I can't imagine someone ever wanting to do this. Even if you DO finish, the problem is that you won't necessarily have binary compatibility, i.e. be able to be recognized as a runnable program by a linker or the OS, because there's more to it than just C++: the standard is hard to pin down, and there are yet more standards to worry about which are even less widely available.

How can I make my own C++ compiler understand templates, nested classes, etc. strong features of C++?

It is a university task in my group to write a compiler for a C-like language. Of course I am going to implement a small part of our beloved C++.
The exact task is absolutely stupid, and the lecturer told us it needs to be self-compilable (should be able to compile itself) - so he meant not to use libraries such as Boost and STL. He also does not want us to use templates because they are hard to implement.
The question is - is it realistic for me, as I'm going to write this project on my own, with the deadline at the end of May - the middle of June (this year), to implement not only templates, but also nested classes, namespaces, and virtual function tables at the level of syntax analysis?
P.S. I am not a noob in C++.
Stick to doing a C compiler.
Believe me, it's hard enough work building a decent C compiler, especially if it's expected to compile itself. Trying to support all the C++ features like nested classes and templates will drive you insane. Perhaps a group could do it, but on your own, I think a C compiler is more than enough to do.
If you are dead set on this, at least implement a C-like language first (so you have something to hand in). Then focus on showing off.
"The exact task is absolutely stupid" - I don't think you're in a position to make that judgment fairly. Better to drop that view.
"I`m going to write this project on my own" - you said it's a group project. Are you saying that your group doesn't want to go along with your view that it should morph into C++, so you're taking off and working on your own? There's another bit I'd recommend changing.
It doesn't matter how knowledgeable you are about C++. Your ability with grammars, parsers, lexers, ASTs, and code generation seems far more germane.
Without knowing more about you or the assignment, I'd say that you'd be doing well to have the original assignment done by the end of May. That's three months away. Stick to the assignment. It might surprise you with its difficulty.
If you finish early, and fulfill your obligation to your team, I'd say you should feel free to modify what's produced to add C++ features.
I'll bet it took Bjarne Stroustrup more than three months to add objects to C. Don't overestimate yourself or underestimate the original assignment.
No problem. And while you're at it, why not implement an operating system for it to run on too.
Follow the assignment. Write a compiler for a C-like language!
What I'd do is select a subset of C. Remove floating-point datatypes and every other feature that isn't necessary in building your compiler.
Writing a C compiler is a lot of work. You won't be able to do that in a couple of months.
Writing a C++ compiler is downright insane. You wouldn't be able to do that in 5 years.
I will like to stress a few points already mentioned and give a few references.
1) STICK TO THE 1989 ANSI C STANDARD WITH NO OPTIMIZATION.
2) Don't worry, with proper guidance, good organization and a fair amount of hard work this is doable.
3) Read The C Programming Language cover to cover.
4) Understand important concepts of compiler development from the Dragon Book.
5) Take a look at lcc both the code as well as the book.
6) Take a look at Lex and Yacc (or Flex and Bison)
7) Writing a C compiler (up to the point it can self compile) is a rite of passage ritual among programmers. Enjoy it.
For a class project, I think that requiring the compiler to be able to compile itself is a bit much to ask. I assume that this is what was meant by "stupid" in the question. It means that you need to figure out in advance exactly how much of C you are going to implement, and stick to that in building the compiler. So you end up building a symbol table using primitives rather than just using an STL map. This might be useful for a data structures course, but misses the point for a compilers course. A compilers course should be about understanding the issues involved with the compiler and choosing which data structures to use, not coding the data structures.
Building a compiler is a wonderful way to really understand what happens to your code once the compiler gets a hold of it. What is the target language? When I took compilers, it took three of us all semester to build a compiler to go from sorta-Pascal to assembly. It's not a trivial task. It's one of those things that seems simple at first, but the more you get into it, the more complicated things get.
You should be able to complete a C-like language within the time frame. Assuming you are taking more than one course, that is exactly what you might be able to do in time. C++ is also doable, but with a lot more extra hours to put in. Expecting to do C++ templates/virtual functions is expecting too much of yourself, and you might fail the assignment altogether. So it's better to stick with a C-subset compiler and finish it in time. You should also consider the time it takes for QA; if you want to be thorough, QA itself will take a good amount of time.
Namespaces, nested classes, and even virtual functions are quite simple at the syntax level; they are just one or two more rules for the parser. Things are much more complicated at higher levels: deciding which function or class to choose (name shadowing, ambiguous names between namespaces, etc.), or compiling to bytecode/running the AST. So you may be able to write these, but if it isn't necessary, skip them and write just a bare functional model.
If you are talking about a complete compiler, with code generation, then forget it. If you just intend to do the lexical & syntactic analysis side of things, then some form of templating may just about be doable in the time frame, depending on what compiler building tools you use.

Converting C source to C++

How would you go about converting a reasonably large (>300K), fairly mature C codebase to C++?
The kind of C I have in mind is split into files roughly corresponding to modules (i.e. less granular than a typical OO class-based decomposition), using internal linkage in lieu of private functions and data, and external linkage for public functions and data. Global variables are used extensively for communication between the modules. There is a very extensive integration test suite available, but no unit (i.e. module) level tests.
I have in mind a general strategy:
Compile everything in C++'s C subset and get that working.
Convert modules into huge classes, so that all the cross-references are scoped by a class name, but leaving all functions and data as static members, and get that working.
Convert huge classes into instances with appropriate constructors and initialized cross-references; replace static member accesses with indirect accesses as appropriate; and get that working.
Now, approach the project as an ill-factored OO application, and write unit tests where dependencies are tractable, and decompose into separate classes where they are not; the goal here would be to move from one working program to another at each transformation.
Obviously, this would be quite a bit of work. Are there any case studies / war stories out there on this kind of translation? Alternative strategies? Other useful advice?
Note 1: the program is a compiler, and probably millions of other programs rely on its behaviour not changing, so wholesale rewriting is pretty much not an option.
Note 2: the source is nearly 20 years old, and has perhaps 30% code churn (lines modified + added / previous total lines) per year. It is heavily maintained and extended, in other words. Thus, one of the goals would be to increase maintainability.
[For the sake of the question, assume that translation into C++ is mandatory, and that leaving it in C is not an option. The point of adding this condition is to weed out the "leave it in C" answers.]
Having just started on pretty much the same thing a few months ago (on a ten-year-old commercial project, originally written with the "C++ is nothing but C with smart structs" philosophy), I would suggest using the same strategy you'd use to eat an elephant: take it one bite at a time. :-)
As much as possible, split it up into stages that can be done with minimal effects on other parts. Building a facade system, as Federico Ramponi suggested, is a good start -- once everything has a C++ facade and is communicating through it, you can change the internals of the modules with fair certainty that they can't affect anything outside them.
We already had a partial C++ interface system in place (due to previous smaller refactoring efforts), so this approach wasn't difficult in our case. Once we had everything communicating as C++ objects (which took a few weeks, working on a completely separate source-code branch and integrating all changes to the main branch as they were approved), it was very seldom that we couldn't compile a totally working version before we left for the day.
The change-over isn't complete yet -- we've paused twice for interim releases (we aim for a point-release every few weeks), but it's well on the way, and no customer has complained about any problems. Our QA people have only found one problem that I recall, too. :-)
What about:
Compiling everything in C++'s C subset and get that working, and
Implementing a set of facades leaving the C code unaltered?
Why is "translation into C++ mandatory"? You can wrap the C code without the pain of converting it into huge classes and so on.
Your application has lots of folks working on it, and a need to not be broken. If you are serious about large-scale conversion to an OO style, what you need is massive transformation tools to automate the work. The basic idea is to designate groups of data as classes, and then get the tool to refactor the code to move that data into classes, move functions on just that data into those classes, and revise all accesses to that data into calls on the classes. You can do an automated preanalysis to form statistical clusters to get some ideas, but you'll still need an application-aware engineer to decide what data elements should be grouped.
A tool that is capable of doing this task is our DMS Software Reengineering Toolkit. DMS has strong C parsers for reading your code, captures the C code as compiler abstract syntax trees, and (unlike a conventional compiler) can compute flow analyses across your entire 300K SLOC. DMS has a C++ front end that can be used as the "back" end; one writes transformations that map C syntax to C++ syntax.
A major C++ reengineering task on a large avionics system gives some idea of what using DMS for this kind of activity is like. See the technical papers at www.semdesigns.com/Products/DMS/DMSToolkit.html, specifically Re-engineering C++ Component Models Via Automatic Program Transformation.
This process is not for the faint of heart. But then, anybody who would consider manual refactoring of a large application is already not afraid of hard work.
Yes, I'm associated with the company, being its chief architect.
I would write C++ classes over the C interface. Not touching the C code will decrease the chance of messing up and quicken the process significantly.
Once you have your C++ interface up; then it is a trivial task of copy+pasting the code into your classes. As you mentioned - during this step it is vital to do unit testing.
GCC is currently in mid-transition from C to C++. They started by moving everything into the common subset of C and C++, obviously. As they did so, they added warnings to GCC for everything they found, available under -Wc++-compat. That should get you through the first part of your journey.
For the latter parts, once you actually have everything compiling with a C++ compiler, I would focus on replacing things that have idiomatic C++ counterparts. For example, if you're using lists, maps, sets, bitvectors, hashtables, etc., which are defined using C macros, you will likely gain a lot by moving these to C++. Likewise with OO, you'll likely find benefits where you are already using a C OO idiom (like struct inheritance), and where C++ will afford greater clarity and better type checking of your code.
Your list looks okay except I would suggest reviewing the test suite first and trying to get that as tight as possible before doing any coding.
Let's throw another stupid idea:
Compile everything in C++'s C subset and get that working.
Start with a module, convert it into a huge class, then into an instance, and build a C interface (identical to the one you started from) out of that instance. Let the remaining C code work with that C interface.
Refactor as needed, growing the OO subsystem out of C code one module at a time, and drop parts of the C interface when they become useless.
Probably two things to consider besides how you want to start are on what you want to focus, and where you want to stop.
You state that there is a lot of code churn; this may be key to focusing your efforts. I suggest you pick the parts of your code where a lot of maintenance is needed; the mature/stable parts are apparently working well enough, so it is better to leave them as they are, except probably for some window dressing with facades etc.
Where you want to stop depends on what the reason is for wanting to convert to C++. This can hardly be a goal in itself. If it is due to some 3rd party dependency, focus your efforts on the interface to that component.
The software I work on is a huge, old code base which was 'converted' from C to C++ years ago now. I think it was because the GUI was converted to Qt. Even now it still mostly looks like a C program with classes. Breaking the dependencies caused by public data members, and refactoring the huge classes with procedural monster methods into smaller methods and classes, has never really taken off, I think for the following reasons:
There is no need to change code that is working and that does not need to be enhanced. Doing so introduces new bugs without adding functionality, and end users don't appreciate that;
It is very, very hard to refactor reliably. Many pieces of code are so large and so vital that people hardly dare touch them. We have a fairly extensive suite of functional tests, but sufficient code coverage information is hard to get. As a result, it is difficult to establish whether there are already sufficient tests in place to detect problems during refactoring;
The ROI is difficult to establish. The end user will not benefit from refactoring, so the gain must come from reduced maintenance cost, which will initially increase, because refactoring introduces new bugs into mature, i.e. fairly bug-free, code. And the refactoring itself will be costly as well...
NB. I suppose you know the "Working effectively with Legacy code" book?
You mention that your tool is a compiler, and that: "Actually, pattern matching, not just type matching, in the multiple dispatch would be even better".
You might want to take a look at maketea. It provides pattern matching for ASTs, as well as the AST definition from an abstract grammar, and visitors, transformers, etc.
If you have a small or academic project (say, less than 10,000 lines), a rewrite is probably your best option. You can factor it however you want, and it won't take too much time.
If you have a real-world application, I'd suggest getting it to compile as C++ (which usually means primarily fixing up function prototypes and the like), then work on refactoring and OO wrapping. Of course, I don't subscribe to the philosophy that code needs to be OO structured in order to be acceptable C++ code. I'd do a piece-by-piece conversion, rewriting and refactoring as you need to (for functionality or for incorporating unit testing).
Here's what I would do:
Since the code is 20 years old, scrap the parser/syntax analyzer and replace it with newer lex/yacc/bison (or similar) based C++ code, which is much more maintainable and easier to understand. It is faster to develop, too, if you have a BNF handy.
Once this is retrofitted to the old code, start wrapping modules into classes. Replace global/shared variables with interfaces.
Now what you have will be a compiler in C++ (not quite though).
Draw a class diagram of all the classes in your system, and see how they are communicating.
Draw another one using the same classes and see how they ought to communicate.
Refactor the code to transform the first diagram to the second. (this might be messy and tricky)
Remember to use C++ code for all new code added.
If you have some time left, try replacing data structures one by one to use the more standardized STL or Boost.

Learning FORTRAN In the Modern Era

I've recently come to maintain a large amount of scientific calculation-intensive FORTRAN code. I'm having difficulties getting a handle on all of the, say, nuances of a forty-year-old language, despite Google and two introductory-level books. The code is rife with "performance enhancing improvements". Does anyone have any guides or practical advice for de-optimizing FORTRAN into CS 101 levels? Does anyone have knowledge of how FORTRAN code optimization operated? Are there any typical FORTRAN 'gotchas' that might not occur to a Java/C++/.NET raised developer taking over a FORTRAN 77/90 codebase?
You kind of have to get a "feel" for what programmers had to do back in the day. The vast majority of the code I work with is older than I am and ran on machines that were "new" when my parents were in high school.
Common FORTRAN-isms I deal with that hurt readability are:
Common blocks
Implicit variables
Two or three DO loops with shared CONTINUE statements
GOTOs in place of DO loops
Arithmetic IF statements
Computed GOTOs
Equivalence REAL/INTEGER/other in some common block
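For anyone who hasn't met these constructs, here is a small hypothetical fragment in the old style, packing in an arithmetic IF, a computed GOTO, and two DO loops sharing one CONTINUE; the strategies below show where each one goes:

      SUBROUTINE OLDIE(K, N, A)
      INTEGER K, N
      REAL A(N, N)
C     ARITHMETIC IF: BRANCH TO 10, 20 OR 30 WHEN K IS <0, =0 OR >0
      IF (K) 10, 20, 30
   10 K = -K
   20 CONTINUE
   30 CONTINUE
C     COMPUTED GOTO: JUMP TO THE K-TH LABEL IN THE LIST
      GOTO (40, 50) K
   40 A(1, 1) = 1.0
   50 CONTINUE
C     TWO DO LOOPS TERMINATING ON THE SAME CONTINUE
      DO 60 I = 1, N
      DO 60 J = 1, N
      A(I, J) = A(I, J) + 1.0
   60 CONTINUE
      END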
Strategies for solving these involve:
Get Spag / plusFORT, worth the money, it solves a lot of them automatically and Bug Free(tm)
Move to Fortran 90 if at all possible, if not move to free-format Fortran 77
Add IMPLICIT NONE to each subroutine and then fix every compile error, time consuming but ultimately necessary, some programs can do this for you automatically (or you can script it)
Move all COMMON blocks to MODULEs; low-hanging fruit, worth it
Convert arithmetic IF statements to IF..ELSEIF..ELSE blocks (see the sketch after this list)
Convert computed GOTOs to SELECT CASE blocks
Convert all DO loops to the newer F90 syntax
myloop: do ii = 1, nloops
   ! do something
enddo myloop
Convert equivalenced common block members to either ALLOCATABLE memory allocated in a module, or to their true character routines if it is Hollerith being stored in a REAL
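As a hypothetical sketch of several of the conversions above (IF..ELSEIF, SELECT CASE, and named DO loops), here is the same sort of fragment in the modern style:

subroutine modern(k, n, a)
   implicit none
   integer :: k, n, i, j
   real :: a(n, n)

   ! the arithmetic IF becomes a plain IF block
   if (k < 0) then
      k = -k
   endif

   ! the computed GOTO becomes a SELECT CASE
   select case (k)
   case (1)
      a(1, 1) = 1.0
   case (2)
      ! nothing to do for this branch
   end select

   ! the shared CONTINUE becomes properly nested, named loops
   rows: do i = 1, n
      cols: do j = 1, n
         a(i, j) = a(i, j) + 1.0
      enddo cols
   enddo rows
end subroutine modern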
If you had more specific questions as to how to accomplish some readability tasks, I can give advice. I have a code base of a few hundred thousand lines of Fortran which was written over the span of 40 years that I am in some way responsible for, so I've probably run across any "problems" you may have found.
Legacy Fortran Soapbox
I helped maintain/improve a legacy Fortran code base for quite a while, and for the most part I think sixlettervariables is on the money. That advice, though, tends to the technical; a tougher row to hoe is implementing "good practices".
Establish a required coding style and coding guidelines.
Require a code review (of more than just the coder!) for anything submitted to the code base. (Version control should be tied to this process.)
Start building and running unit tests; ditto benchmark or regression tests.
These might sound like obvious things these days, but at the risk of over-generalizing, I claim that most Fortran code shops have an entrenched culture, some started before the term "software engineering" even existed, and that over time what comes to dominate is "Get it done now". (This is not unique to Fortran shops by any means.)
Embracing Gotchas
But what to do with an already existing, grotty old legacy code base? I agree with Joel Spolsky on rewriting: don't. However, in my opinion sixlettervariables does point to the allowable exception: use software tools to transition to better Fortran constructs. A lot can be caught/corrected by code analyzers (FORCHECK) and code rewriters (plusFORT). If you have to do it by hand, make sure you have a pressing reason. (I wish I had on hand a reference to the number of software bugs that came from fixing software bugs; it is humbling. I think some such statistic is in Expert C Programming.)
Probably the best offense in winning the game of Fortran gotchas is having the best defense: Knowing the language fairly well. To further that end, I recommend ... books!
Fortran Dead Tree Library
I have had only modest success as a "QA nag" over the years, but I have found that education does work, sometimes inadvertently, and that one of the most influential things is a reference book that someone has on hand. I love and highly recommend
Fortran 90/95 for Scientists and Engineers, by Stephen J. Chapman
The book is even good with Fortran 77 in that it specifically identifies the constructs that shouldn't be used and gives the better alternatives. However, it is actually a textbook and can run out of steam when you really want to know the nitty-gritty of Fortran 95, which is why I recommend
Fortran 90/95 Explained, by Michael Metcalf & John K. Reid
as your go-to reference (sic) for Fortran 95. Be warned that it is not the most lucid writing, but the veil will lift when you really want to get the most out of a new Fortran 95 feature.
For focusing on the issues of going from Fortran 77 to Fortran 90, I enjoyed
Migrating to Fortran 90, by Jim Kerrigan
but the book is now out of print. (I just don't understand O'Reilly's use of Safari: why isn't every one of their out-of-print books available?)
Lastly, as to the heir to the wonderful, wonderful classic, Software Tools, I nominate
Classical FORTRAN, by Michael Kupferschmid
This book not only shows what one can do with "only" Fortran 77, but it also talks about some of the more subtle issues that arise (e.g., should or should not one use the EXTERNAL declaration). This book doesn't exactly cover the same space as "Software Tools" but they are two of the three Fortran programming books that I would tag as "fun".... (here's the third).
Miscellaneous Advice that applies to almost every Fortran compiler
There is a compiler option to enforce IMPLICIT NONE behavior, which you can use to identify problem routines without modifying them with the IMPLICIT NONE declaration first. This piece of advice won't seem meaningful until after the first time a build bombs because of an IMPLICIT NONE command inserted into a legacy routine. (What? Your code review didn't catch this? ;-)
There is a compiler option for array bounds checking, which can be useful when debugging Fortran 77 code.
Fortran 90 compilers should be able to compile almost all Fortran 77 code and even older Fortran code. Turn on the reporting options on your Fortran 90 compiler, run your legacy code through it and you will have a decent start on syntax checking. Some commercial Fortran 77 compilers are actually Fortran 90 compilers that are running in Fortran 77 mode, so this might be relatively trivial option twiddling for whatever build scripts you have.
There's something in the original question that I would caution about. You say the code is rife with "performance enhancing improvements". Since Fortran problems are generally of a scientific and mathematical nature, do not assume these performance tricks are there to improve the compilation. It's probably not about the language. In Fortran, the solution is seldom about efficiency of the code itself but of the underlying mathematics to solve the end problem. The tricks may make the compilation slower, may even make the logic appear messy, but the intent is to make the solution faster. Unless you know exactly what it is doing and why, leave it alone.
Even simple refactoring, like changing dumb looking variable names can be a big pitfall. Historically standard mathematical equations in a given field of science will have used a particular shorthand since the days of Maxwell. So to see an array named B(:) in electromagnetics tells all Emag engineers exactly what is being solved for. Change that at your peril. Moral, get to know the standard nomenclature of the science before renaming too.
As someone with experience in both FORTRAN (the 77 flavor, although it has been a while since I used it seriously) and C/C++, the item to watch out for that immediately jumps to mind is arrays. FORTRAN arrays start with an index of 1 instead of 0 as they do in C/C++/Java. Also, the memory arrangement is reversed: Fortran stores arrays column-major, so incrementing the first index gives you sequential memory locations.
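To make the layout point concrete, here is a small hypothetical example: in Fortran the first index varies fastest in memory, so the inner loop should run over the first index (the opposite of the usual C nesting):

program colmajor
   implicit none
   integer, parameter :: n = 100
   real :: a(n, n)
   integer :: i, j
   ! a(1,1), a(2,1), a(3,1), ... are adjacent in memory (column-major),
   ! so the inner loop walks sequential memory locations
   do j = 1, n
      do i = 1, n
         a(i, j) = real(i + j)
      enddo
   enddo
   print *, a(1, 1), a(n, n)
end program colmajor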
My wife still uses FORTRAN regularly and has some C++ code she needs to work with now that I'm about to start helping her with. As issues come up during her conversion I'll try to point them out. Maybe they will help.
I have used Fortran starting with the '66 version since 1967 (on an IBM 7090 with 32k words of memory). I then used PL/1 for some time, but later went back to Fortran 95 because it is ideally suited for the matrix/complex-number problems we have. I would like to add to the considerations that much of the convoluted structure of old codes is simply due to the small amount of memory available, forcing such things as reusing a few lines of code via computed or assigned GOTOs. Another problem is optimization by defining auxiliary variables for every repeated subexpression - compilers simply did not optimize for that. In addition, it was not allowed to write DO i=1,n+1; you had to write n1=n+1; DO i=1,n1. As a consequence, old codes are overwhelmed with superfluous variables. When I rewrote a code in Fortran 95, only 10% of the variables survived. If you want to make the code more legible, I highly recommend looking for variables that can easily be eliminated.
Another thing I might mention is that for many years complex arithmetic and multidimensional arrays were highly inefficient. That is why you often find code rewritten to do complex calculations using only real variables, and matrices addressed with a single linear index.
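Here is a hypothetical sketch of that single-linear-index idiom next to its modern two-dimensional equivalent; both statements address the same element under column-major layout:

program linidx
   implicit none
   integer, parameter :: n = 100
   real :: aflat(n*n)   ! old style: matrix flattened to one linear index
   real :: a2d(n, n)    ! modern style: a true two-dimensional array
   integer :: i, j
   i = 3
   j = 5
   aflat((j-1)*n + i) = 1.0   ! element (i,j) by hand, column-major arithmetic
   a2d(i, j) = 1.0            ! the same element, letting the compiler index
   print *, aflat((j-1)*n + i), a2d(i, j)
end program linidx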
Well, in one sense, you're lucky, 'cause Fortran doesn't have much in the way of subtle flow-of-control constructs or inheritance or the like. On the other, it's got some truly amazing gotchas, like the arithmetically-calculated branch-to-numeric-label stuff, the implicitly-typed variables which don't require declaration, the lack of true keywords.
I don't know about the "performance enhancing improvements". I'd guess most of them are probably ineffective, as a couple of decades of compiler technology have made most hinting unnecessary. Unfortunately, you'll probably have to leave things the way they are, unless you're planning to do a massive rewrite.
Anyway, the core scientific calculation code should be fairly readable. Any programming language using infix arithmetic would be good preparation for reading Fortran's arithmetic and assignment code.
Could you explain what you have to do in maintaining the code? Do you really have to modify the code? If you can get away by modifying just the interface to that code instead of the code itself, that would be the best.
The inherent problem when dealing with a large scientific code (not just FORTRAN) is that the underlying mathematics and the implementation are both complex. Almost by default, the implementation has to include code optimization in order to run within a reasonable time frame. This is compounded by the fact that a lot of code in this field is created by scientists/engineers who are experts in their field, but not in software development. Let's just say that "easy to understand" is not their first priority (I was one of them, still learning to be a better software developer).
Due to the nature of the problem, I don't think a general question and answer is enough to be helpful. I suggest you post a series of specific questions with code snippet attached. Perhaps starting with the one that gives you the most headache?
I loved FORTRAN, I used to teach and code in it. Just wanted to throw that in. Haven't touched it in years.
I started out in COBOL, when I moved to FORTRAN I felt I was freed. Everything is relative, yeah?
I'd second what has been said above - recognise that this is a PROCEDURAL language - no subtleties - so take it as you see it.
It will probably frustrate you to start with.
I started on Fortran IV (WATFIV) on punch cards, and my early working years were VS FORTRAN v1 (IBM, Fortran 77 level). Lots of good advice in this thread.
I would add that you have to distinguish between things done to get the beast to run at all, versus things that "optimize" the code, versus things that are more readable and maintainable. I can remember dealing with VAX overlays in trying to get DOE simulation code to run on IBM with virtual memory (they had to be removed and the whole thing turned into one address space).
I would certainly start by carefully restructuring FORTRAN IV control structures to at least FORTRAN 77 level, with proper indentation and commenting. Try to get rid of primitive control structures like ASSIGN and COMPUTED GOTO and arithmetic IF, and of course, as many GOTOs as you can (using IF-THEN-ELSE-ENDIF). Definitely use IMPLICIT NONE in every routine, to force you to properly declare all variables (you wouldn't believe how many bugs I caught in other people's code -- typos in variable names). Watch out for "premature optimizations" that you're better off letting the compiler handle by itself.
If this code is to continue to live and be maintainable, you owe it to yourself and your successors to make it readable and understandable. Just be certain of what you are doing as you change the code! FORTRAN has lots of peculiar constructs that can easily trip up someone coming from the C side of the programming world. Remember that FORTRAN dates back to the mid-to-late '50s, when there was no such thing as a science of language and compiler design, just ad hoc hacking together of something (sorry, Dr. B!).
Here's another one that has bitten me from time to time. When you are working on FORTRAN code, make sure you skip all six initial columns. Every once in a while, I'll only get the code indented five spaces and nothing works. At first glance everything seems okay, and then I finally realize that all the lines are starting in column 6 instead of column 7.
For anyone not familiar with FORTRAN: the first 5 columns are for statement labels (line numbers), the 6th column is for a continuation character in case a statement doesn't fit on one line (put any character there and the compiler knows that this line is actually part of the one before it), and code always starts in column 7 and must end by column 72.
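A minimal, hypothetical fixed-form example showing that layout (the first comment line is a column ruler; the & in column 6 marks a continuation):

C23456789 COLUMNS 1-5: LABEL, COLUMN 6: CONTINUATION, 7-72: STATEMENTS
      PROGRAM COLS
  100 X = 1.0
     &      + 2.0
      PRINT *, X
      END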