Guidelines for writing flexible software? - c++

I've been developing an interpreter in C++ for my (esoteric, if you want) programming language some time now. One of the main things that I have noticed: I start with a flexible concept, and the further I code (Tokenizer->Parser->Interpreter) the less flexible the whole system gets.
For example: I didn't implement an include function at first, yet the interpreter was already up and running - I had extreme difficulties implementing it and it was just like "patching something out" later on. My system had lost flexibility very quickly.
How can I learn to keep relatively small C++ projects as flexible and extensible as possible during development?

If you need to keep
C++ projects as flexible and extensible as possible during development
then you haven't got a product specification, you have no real goal and no way of defining a finished product.
For a commercial product this is the worst situation to be in. To paraphrase one well known blogger (can't remember who) "you haven't got a product until you define what you aren't going to do."
For personal projects this might not be a problem. Chalk it up to experience and remember for future reference. Refactor and move on.

Define the structure of the project before you start coding. Outline your main objectives and think about how can you achieve that.
Code the headers.
Look if it's possible to implement every feature using this set of interfaces
If no -> go back to (2)
If yes -> code .cpp files
Enjoy.
Of course, this doesn't apply to really large projects. But if your design is modular, there shouldn't be any problems to divide the project into separate parts.

Don't fear Evolution (Refactoring).
If there are many class that fit a theme, create a common base class.
Instead of hard coding data members, use pointers to an abstract base class.
For example, instead of using std::ifstream use std::istream.
In my project, I have abstract classes for Reading and Writing. Classes that support reading and writing use these interfaces. I can pass specialized readers to these classes without changing any code. A data base reader would inherit from the base Reader class, and thus can be used anywhere a reader is used.

Related

Organizing large C++ project

Should all C++ code in a project be encapsulated into a single class with main simply calling that class? Or should the main function declare variables and classes.
If you are going to build a large project in C++, you should at the very least read Large Scale C++ Software Design by John Lakos about it. It's a little old but it sounds like you could benefit from the fundamentals in it.
Keep in mind that building a large scale system in any language is a challenge and requires skill and discipline to prevent it falling to pieces very quickly. Don't take it lightly.
That said, if your definition of "large" is different than mine then I may have alternative advice to give you. I'm assuming you're talking about a project where the word "million" will be mentioned in sentences that also contain the words "lines of code".
for large C++ projects, you should create many classes!
main should just kick things off (maybe doing a few housekeeping things) and then calling into a class that will fire up the rest of the system
If it's a class that really makes sense, sure -- but at least IME, that's a fairly rare exception, not the general rule.
Here, I'm presuming that you don't really mean all the code is in one class, but that there's a single top-level class, so essentially all main does is instantiate and use it. That class, in turn, will presumably instantiate and use other subordinate classes.
If you really mean "should all the code being contained in a single class?", then the answer is almost certainly a resounding "no", except for truly minuscule projects. Much of the design of classes in C++ (and most other OO languages) is completely pointless if all the code is in one class.
If you can put your entire project in one class without going insane, your definition of "large" may be different than most people's here. Which is fine -- just keep in mind when you ask people about a "large" c++ project, they will assume you're talking about something that takes multiple person-years to create.
That said, the same principles of encapsulation apply no matter what the size of the project. Break your logic and data into units that make sense and are not too tied together and then organize your class(es) around those divisions. Don't be afraid to try one organization and then refactor it into another organization if you find yourself copy-pasting code, or if you find one class depending too heavily on another. (Or if you find yourself with too many classes and you're creating many objects to accomplish one task where a single object would be cleaner and easier on you.)
Have fun and don't be afraid to experiment a little.
In C++ you should avoid putting entire project in one class, irrespective of big or small. At the max you can try putting it in 1 or 2 namespace (which can be split across the files).
The advantage of having multiple classes are,
Better maintainability of your code
Putting classes in multiple .h and .cpp files (i.e. small modules) help you fast debugging
If all code is in one class and changes are made somewhere then one has to compile whole project. Instead, if project is across modules, one can just compile the module where changes are made. It saves time a lot.
No! Each header/implementation file pair should represent a single class. Placing a large project in one file is a surefire way to disaster: the project become unmaintainable and compiling will take ages. Break up your code in to appropriately sized pieces.
The main function should not declare the classes, rather, the file it contains (often named something like main.cpp, driver.cpp, projectname.cpp) should use #include directives to make the compiler read the declarations in header files. Read up on C++'s separate compilation model for more info.
Some newcomers to C++ find the compilation model - as well as error codes generated when you screw it up - incomprehensible or intimidating and give up thinking it's not worth it. Don't let this be you. Learn how to properly organize your code.

Is it better to have lot of interfaces or just one?

I have been working on this plugin system. I thought I passed design and started implementing. Now I wonder if I should revisit my design. my problem is the following:
Currently in my design I have:
An interface class FileNameLoader for loading the names of all the shared libraries my application needs to load. i.e. Load all files in a directory, Load all files specified in a XML file, Load all files user inputs, etc.
An Interface class LibLoader that actually loads the shared object. This class is only responsible for loading a shared object once its file name has been given. There are various ways one may need to load a shared lib. i.e. Use RTLD_NOW/RTLD_LAZY...., check if lib has been already loaded, etc.
An ABC Plugin which loads the functions I need from a handle to a library once that handle is supplied. There are so many ways this could change.
An interface class PluginFactory which creates Plugins.
An ABC PluginLoader which is the mother class which manages everything.
Now, my problem is I feel that FileNameLoader and LibLoader can go inside Plugin. But this would mean that if someone wanted to just change RTLD_NOW to RTLD_LAZY he would have to change Plugin class. On the other hand, I feel that there are too many classes here. Please give some input. I can post the interface code if necessary. Thanks in advance.
EDIT:
After giving this some thought, I have come to the conclusion that more interfaces is better (In my scenario at least). Suppose there are x implementations of FileNameLoader, y implementations of LibLoader, z implementations of Plugin. If I keep these classes separate, I have to write x + y + z implementation classes. Then I can combine them to get any functionality possible. On the other hand, if all these interfces were in Plugin class, I'd have to write x*y*z implementation classes to get all the possible functionalities which is larger than x + y + z given that there are at least 2 implementations for an interface. This is just one side of it. The other advantage is, the purpose of the interfaces are more clearer when there are more interfaces. At least that is what I think.
My c++ projects generally consists of objects that implement one or more interfaces.
I have found that this approach has the following effects:
Use of interfaces enforces your design.
(my opinion only) ensures a better program design.
Related functionality is grouped into interfaces.
The compiler will let you know if your implementation of the interface is incomplete or incorrect (good for changes to interfaces).
You can pass interface pointers around instead of entire objects.
Passing around interface pointers has the benefit that you're exposing only the functionality required to other objects.
COM employs the use of interfaces heavily, as its modular design is useful for IPC (inter process communication), promotes code reuse and enable backwards compatiblity.
Microsoft use COM extensively and base their OS and most important APIs (DirectX, DirectShow, etc.) on COM, for these reasons, and although it's hardly the most accessible technology, COM's not going away any time soon.
Will these aid your own program(s)? Up to you. If you're going to turn a lot of your code into COM objects, it's definitely the right approach.
The other good stuff you get with interfaces that I've mentioned - make your own judgement as to how useful they'll be to you. Personally, I find interfaces indispensable.
Generally the only time I provide more than one interface, it will be because I have two completely different kinds of clients (eg: clients and The Server). In that case, yes it is perfectly OK.
However, this statement worries me:
I thought I passed design and started
implementing
That's old-fashioned Waterfall thinking. You never are done designing. You will almost always have to do a fairly major redesign the first time a real client tries to use your class. Thereafter every now and then you'll discover edge cases of client use that require (or would greatly benifit by) an extra new call or two, or a slightly different approach to all the calls.
You might be interested in the Interface Segregation Principle, which results in more, smaller interfaces.
"Clients should not be forced to depend on interfaces that they do not use."
More detail on this principle is provided by this paper: http://www.objectmentor.com/resources/articles/isp.pdf
This is part of the Bob Martin's synergistic SOLID principles.
There isn't a golden rule. It'll depend on the scenario, and even then you may find in the future some assumptions have changed and you need to update it accordingly.
Personally I like the way you have it now. You can replace at the top level, or very specific pieces.
Having the One Big Class That Does Everything is wrong. So is having One Big Interface That Defines Everything.

How should I design a mechanism in C++ to manage relatively generic entities within a simulation?

I would like to start my question by stating that this is a C++ design question, more then anything, limiting the scope of the discussion to what is accomplishable in that language.
Let us pretend that I am working on a vehicle simulator that is intended to model modern highway systems. As part of this simulation, entities will be interacting with each other to avoid accidents, stop at stop lights and perhaps eventually even model traffic enforcement with radar guns and subsequent exciting high speed chases.
Being a spatial simulation written in C++, it seems like it would be ideal to start with some kind of Vehicle hierarchy, with cars and trucks deriving from some common base class. However, a common problem I have run in to is that such a hierarchy is usually very rigidly defined, and introducing unexpected changes - modeling a boat for instance - tends to introduce unexpected complexity that tends to grow over time into something quite unwieldy.
This simple aproach seems to suffer from a combinatoric explosion of classes. Imagine if I created a MoveOnWater interface and a MoveOnGround interface, and used them to define Car and Boat. Then lets say I add RadarEquipment. Now I have to do something like add the classes RadarBoat and RadarCar. Adding more capabilities using this approach and the whole thing rapidly becomes quite unreasonable.
One approach I have been investigating to address this inflexibility issue is to do away with the inheritance hierarchy all together. Instead of trying to come up with a type safe way to define everything that could ever be in this simulation, I defined one class - I will call it 'Entity' - and the capabilities that make up an entity - can it drive, can it fly, can it use radar - are all created as interfaces and added to a kind of capability list that the Entity class contains. At runtime, the proper capabilities are created and attached to the entity and functions that want to use these interfaced must first query the entity object and check for there existence. This approach seems to be the most obvious alternative, and is working well for the time being. I, however, worry about the maintenance issues that this approach will have. Effectively any arbitrary thing can be added, and there is no single location in which all possible capabilities are defined. Its not a problem currently, when the total number of things is quite small, but I worry that it might be a problem when someone else starts trying to use and modify the code.
As one potential alternative, I pondered using the template system to achieve type safe while keeping the same kind of flexibility. I imagine I could create entities that inherited whatever combination of interfaces I wanted. Using these objects would entail creating a template class or function that used any combination of the interfaces. One example might be the simple move on road using just the MoveOnRoad interface, whereas more complex logic, like a "high speed freeway chase", could use methods from both MoveOnRoad and Radar interfaces.
Of course making this approach usable mandates the use of boost concept check just to make debugging feasible. Also, this approach has the unfortunate side effect of making "optional" interfaces all but impossible. It is not simple to write a function that can have logic to do one thing if the entity has a RadarEquipment interface, and do something else if it doesn't. In this regard, type safety is somewhat of a curse. I think some trickery with boost any may be able to pull it off, but I haven't figured out how to make that work and it seems like way to much complexity for what I am trying to achieve.
Thus, we are left with the dynamic "list of capabilities" and achieving the goal of having decision logic that drives behavior based on what the entity is capable of becomes trivial.
Now, with that background in mind, I am open to any design gurus telling me where I err'd in my reasoning. I am eager to learn of a design pattern or idiom that is commonly used to address this issue, and the sort of tradeoffs I will have to make.
I also want to mention that I have been contemplating perhaps an even more random design. Even though I my gut tells me that this should be designed as a high performance C++ simulation, a part of me wants to do away with the Entity class and object-orientated foo all together and uses a relational model to define all of these entity states. My initial thought is to treat entities as an in memory database and use procedural query logic to read and write the various state information, with the necessary behavior logic that drives these queries written in C++. I am somewhat concerned about performance, although it would not surprise me if that was a non-issue. I am perhaps more concerned about what maintenance issues and additional complexity this would introduce, as opposed to the relatively simple list-of-capabilities approach.
Encapsulate what varies and Prefer object composition to inheritance, are the two OOAD principles at work here.
Check out the Bridge Design pattern. I visualize Vehicle abstraction as one thing that varies, and the other aspect that varies is the "Medium". Boat/Bus/Car are all Vehicle abstractions, while Water/Road/Rail are all Mediums.
I believe that in such a mechanism, there may be no need to maintain any capability. For example, if a Bus cannot move on Water, such a behavior can be modelled by a NOP behavior in the Vehicle Abstraction.
Use the Bridge pattern when
you want to avoid a permanent binding
between an abstraction and its
implementation. This might be the
case, for example, when the
implementation must be selected or
switched at run-time.
both the abstractions and their
implementations should be extensible
by subclassing. In this case, the
Bridge pattern lets you combine the
different abstractions and
implementations and extend them
independently.
changes in the implementation of an
abstraction should have no impact on
clients; that is, their code should
not have to be recompiled.
Now, with that background in mind, I am open to any design gurus telling me where I err'd in my reasoning.
You may be erring in using C++ to define a system for which you as yet have no need/no requirements:
This approach seems to be the most
obvious alternative, and is working
well for the time being. I, however,
worry about the maintenance issues
that this approach will have.
Effectively any arbitrary thing can be
added, and there is no single location
in which all possible capabilities are
defined. Its not a problem currently,
when the total number of things is
quite small, but I worry that it might
be a problem when someone else starts
trying to use and modify the code.
Maybe you should be considering principles like YAGNI as opposed to BDUF.
Some of my personal favourites are from Systemantics:
"15. A complex system that works is invariably found to have evolved from a simple system that works"
"16. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system."
You're also worring about performance, when you have no defined performance requirements, and no problems with performance:
I am somewhat concerned about
performance, although it would not
surprise me if that was a non-issue.
Also, I hope you know about double-dispatch, which might be useful for implementing anything-to-anything interactions (it's described in some detail in More Effective C++ by Scott Meyers).

For C/C++, When is it beneficial not to use Object Oriented Programming?

I find myself always trying to fit everything into the OOP methodology, when I'm coding in C/C++. But I realize that I don't always have to force everything into this mold. What are some pros/cons for using the OOP methodology versus not? I'm more interested in the pros/cons of NOT using OOP (for example, are there optimization benefits to not using OOP?). Thanks, let me know.
Of course it's very easy to explain a million reasons why OOP is a good thing. These include: design patterns, abstraction, encapsulation, modularity, polymorphism, and inheritance.
When not to use OOP:
Putting square pegs in round holes: Don't wrap everything in classes when they don't need to be. Sometimes there is no need and the extra overhead just makes your code slower and more complex.
Object state can get very complex: There is a really good quote from Joe Armstrong who invented Erlang:
The problem with object-oriented
languages is they’ve got all this
implicit environment that they carry
around with them. You wanted a banana
but what you got was a gorilla holding
the banana and the entire jungle.
Your code is already not OOP: It's not worth porting your code if your old code is not OOP. There is a quote from Richard Stallman in 1995
Adding OOP to Emacs is not clearly an
improvement; I used OOP when working
on the Lisp Machine window systems,
and I disagree with the usual view
that it is a superior way to program.
Portability with C: You may need to export a set of functions to C. Although you can simulate OOP in C by making a struct and a set of functions who's first parameter takes a pointer to that struct, it isn't always natural.
You may find more reasons in this paper entitled Bad Engineering Properties
of Object-Oriented Languages.
Wikipedia's Object Oriented Programming page also discusses some pros and cons.
One school of thought with object-oriented programming is that you should have all of the functions that operate on a class as methods on the class.
Scott Meyers, one of the C++ gurus, actually argues against this in this article:
How Non-Member Functions Improve Encapsulation.
He basically says, unless there's a real compelling reason to, you should keep the function SEPARATE from the class. Otherwise the class can turn into this big bloated unmanageable mess.
Based on experiences in a previous large project, I totally agree with him.
A benefit of non-oop functionality is that it often makes exporting your functionality to different languages easier. For example a simple DLL containing only functions is much easier to use in C#, you can use the P/Invoke to simply call the C++ functions. So in this sense it can be useful for writing extremely time critical algorithms that fit nicely into single/few function calls.
OOP is used a lot in GUI code, computer games, and simulations. Windows should be polymorphic - you can click on them, resize them, and so on. Computer game objects should be polymorphic - they probably have a location, a path to follow, they might have health, and they might have some AI behavior. Simulation objects also have behavior that is similar, but breaks down into classes.
For most things though, OOP is a bit of a waste of time. State usually just causes trouble, unless you have put it safely in the database where it belongs.
I suggest you read Bjarne's Paper about Why C++ is not just an Object-Oriented Programming Language
If we consider, for a moment, not object-orienatation itself but one
of the keystones of object-orientation: encapsulation.
It can be shown that change-propagation probability cannot increase
with distance from the change: if A depends on B and B depends on C,
and we change C, then the probability that A will change
cannot be larger than the proabability that B will
change. If B is a direct dependency on C and A is an indirect
dependency on C, then, more generally, to minimise the potential cost
of any change in a system we must miminimise the potential number of
direct dependencies.
The ISO defines encapsulation as the property that the information
contained in an object is accessible only through interactions at the
interfaces supported by the object.
We use encapsulation to minimise the number of potential dependencies
with the highest change-propagation probability. Basically,
encapsulation mitigates the ripple effect.
Thus one reason not to use encapsulation is when the system is so
small or so unchanging that the cost of potential ripple effects is
negligible. This is also, therefore, a case when OO might not be used
without potentially costly consequences.
Well, there are several alternatives. Non-OOP code in C++ may instead be:
C-style procedural code, or
C++-style generic programming
The only advantages to the first are the simplicity and backwards-compatibility. If you're writing a small trivial app, then messing around with classes is just a waste of time. If you're trying to write a "Hello World", just call printf already. Don't bother wrapping it in a class. And if you're working with an existing C codebase, it's probably not object-oriented, and trying to force it into a different paradigm than it already uses is just a recipe for pain.
For the latter, the situation is different, in that this approach is often superior to "traditional OOP".
Generic programming gives you greater performance (among other things because you often avoid the overhead of vtables, and because with less indirection, the compiler is better able to inline), better type safety (because the exact type is known, rather than hiding it behind an interface), and often cleaner and more concise code as well (STL iterators and algorithms enable much of this, without using a single instance of runtime polymorphism or virtual functions.
OOP is little more than an aging buzzword. A methodology that everyone misunderstood (The version supported by C++ and Java has little to do with what OOP originally meant, as in SmallTalk), and then pretended was the holy grail. There are aspects to it that are useful, certainly, but it is often not the best approach for designing an application.
Rather, express the overall logic by other means, for example generic programming, and when you need a class to encapsulate some simple concept, by all means design it according to OOP principles.
OOP is just a tool among many. The goal is not to write OOP code, but to write good code. Sometimes, the way to do this is by using OOP principles, but often, you can get better code using generic programmming principles, or functional programming.
It is a very project dependent decision. My general feel of OOP is that its useful for organizing large projects that involve multiple components. One area I find that OOP is especially pointless is school assignments. Excepting those specifically designed to teach OOP concepts, or large software design concepts, many of my assignments, specifically those in more algorithmy type classes are best suited to non-OOP design.
So specifically, smaller projects, that are not likely to grow large, and projects that center around a single algorithm seem to be non-OOP candidates in my books. Also, if you can write the specification as a linear set of steps, e.g., with no interactive GUI or state to maintain, this would also be an opportunity.
Of course, if you're required to use an OOP design, or an OOP toolkit, or if you have well defined 'objects' in you're spec, or if you need the features of polymorphism, etc. etc. etc...there are plenty of reasons to use it, the above seem to be indicators of when it would be simple not to.
Just my $0.02.
Having an Ada background, I develop in C in terms of packages containing data and their associated functions. This gives a code very modular with pieces of code that can be taken apart and reused on other projects. I don't feel the need to use OOP.
When I develop in Objective-C, objects are the natural container for data and code. I still develop with more or less the package concept in mind with some new cool features.
I'm used to be an OOP fanboy... Then realized using functions, generics and callbacks can often make a more elegant and change-friendly solution in C++ than classes and virtual functions.
Other big names realized it too: http://harmful.cat-v.org/software/OO_programming/
IMHO, I have a feeling that the OOP concept is not really suits the needs of the Big Data, as OOP assume all the stuff to be kept in memory (concept of Objects and member variables). This always result in memory demanding and heavy applications when OOP is used for example for big images processing. Instead, the simplicity of C maybe used with intensive parallel I/O making apps more efficient and easy to implement. It is the year 2019 I am writing this message...Everything may change in a year! :)
In my mind it comes down to what kind of model suits the problem at hand. It seems to me that OOP is best suited to coding GUI programs, in that the data and functionality for a graphical object is easily bundled together. Other problems- (such as a webserver, as an example off the top of my head), might be more easily modeled with a data centric approach, where there's no strong advantage to having a method and its data near each-other.
tl;dr depends on the problem.
I'd say the greatest benefit of C++ OOP is inheritance and polymorphism (Virtual function etc...) .
This allows for code reuse and extendibility
C++, use OOP - - - C, no, with certain exceptions
In C++ you should use OOP. It's a nice abstraction and it's the tool you are given. You either use it or leave it in the box where it can't help. You don't use the power saw for everything but I would read the manual and have it ready for the right job.
In C, it's a more difficult call. While you can certainly write arbitrarily object-oriented code in C, it's enough of a pain that you immediately find yourself fighting the language in order to use it. You may be more productive dropping the doesn't-fit-so-well design pattern and programming as C was intended to be used.
Furthermore, every time you make an array of function pointers or something in an OOP-in-C design pattern, you sever almost completely all visible links in the inheritance chain, making the code hard to maintain. In real OOP languages, there is an obvious chain of derived classes, often analyzed and documented for you. (mmm, javadoc.) Not so in OOP-in-C, and the tools available won't be able to see it.
So, I would argue in general against OOP in C. For a really complex program, you may well need the abstraction, and then you will have to do it despite needing to fight the language in the process and despite making the program quite hard to follow by anyone other than the original author.
But if you knew the program was going to become that complicated, you shouldn't have written it in C in the first place...
In C, there are some times when I 'emulate' the object oriented approach, by defining some sort of constructor with granular control over things like callbacks, when running several instances of it.
For instance, lets say I have some spiffy event handler library and I know that down the road I'm going to need many allocated copies:
So I would have (in C)
MyEvent *ev1 = new_eventhandler();
set_event_callback_func(ev1, callback_one);
ev1->setfd(fd1);
MyEvent *ev2 = new_eventhandler();
set_event_callback_func(ev2, callback_two);
ev2->setfd(fd2);
destroy_eventhandler(ev1);
destroy_eventhandler(ev2);
Obviously, I would later do something useful with that like handle received events in the two respective callback functions. I'm not going to really elaborate on the method of typing function pointers and structures to hold them, nor what would go on in the 'constructor' because its pretty obvious.
I think, this approach works for more advanced interfaces where its desirable to allow the user to define their own callbacks (and change them on the fly), or when working on complex non-blocking I/O services.
Otherwise, I much prefer a more procedural / functional approach.
Probably an unpopular idea but I think you should stick with non-OOP unless it adds something useful. In most practical problems OOP is useful but if I'm just playing with an idea I start writing non-object code and put functions and data into classes if it becomes useful.
Of course I still use other objects in my code (std::vector et al) and I use namespaces to help organise my functions but why put code into objects until it is useful? Equally don't shy away from free functions in an OO solution.
The question is tricky because OOP encompasses several concepts: object encapsulation, polymorphism, inheritance, etc. It's easy to take those ideas too far. Here's a concrete example:
When C++ first caught on, zillions of string classes sprung into being. Everything you could possibly imagine doing to a string (upcasing, downcasing, trimming, tokenizing, parsing, etc.) was a member function of some string class.
Notice, though, that std::strings from the STL don't have all these methods. STL is object-oriented--the state and implementation details of a string object are well encapsulated, only a small, orthogonal interface is exposed to the world. All the crazy manipulations that people used to include as member functions are now delegated to non-member functions.
This is powerful, because these functions can now work on any string class that exposes the same interface. If you use STL strings for most things and a specialty version tuned to your program's idiosyncracies, you don't have to duplicate member functions. You just have to implement the basic string interface and then you can re-use all those crazy manipulations.
Some people call this hybrid approach generic programming. It's still object-oriented programming, but it moves away from the "everything is a member-function" mentality that a lot of people associate with OOP.

Converting C source to C++

How would you go about converting a reasonably large (>300K), fairly mature C codebase to C++?
The kind of C I have in mind is split into files roughly corresponding to modules (i.e. less granular than a typical OO class-based decomposition), using internal linkage in lieu private functions and data, and external linkage for public functions and data. Global variables are used extensively for communication between the modules. There is a very extensive integration test suite available, but no unit (i.e. module) level tests.
I have in mind a general strategy:
Compile everything in C++'s C subset and get that working.
Convert modules into huge classes, so that all the cross-references are scoped by a class name, but leaving all functions and data as static members, and get that working.
Convert huge classes into instances with appropriate constructors and initialized cross-references; replace static member accesses with indirect accesses as appropriate; and get that working.
Now, approach the project as an ill-factored OO application, and write unit tests where dependencies are tractable, and decompose into separate classes where they are not; the goal here would be to move from one working program to another at each transformation.
Obviously, this would be quite a bit of work. Are there any case studies / war stories out there on this kind of translation? Alternative strategies? Other useful advice?
Note 1: the program is a compiler, and probably millions of other programs rely on its behaviour not changing, so wholesale rewriting is pretty much not an option.
Note 2: the source is nearly 20 years old, and has perhaps 30% code churn (lines modified + added / previous total lines) per year. It is heavily maintained and extended, in other words. Thus, one of the goals would be to increase mantainability.
[For the sake of the question, assume that translation into C++ is mandatory, and that leaving it in C is not an option. The point of adding this condition is to weed out the "leave it in C" answers.]
Having just started on pretty much the same thing a few months ago (on a ten-year-old commercial project, originally written with the "C++ is nothing but C with smart structs" philosophy), I would suggest using the same strategy you'd use to eat an elephant: take it one bite at a time. :-)
As much as possible, split it up into stages that can be done with minimal effects on other parts. Building a facade system, as Federico Ramponi suggested, is a good start -- once everything has a C++ facade and is communicating through it, you can change the internals of the modules with fair certainty that they can't affect anything outside them.
We already had a partial C++ interface system in place (due to previous smaller refactoring efforts), so this approach wasn't difficult in our case. Once we had everything communicating as C++ objects (which took a few weeks, working on a completely separate source-code branch and integrating all changes to the main branch as they were approved), it was very seldom that we couldn't compile a totally working version before we left for the day.
The change-over isn't complete yet -- we've paused twice for interim releases (we aim for a point-release every few weeks), but it's well on the way, and no customer has complained about any problems. Our QA people have only found one problem that I recall, too. :-)
What about:
Compiling everything in C++'s C subset and get that working, and
Implementing a set of facades leaving the C code unaltered?
Why is "translation into C++ mandatory"? You can wrap the C code without the pain of converting it into huge classes and so on.
Your application has lots of folks working on it, and a need to not-be-broken.
If you are serious about large scale conversion to an OO style, what
you need is massive transformation tools to automate the work.
The basic idea is to designate groups of data as classes, and then
get the tool to refactor the code to move that data into classes,
move functions on just that data into those classes,
and revise all accesses to that data to calls on the classes.
You can do an automated preanalysis to form statistic clusters to get some ideas,
but you'll still need an applicaiton aware engineer to decide what
data elements should be grouped.
A tool that is capable of doing this task is our DMS Software Reengineering
Toolkit.
DMS has strong C parsers for reading your code, captures the C code
as compiler abstract syntax trees, (and unlike a conventional compiler)
can compute flow analyses across your entire 300K SLOC.
DMS has a C++ front end that can be used as the "back" end;
one writes transformations that map C syntax to C++ syntax.
A major C++ reengineering task on a large avionics system gives
some idea of what using DMS for this kind of activity is like.
See technical papers at
www.semdesigns.com/Products/DMS/DMSToolkit.html,
specifically
Re-engineering C++ Component Models Via Automatic Program Transformation
This process is not for the faint of heart. But than anybody
that would consider manual refactoring of a large application
is already not afraid of hard work.
Yes, I'm associated with the company, being its chief architect.
I would write C++ classes over the C interface. Not touching the C code will decrease the chance of messing up and quicken the process significantly.
Once you have your C++ interface up; then it is a trivial task of copy+pasting the code into your classes. As you mentioned - during this step it is vital to do unit testing.
GCC is currently in midtransition to C++ from C. They started by moving everything into the common subset of C and C++, obviously. As they did so, they added warnings to GCC for everything they found, found under -Wc++-compat. That should get you on the first part of your journey.
For the latter parts, once you actually have everything compiling with a C++ compiler, I would focus on replacing things that have idiomatic C++ counterparts. For example, if you're using lists, maps, sets, bitvectors, hashtables, etc, which are defined using C macros, you will likely gain a lot by moving these to C++. Likewise with OO, you'll likely find benefits where you are already using a C OO idiom (like struct inheritence), and where C++ will afford greater clarity and better type checking on your code.
Your list looks okay except I would suggest reviewing the test suite first and trying to get that as tight as possible before doing any coding.
Let's throw another stupid idea:
Compile everything in C++'s C subset and get that working.
Start with a module, convert it in a huge class, then in an instance, and build a C interface (identical to the one you started from) out of that instance. Let the remaining C code work with that C interface.
Refactor as needed, growing the OO subsystem out of C code one module at a time, and drop parts of the C interface when they become useless.
Probably two things to consider besides how you want to start are on what you want to focus, and where you want to stop.
You state that there is a large code churn, this may be a key to focus your efforts. I suggest you pick the parts of your code where a lot of maintenance is needed, the mature/stable parts are apparently working well enough, so it is better to leave them as they are, except probably for some window dressing with facades etc.
Where you want to stop depends on what the reason is for wanting to convert to C++. This can hardly be a goal in itself. If it is due to some 3rd party dependency, focus your efforts on the interface to that component.
The software I work on is a huge, old code base which has been 'converted' from C to C++ years ago now. I think it was because the GUI was converted to Qt. Even now it still mostly looks like a C program with classes. Breaking the dependencies caused by public data members, and refactoring the huge classes with procedural monster methods into smaller methods and classes never has really taken off, I think for the following reasons:
There is no need to change code that is working and that does not need to be enhanced. Doing so introduces new bugs without adding functionality, and end users don't appreciate that;
It is very, very hard to do refactor reliably. Many pieces of code are so large and also so vital that people hardly dare touching it. We have a fairly extensive suite of functional tests, but sufficient code coverage information is hard to get. As a result, it is difficult to establish whether there are already sufficient tests in place to detect problems during refactoring;
The ROI is difficult to establish. The end user will not benefit from refactoring, so it must be in reduced maintenance cost, which will increase initially because by refactoring you introduce new bugs in mature, i.e. fairly bug-free code. And the refactoring itself will be costly as well ...
NB. I suppose you know the "Working effectively with Legacy code" book?
You mention that your tool is a compiler, and that: "Actually, pattern matching, not just type matching, in the multiple dispatch would be even better".
You might want to take a look at maketea. It provides pattern matching for ASTs, as well as the AST definition from an abstract grammar, and visitors, tranformers, etc.
If you have a small or academic project (say, less than 10,000 lines), a rewrite is probably your best option. You can factor it however you want, and it won't take too much time.
If you have a real-world application, I'd suggest getting it to compile as C++ (which usually means primarily fixing up function prototypes and the like), then work on refactoring and OO wrapping. Of course, I don't subscribe to the philosophy that code needs to be OO structured in order to be acceptable C++ code. I'd do a piece-by-piece conversion, rewriting and refactoring as you need to (for functionality or for incorporating unit testing).
Here's what I would do:
Since the code is 20 years old, scrap down the parser/syntax analyzer and replace it with one of the newer lex/yacc/bison(or anything similar) etc based C++ code, much more maintainable and easier to understand. Faster to develop too if you have a BNF handy.
Once this is retrofitted to the old code, start wrapping modules into classes. Replace global/shared variables with interfaces.
Now what you have will be a compiler in C++ (not quite though).
Draw a class diagram of all the classes in your system, and see how they are communicating.
Draw another one using the same classes and see how they ought to communicate.
Refactor the code to transform the first diagram to the second. (this might be messy and tricky)
Remember to use C++ code for all new code added.
If you have some time left, try replacing data structures one by one to use the more standardized STL or Boost.