Need refactoring ideas for Arrow Anti-Pattern

I have inherited a monster.
It is masquerading as a .NET 1.1 application that processes text files that conform to Healthcare Claim Payment (ANSI 835) standards, but it's a monster. The information being processed relates to healthcare claims, EOBs, and reimbursements. These files consist of records that have an identifier in the first few positions and data fields formatted according to the specs for that type of record. Some record ids are Control Segment ids, which delimit groups of records relating to a particular type of transaction.
To process a file, my little monster reads the first record, determines the kind of transaction that is about to take place, then begins to process other records based on what kind of transaction it is currently processing. To do this, it uses a nested if. Since there are a number of record types, there are a number of decisions that need to be made. Each decision involves some processing and 2-3 other decisions that need to be made based on previous decisions. That means the nested if has a lot of nests. That's where my problem lies.
This one nested if is 715 lines long. Yes, that's right. Seven-Hundred-And-Fif-Teen Lines. I'm no code analysis expert, so I downloaded a couple of freeware analysis tools and came up with a McCabe Cyclomatic Complexity rating of 49. They tell me that's a pretty high number. High as in pollen count in the Atlanta area where 100 is the standard for high and the news says "Today's pollen count is 1,523". This is one of the finest examples of the Arrow Anti-Pattern I have ever been privileged to see. At its highest, the indentation goes 15 tabs deep.
My question is, what methods would you suggest to refactor or restructure such a thing?
I have spent some time searching for ideas, but nothing has given me a good foothold. For example, substituting a guard condition for a level is one method. I have only one of those. One nest down, fourteen to go.
Perhaps there is a design pattern that could be helpful. Would Chain of Command be a way to approach this? Keep in mind that it must stay in .NET 1.1.
Thanks for any and all ideas.

I just had some legacy code at work this week that was similar to (although not as dire as) what you are describing.
There is no one thing that will get you out of this. The state machine might be the final form your code takes, but that's not going to help you get there, nor should you decide on such a solution before untangling the mess you already have.
The first step I would take is to write a test for the existing code. This test isn't to show that the code is correct but to make sure you have not broken something when you start refactoring. Get a big wad of data to process, feed it to the monster, and get the output. That's your litmus test. If you can do this with a code coverage tool, you will see what your test does not cover. If you can, construct some artificial records that will also exercise this code, and repeat. Once you feel you have done what you can with this task, the output data becomes your expected result for your test.
Refactoring should not change the behavior of the code. Remember that. This is why you have known input and known output data sets to validate you are not going to break things. This is your safety net.
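A minimal sketch of that safety net, written in C++ purely for illustration (the application in question is .NET 1.1). processRecords is a hypothetical stand-in for the legacy routine, and the record strings are made up; the point is only that the output captured from the untouched code becomes the expected value.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the legacy routine. In the real project this
// would be the existing 715-line method, called without modification.
std::string processRecords(const std::string& input) {
    std::istringstream in(input);
    std::ostringstream out;
    std::string record;
    while (std::getline(in, record)) {
        out << record.size() << ':' << record << '\n';
    }
    return out.str();
}

// Characterization check: "golden" is the output recorded from the code
// before any refactoring. A mismatch after a refactoring step means the
// behavior changed, whether or not the old behavior was correct.
bool matchesGoldenOutput(const std::string& input, const std::string& golden) {
    return processRecords(input) == golden;
}
```

In practice the input and golden output would live in files checked in next to the code, and the check would run after every refactoring step.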
Now Refactor!
A couple of things I did that I found useful:
Invert if statements
A huge problem I had was just reading the code when I couldn't find the corresponding else statement. I noticed that a lot of the blocks looked like this:
if (someCondition)
{
    // 100+ lines of code
    {
        ...
    }
}
else
{
    // simple statement here
}
By inverting the if, I could see the simple case first and then move on to the more complex block knowing what the other one already did. Not a huge change, but it helped me in understanding.
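A small sketch of the inverted shape, in C++ for illustration. The function names are hypothetical, and the early return goes one step further than a plain inversion: it assumes the simple branch can leave the method, which will not be true of every block.

```cpp
#include <string>

// Hypothetical stand-in for the 100+ line block.
std::string handleComplexCase(const std::string& record) {
    return "processed " + record;
}

// After inverting the condition, the simple case is read first and the
// long block sits one indentation level shallower.
std::string handle(const std::string& record, bool someCondition) {
    if (!someCondition) {
        return "skipped";  // the simple statement, visible up front
    }
    return handleComplexCase(record);
}
```

When the simple branch cannot return early, the plain inversion (short branch first, long branch in the else) still gives the readability benefit described above.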
Extract Method
I used this a lot. Take some complex multi-line block, grok it, and shove it aside in its own method. This allowed me to more easily see where there was code duplication.
Now, hopefully, you haven't broken your code (test still passes, right?), and you have more readable and better understood procedural code. Look, it's already improved! But that test you wrote earlier isn't really good enough... it only tells you that you are duplicating the functionality (bugs and all) of the original code, and that's only for the lines you had coverage on, as I'm sure you will find blocks of code that you can't figure out how to hit or that just cannot ever be hit (I've seen both in my work).
Now the big changes, where all the big-name patterns come into play, happen when you start looking at how you can refactor this in a proper OO fashion. There is more than one way to skin this cat, and it will involve multiple patterns. Not knowing details about the format of these files you're parsing, I can only toss around some helpful suggestions that may or may not be the best solutions.
Refactoring to Patterns is a great book to assist in explaining patterns that are helpful in these situations.
You're trying to eat an elephant, and there's no other way to do it but one bite at a time. Good luck.

A state machine seems like the logical place to start, and using WF if you can swing it (sounds like you can't).
You can still implement one without WF, you just have to do it yourself. However, thinking of it like a state machine from the start will probably give you a better implementation than creating a procedural monster that checks internal state on every action.
Diagram out your states and what causes each transition. The actual code to process a record should be factored out, and called when the state executes (if that particular state requires it).
So State1's execute calls your "read a record", then based on that record transitions to another state.
The next state may read multiple records and call record processing instructions, then transition back to State1.
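A rough sketch of that arrangement in C++ (used here only for illustration). Every name is hypothetical, and the record ids "CLAIM" and "END" are invented placeholders, not real ANSI 835 segment ids. Each state reads what it needs and returns the state that should run next.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record source: hands out one record per call.
struct RecordReader {
    std::vector<std::string> records;
    std::size_t next = 0;
    bool done() const { return next >= records.size(); }
    std::string read() { return records[next++]; }
};

struct Context;

// Each state reads what it needs and returns the state to run next.
struct State {
    virtual ~State() {}
    virtual State* execute(Context& ctx) = 0;
};

struct Context {
    RecordReader reader;
    std::vector<std::string> log;  // stands in for the real record processing
    State* header;
    State* claim;
};

// "State1": reads one record and decides which transaction follows.
struct HeaderState : State {
    State* execute(Context& ctx) override {
        std::string rec = ctx.reader.read();
        ctx.log.push_back("header " + rec);
        return rec == "CLAIM" ? ctx.claim : ctx.header;
    }
};

// Reads several records until the group ends, then goes back to the header state.
struct ClaimState : State {
    State* execute(Context& ctx) override {
        while (!ctx.reader.done()) {
            std::string rec = ctx.reader.read();
            if (rec == "END") break;
            ctx.log.push_back("claim line " + rec);
        }
        return ctx.header;
    }
};

// Drives the machine until the input is exhausted.
void run(Context& ctx) {
    State* current = ctx.header;
    while (!ctx.reader.done()) {
        current = current->execute(ctx);
    }
}
```

The record-processing code (here just a log entry) stays outside the transition logic, so each piece can be read and tested separately.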

One thing I do in these cases is to use the 'Composed Method' pattern. See Jeremy Miller's Blog Post on this subject. The basic idea is to use the refactoring tools in your IDE to extract small meaningful methods. Once you've done that, you may be able to further refactor and extract meaningful classes.

I would start with uninhibited use of Extract Method. If you don't have it in your current Visual Studio IDE, you can either get a 3rd-party addin, or load your project in a newer VS. (It'll try to upgrade your project, but you will carefully ignore those changes instead of checking them in.)
You said that you have code indented 15 levels. Start about 1/2-way out, and Extract Method. If you can come up with a good name, use it, but if you can't, extract anyway. Split in half again. You're not going for the ideal structure here; you're trying to break the code into pieces that will fit in your brain. My brain is not very big, so I'd keep breaking & breaking until it doesn't hurt any more.
As you go, look for any new long methods that seem to be different than the rest; make these into new classes. Just use a simple class that has only one method for now. Heck, making the method static is fine. Not because you think they're good classes, but because you are so desperate for some organization.
Check in often as you go, so you can checkpoint your work, understand the history later, be ready to do some "real work" without needing to merge, and save your teammates the hassle of hard merging.
Eventually you'll need to go back and make sure the method names are good, that the set of methods you've created make sense, clean up the new classes, etc.
If you have a highly reliable Extract Method tool, you can get away without good automated tests. (I'd trust VS in this, for example.) Otherwise, make sure you're not breaking things, or you'll end up worse than you started: with a program that doesn't work at all.
A pairing partner would be helpful here.

Judging by the description, a state machine might be the best way to deal with it. Have an enum variable to store the current state, and implement the processing as a loop over the records, with a switch or if statements to select the action to take based on the current state and the input data. You can also easily dispatch the work to separate functions based on the state using function pointers, too, if it's getting too bulky.
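A minimal sketch of the enum-plus-switch version, in C++ for illustration. The record ids ("HDR", "END") and the action strings are assumptions made up for the example, not real ANSI 835 segment ids.

```cpp
#include <string>
#include <vector>

enum class ParseState { ExpectHeader, InClaim };

// Loops over the records once; the current state and the record together
// select the action and the next state.
std::vector<std::string> processFile(const std::vector<std::string>& records) {
    std::vector<std::string> actions;
    ParseState state = ParseState::ExpectHeader;
    for (const std::string& rec : records) {
        switch (state) {
        case ParseState::ExpectHeader:
            if (rec == "HDR") {
                actions.push_back("open claim group");
                state = ParseState::InClaim;
            } else {
                actions.push_back("unexpected " + rec);
            }
            break;
        case ParseState::InClaim:
            if (rec == "END") {
                actions.push_back("close claim group");
                state = ParseState::ExpectHeader;
            } else {
                actions.push_back("claim detail " + rec);
            }
            break;
        }
    }
    return actions;
}
```

Once a case body grows past a few lines, moving it into its own function (or a table of function pointers indexed by state) keeps the loop itself short.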

There was a pretty good blog post about it at Coding Horror. I've only come across this anti-pattern once, and I pretty much just followed his steps.

Sometimes I combine the state pattern with a stack.
It works well for hierarchical structures; a parent element knows what state to push onto the stack to handle a child element, but a child doesn't have to know anything about its parent. In other words, the child doesn't know what the next state is, it simply signals that it is "complete" and gets popped off the stack. This helps to decouple the states from each other by keeping dependencies uni-directional.
It works great for processing XML with a SAX parser (the content handler just pushes and pops states to change its behavior as elements are entered and exited). EDI should lend itself to this approach too.
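A sketch of the state-plus-stack idea in C++, under made-up assumptions: the handler names and the "BEGIN"/"END" markers are hypothetical. The parent pushes the child that handles a nested group; the child only reports that it is complete.

```cpp
#include <memory>
#include <stack>
#include <string>
#include <vector>

struct Handler;
using HandlerStack = std::stack<std::shared_ptr<Handler>>;

// A handler sees one record at a time. It may push a child handler, or
// return true to signal that it is complete and should be popped.
struct Handler {
    virtual ~Handler() {}
    virtual bool handle(const std::string& rec, HandlerStack& stack,
                        std::vector<std::string>& out) = 0;
};

// Child: knows nothing about its parent; it only reports "complete".
struct ClaimHandler : Handler {
    bool handle(const std::string& rec, HandlerStack&,
                std::vector<std::string>& out) override {
        if (rec == "END") return true;
        out.push_back("claim:" + rec);
        return false;
    }
};

// Parent: knows which child state handles a nested group.
struct FileHandler : Handler {
    bool handle(const std::string& rec, HandlerStack& stack,
                std::vector<std::string>& out) override {
        if (rec == "BEGIN") {
            stack.push(std::make_shared<ClaimHandler>());
        } else {
            out.push_back("file:" + rec);
        }
        return false;  // the root handler never completes in this sketch
    }
};

std::vector<std::string> parse(const std::vector<std::string>& records) {
    std::vector<std::string> out;
    HandlerStack stack;
    stack.push(std::make_shared<FileHandler>());
    for (const std::string& rec : records) {
        // Copy the pointer so the handler stays alive if the stack changes.
        std::shared_ptr<Handler> top = stack.top();
        if (top->handle(rec, stack, out)) {
            stack.pop();
        }
    }
    return out;
}
```

The dependency runs one way only: FileHandler names ClaimHandler, but ClaimHandler never mentions what sits below it on the stack.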

Related

Removing dependencies from statechart framework

I've got lots of problems with the project I am currently working on. The project is more than 10 years old and it was based on one of those commercial C++ frameworks which were very popular in the 90's. The problem is with statecharts. The framework provides quite a common implementation of the state pattern. Each state is a separate class, with action on entry, action in state etc. There is a switch which sets the current state according to received events.
The devil is hidden in the details. That project is enormous. It's something like 2000 KLOC. There are definitely too many statecharts (I've seen "for" loops implemented using statecharts). What's more, the framework allows embedding a statechart in another statechart, so there are many statecharts with seven or even more levels of nesting. Because statecharts run in different threads, and it's possible to send events between statecharts, we have lots of synchronization problems (and a big mess in the interfaces).
I must admit that the scale of this problem is overwhelming and I don't know how to touch it. My first idea was to remove as much code as I can from the statecharts and put it into separate classes, then delegate to these classes from the statecharts to do the work. But as a result we will have many separate functions which logically don't have any specific functionality, and any change in the statechart architecture will also require changes to those classes and functions.
So I am asking for help:
Do you know any books/articles/magic artefacts which can help me to fix this? I would like to at least separate as much code as I can from the statecharts without introducing any hidden dependencies, and keep the separated code maintainable, testable and reusable.
If you have any suggestion how to handle this, please let me know.

The statechart pattern is intended to be used specifically to remove switch statements, so this sounds like a horrid abuse. Additionally, states should only change on asynchronous events. If you are processing an event and you change through multiple states (or for loop, etc.), then this is also a horrid abuse of the pattern.
I would start from these two points, as they will solve much of your concurrency issues just fixing them up. What you need to determine is:
1 - What are your external, asynchronous events to the system? These are the only things that should be determining state transitions, not things that happen during event processing. An event may cause 0 or 1 state transitions. Once you have a list of these state transitions, you can reconstruct the actual states of your system. If you are aware of UML State diagrams, this would be a perfect time to sketch one up in a charting program, not just for yourself (though it will help you immensely), but also for everyone in the future that has to return to the project. As you have learned, this happens.
2 - Now that you know what the real states are, list the things that are states in the code but shouldn't be. These usually indicate that something can be "functionally decomposed". Instead of a state object for each of these, likely all that is needed is a separate function. This will cut down on a lot of the overhead of state objects and should clean up the code immensely.
3 - Now it's time to tackle those horrendous switch statements you mentioned. If they were truly based on state, you shouldn't need one at all. Instead, you should be able to call the state machine directly.
Something like:
myStateMachine->myEvent();
and it should work without any switch. But notice, this may be the case even for some of those objects that don't work across asynchronous events. This is also an indication of where you may just use inheritance to get the same effect. If you have:
switch (someTypeIdentifier)
{
    case type1:
        doSomething();
        break;
    case type2:
        doSomethingElse();
        break;
}
usually the correct OOP approach is to create two actual types, Type1 and Type2, both derived from an abstract base TypeBase, with a virtual method doSomething() that does what you need. The reason this is useful is that it means you can "close" the handling (in the meaning of the Open/Closed Principle), and still extend the functionality by adding new derived types as needed (leaving it open to extension). This saves bugs like crazy because it gets developers' hands out of those switch statements, which can get quite ugly and convoluted, instead encapsulating each separate behavior in separate classes.
4 - Now look to fix up your thread issues. Identify all objects used from multiple threads. Make a list. Now, how are these used? Are some of them always used together? Start making groups. The goal here is to find the level of encapsulation that best works for these objects, separate the objects into individual classes that control their own synchronisation, figure out the atomic level of actual "transactions" for the objects, and make methods of the classes that expose those meaningful transactions, wrapped behind the scenes with the appropriate mutexes, condition variables, etc.
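As one possible illustration of point 4, here is a sketch of a class that owns its own synchronisation and exposes only whole "transactions". EventMailbox and its methods are hypothetical names, not part of any framework mentioned here, and std::mutex assumes a C++11 compiler, which a project of this age may not have.

```cpp
#include <mutex>
#include <queue>
#include <string>

// Hypothetical shared object for passing events between statecharts.
// Callers never see the mutex; they only see meaningful operations.
class EventMailbox {
public:
    // Transaction 1: add an event.
    void post(const std::string& event) {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push(event);
    }

    // Transaction 2: "check and remove" happens under one lock, so two
    // consumers can never both take the same event or pop an empty queue.
    bool tryTake(std::string& event) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (events_.empty()) return false;
        event = events_.front();
        events_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<std::string> events_;
};
```

Exposing separate empty() and pop() methods instead would push the locking burden back onto every caller, which is the kind of interface mess the question describes.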
You might be saying "that sounds like a lot of work! Why do all that instead of just writing it all over myself?" Good question! :) The reason is actually straightforward: if you are going to do it all by yourself, those are the steps you should be doing anyway. You should be identifying your states, your dynamic polymorphism, and getting a handle on the multithreaded transactions. But, if you start with the existing code, you also have all of those unspoken business rules that were never documented and may cause all sorts of unexpected bugs down the line. You don't have to bring everything over - if you suspect it's a bug, discuss the logic with the people who have worked with the system in the past (if available), QA, or whoever might identify bugs, and see if it really should be carried over. But you need to actually evaluate what the bugs are either way, or you may not code something that actually needed coding.
In the end, this is a manual process that is a part of software engineering. There are CASE tools that can help draw up the state diagrams and even publish them to code, there are refactoring tools, like those found in many IDEs, that can help move code between functions and classes, and similar tools which can help identify threading needs. However, those things shouldn't be picked up for a single project. They need to be learned throughout your career, picking them up and learning them more deeply over years of work, as they are a part of being a software engineer. They don't do it for you. You still need to know the whys and hows, and they just help get it done more efficiently.

Statecharts (including nested Statecharts) are a powerful way to specify, understand and even simulate/validate complex control flow. But to gain the benefit, you need the statechart model in a suitable tool (I used Statemate way back in the day, not sure if it's still available), plus a reliable mapping from the chart to the code (Statemate used to generate the code) - then you can forget about the state management code (mostly)! In your situation, if you don't have the model, I would try to reverse one from the code - as Ira says, chances are high that the original developers had a model in some form, and you may find the code making a lot of sense as the model emerges. If this works out, you will have a really good spec/model of the code which should make future code edits much easier (even if you don't want to go to automatic code generation, and maintain the code/model mapping manually (but you'll need to be meticulous!!))

Sounds to me like your best bet is (gulp!) likely to start from scratch if it's as horrifically broken as you make out. Is there any documentation? Could you begin to build some saner software based on the docs?
If a complete re-write isn't an option (and they never are in my experience) I'd try some of the following:
If you don't already have it, draw an architectural picture of the whole system. Sketch out how all the bits are supposed to work together and that will help you break the system down into potentially manageable / testable parts.
Do you have any kind of requirements or testing plan in place? If not, can you write one and start to put unit tests in place for the various chunks of code / functionality which exist already? If you can do that, you can start to refactor things without breaking as much of whatever does currently work.
Once you've broken things down a bit, start building your unit tests into integration tests which pull together more of the functionality.
I've not read them myself, but I've heard good things about these books which may have some advice you can use:
Refactoring: Improving the Design of Existing Code (Object Technology Series).
Working Effectively with Legacy Code (Michael Feathers; Robert C. Martin Series)
Good luck! :-)

What to do about an 11000-line C++ source file?

So we have this huge (is 11000 lines huge?) mainmodule.cpp source file in our project and every time I have to touch it I cringe.
As this file is so central and large, it keeps accumulating more and more code and I can't think of a good way to make it actually start to shrink.
The file is used and actively changed in several (> 10) maintenance versions of our product and so it is really hard to refactor it. If I were to "simply" split it up, say for a start, into 3 files, then merging back changes from maintenance versions will become a nightmare. And also if you split up a file with such a long and rich history, tracking and checking old changes in the SCC history suddenly becomes a lot harder.
The file basically contains the "main class" (main internal work dispatching and coordination) of our program, so every time a feature is added, it also affects this file and every time it grows. :-(
What would you do in this situation? Any ideas on how to move new features to a separate source file without messing up the SCC workflow?
(Note on the tools: We use C++ with Visual Studio; We use AccuRev as SCC but I think the type of SCC doesn't really matter here; We use Araxis Merge to do actual comparison and merging of files)

Merging now will not be as big a nightmare as it will be when you have a 30000 LOC file in the future. So:
Stop adding more code to that file.
Split it.
If you can't just stop coding during the refactoring process, you could leave this big file as is for a while, at least without adding more code to it: since it contains one "main class", you could inherit from it and keep the derived class(es) with overridden functions in several new small and well-designed files.

Find some code in the file which is relatively stable (not changing fast, and doesn't vary much between branches) and could stand as an independent unit. Move this into its own file, and for that matter into its own class, in all branches. Because it's stable, this won't cause (many) "awkward" merges that have to be applied to a different file from the one they were originally made on, when you merge the change from one branch to another. Repeat.
Find some code in the file which basically only applies to a small number of branches, and could stand alone. Doesn't matter whether it's changing fast or not, because of the small number of branches. Move this into its own classes and files. Repeat.
So, we've got rid of the code that's the same everywhere, and the code that's specific to certain branches.
This leaves you with a nucleus of badly-managed code - it's needed everywhere, but it's different in every branch (and/or it changes constantly so that some branches are running behind others), and yet it's in a single file that you're unsuccessfully trying to merge between branches. Stop doing that. Branch the file permanently, perhaps by renaming it in each branch. It's not "main" any more, it's "main for configuration X". OK, so you lose the ability to apply the same change to multiple branches by merging, but this is in any case the core of code where merging doesn't work very well. If you're having to manually manage the merges anyway to deal with conflicts, then it's no loss to manually apply them independently on each branch.
I think you're wrong to say that the kind of SCC doesn't matter, because for example git's merging abilities are probably better than the merge tool you're using. So the core problem, "merging is difficult" occurs at different times for different SCCs. However, you're unlikely to be able to change SCCs, so the issue is probably irrelevant.

It sounds to me like you're facing a number of code smells here. First of all the main class appears to violate the open/closed principle. It also sounds like it is handling too many responsibilities. Due to this I would assume the code to be more brittle than it needs to be.
While I can understand your concerns regarding traceability following a refactoring, I would expect that this class is rather hard to maintain and enhance and that any changes you do make are likely to cause side effects. I would assume that the cost of these outweighs the cost of refactoring the class.
In any case, since the code smells will only get worse with time, at least at some point the cost of these will outweigh the cost of refactoring. From your description I would assume that you're past the tipping point.
Refactoring this should be done in small steps. If possible add automated tests to verify current behavior before refactoring anything. Then pick out small areas of isolated functionality and extract these as types in order to delegate the responsibility.
In any case, it sounds like a major project, so good luck :)

The only solution I have ever imagined to such problems follows. The actual gain from the described method is that the changes are gradual. Evolution, not revolution; otherwise you'll be in trouble very fast.
Insert a new cpp class above the original main class. For now, it would basically redirect all calls to the current main class, but aim at making the API of this new class as clear and succinct as possible.
Once this has been done, you get the possibility to add new functionalities in new classes.
As for existing functionalities, you have to progressively move them in new classes as they become stable enough. You will lose SCC help for this piece of code, but there is not much that can be done about that. Just pick the right timing.
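The three steps above can be sketched roughly as follows. All class and method names are hypothetical; LegacyMain merely stands in for the existing main class, which stays untouched at first.

```cpp
#include <string>

// Stand-in for the existing main class; left as it is for now.
class LegacyMain {
public:
    std::string dispatch(int code, const std::string& arg) {
        return code == 1 ? "print:" + arg : "save:" + arg;
    }
};

// A new feature, written in its own small class from day one.
class Exporter {
public:
    std::string exportAs(const std::string& doc) { return "export:" + doc; }
};

// The new class placed above the old one. It offers a clear API and,
// for now, mostly redirects calls to the legacy class.
class MainFacade {
public:
    std::string print(const std::string& doc) { return legacy_.dispatch(1, doc); }
    std::string save(const std::string& doc) { return legacy_.dispatch(2, doc); }

    // New functionality never touches the legacy file.
    std::string exportDoc(const std::string& doc) { return exporter_.exportAs(doc); }

private:
    LegacyMain legacy_;
    Exporter exporter_;
};
```

Callers only ever see MainFacade, so an existing feature can later move out of LegacyMain into its own class without changing any call site.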
I know this is not perfect, though I hope it can help, and the process must be adapted to your needs!
Additional information
Note that Git is an SCC that can follow pieces of code from one file to another. I have heard good things about it, so it could help while you are progressively moving your work.
Git is built around the notion of blobs which, if I understand correctly, hold the contents of whole files rather than pieces of them. Git does not record moves explicitly; instead it detects moved or copied code after the fact (for example, git blame -C looks for lines that came from other files), so it can often follow code you move between files. Apart from the video from Linus Torvalds mentioned in comments below, I have not been able to find something clear about this.

Confucius say: "first step to getting out of hole is to stop digging hole."

Let me guess: Ten clients with divergent feature sets and a sales manager that promotes "customization"? I've worked on products like that before. We had essentially the same problem.
You recognize that having an enormous file is trouble, but even more trouble is ten versions that you have to keep "current". That's multiple maintenance. SCC can make that easier, but it can't make it right.
Before you try to break the file into parts, you need to bring the ten branches back in sync with each other so that you can see and shape all the code at once. You can do this one branch at a time, testing both branches against the same main code file. To enforce the custom behavior, you can use #ifdef and friends, but it's better as much as possible to use ordinary if/else against defined constants. This way, your compiler will verify all types and most probably eliminate "dead" object code anyway. (You may want to turn off the warning about dead code, though.)
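A small sketch of the "ordinary if/else against defined constants" idea. CLIENT_VARIANT and the invoice formats are invented for the example; each branch build would set the macro itself (for instance on the compiler command line).

```cpp
#include <string>

// Hypothetical build-time setting; defaults to variant 1 when not provided.
#ifndef CLIENT_VARIANT
#define CLIENT_VARIANT 1
#endif

const int kClientVariant = CLIENT_VARIANT;

std::string formatInvoice(const std::string& id) {
    // Unlike an #ifdef, both branches are always parsed and type-checked,
    // so a customization that no longer compiles is caught in every build.
    // The compiler is free to drop the branch that can never run.
    if (kClientVariant == 2) {
        return "INV/" + id;
    } else {
        return "Invoice " + id;
    }
}
```

With an #ifdef, the inactive branch is invisible to the compiler and can rot unnoticed until someone builds that variant.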
Once there's only one version of that file shared implicitly by all branches, then it's rather easier to begin traditional refactoring methods.
The #ifdefs are primarily better for sections where the affected code only makes sense in the context of other per-branch customizations. One may argue that these also present an opportunity for the same branch-merging scheme, but don't go hog-wild. One colossal project at a time, please.
In the short run, the file will appear to grow. This is OK. What you're doing is bringing things together that need to be together. Afterwards, you'll begin to see areas that are clearly the same regardless of version; these can be left alone or refactored at will. Other areas will clearly differ depending on the version. You have a number of options in this case. One method is to delegate the differences to per-version strategy objects. Another is to derive client versions from a common abstract class. But none of these transformations are possible as long as you have ten "tips" of development in different branches.
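One way the per-version strategy objects might look, as a sketch with invented names (LabelStrategy, ClientXLabels and Application are all hypothetical):

```cpp
#include <memory>
#include <string>

// The behavior that differs between client versions, behind one interface.
struct LabelStrategy {
    virtual ~LabelStrategy() {}
    virtual std::string label(const std::string& item) const = 0;
};

struct StandardLabels : LabelStrategy {
    std::string label(const std::string& item) const override {
        return "Item: " + item;
    }
};

struct ClientXLabels : LabelStrategy {
    std::string label(const std::string& item) const override {
        return "X-" + item;
    }
};

// Shared main code: identical for every version. Only the strategy
// handed in at startup differs.
class Application {
public:
    explicit Application(std::shared_ptr<LabelStrategy> strategy)
        : strategy_(strategy) {}

    std::string describe(const std::string& item) const {
        return "[" + strategy_->label(item) + "]";
    }

private:
    std::shared_ptr<LabelStrategy> strategy_;
};
```

The customization then lives in one small class per client instead of being scattered through the shared file.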

I don't know if this solves your problem, but what I guess you want to do is migrate the content of the file to smaller files independent of each other (summed up).
What I also get is that you have about 10 different versions of the software floating around and you need to support them all without messing things up.
First of all, there is just no way that this is easy and will solve itself in a few minutes of brainstorming. The functions linked in your file are all vital to your application, and simply cutting them off and migrating them to other files won't solve your problem.
I think you only have these options:
Don't migrate and stay with what you have. Possibly quit your job and start working on serious software with good design in addition. Extreme programming is not always the best solution if you are working on a long time project with enough funds to survive a crash or two.
Work out a layout of how you would love your file to look once it's split up. Create the necessary files and integrate them in your application. Rename the functions or overload them to take an additional parameter (maybe just a simple boolean?).
Once you have to work on your code, migrate the functions you need to work on to the new file and map the function calls of the old functions to the new functions.
You should still have your main-file this way, and still be able to see the changes that were made to it, once it comes to a specific function you know exactly when it was outsourced and so on.
Try to convince your co-workers with some good cake that workflow is overrated and that you need to rewrite some parts of the application in order to do serious business.

Exactly this problem is handled in one of the chapters of the book "Working Effectively with Legacy Code" (http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052).

I think you would be best off creating a set of command classes that map to the API points of the mainmodule.cpp.
Once they are in place, you will need to refactor the existing code base to access these API points via the command classes, once that's done, you are free to refactor each command's implementation into a new class structure.
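A rough sketch of what such command classes could look like. MainModule here is only a stand-in for the class in mainmodule.cpp, and every method and command name is hypothetical.

```cpp
#include <map>
#include <memory>
#include <string>

// Stand-in for the big main class; its API points stay as they are at first.
class MainModule {
public:
    std::string openDocument(const std::string& name) { return "opened " + name; }
    std::string closeDocument(const std::string& name) { return "closed " + name; }
};

// One small class per API point.
struct Command {
    virtual ~Command() {}
    virtual std::string execute(const std::string& arg) = 0;
};

class OpenCommand : public Command {
public:
    explicit OpenCommand(MainModule& m) : main_(m) {}
    std::string execute(const std::string& arg) override {
        return main_.openDocument(arg);  // forwards for now
    }
private:
    MainModule& main_;
};

class CloseCommand : public Command {
public:
    explicit CloseCommand(MainModule& m) : main_(m) {}
    std::string execute(const std::string& arg) override {
        return main_.closeDocument(arg);
    }
private:
    MainModule& main_;
};

// Callers go through commands instead of calling MainModule directly.
class CommandRegistry {
public:
    void add(const std::string& name, std::shared_ptr<Command> cmd) {
        commands_[name] = cmd;
    }
    std::string run(const std::string& name, const std::string& arg) {
        auto it = commands_.find(name);
        if (it == commands_.end()) return "unknown command";
        return it->second->execute(arg);
    }
private:
    std::map<std::string, std::shared_ptr<Command>> commands_;
};
```

Once all callers use the commands, the body of a single command can be rewritten against a new class structure without any caller noticing.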
Of course, with a single class of 11 KLOC the code in there is probably highly coupled and brittle, but creating individual command classes will help much more than any other proxy/facade strategy.
I don't envy the task, but as time goes on this problem will only get worse if it's not tackled.
Update
I'd suggest that the Command pattern is preferable to a Facade.
Maintaining and organizing a lot of different Command classes is preferable to maintaining one (relatively) monolithic Facade. A single Facade mapped onto an 11 KLOC file would probably need to be broken up into a few different groups itself.
Why bother trying to figure out these facade groups? With the Command pattern you will be able to group and organise these small classes organically, so you have a lot more flexibility.
Of course, both options are better than the single 11 KLOC, and growing, file.

One important piece of advice: do not mix refactoring and bugfixes. What you want is a version of your program that is identical to the previous version, except that the source code is different.
One way could be to start by splitting the smallest function/part into its own file and then including it with a header (thus turning main.cpp into a list of #includes, which sounds like a code smell in itself, though I'm not a C++ guru), but at least it's now split into files.
You could then try to switch all maintenance releases over to the "new" main.cpp or whatever your structure is. Again: no other changes or bugfixes, because tracking those is confusing as hell.
Another thing: As much as you may desire making one big pass at refactoring the whole thing in one go, you might bite off more than you can chew. Maybe just pick one or two "parts", get them into all the releases, then add some more value for your customer (after all, Refactoring does not add direct value so it is a cost that has to be justified) and then pick another one or two parts.
Obviously that requires some discipline in the team to actually use the split files and not just add new stuff to the main.cpp all the time, but again, trying to do one massive refactor may not be the best course of action.

Rofl, this reminds me of my old job. It seems that, before I joined, everything was inside one huge file (also C++). Then they split it up (at completely random points, using includes) into about three (still huge) files. The quality of this software was, as you might expect, horrible. The project totaled about 40k LOC (containing almost no comments but LOTS of duplicate code).
In the end I did a complete rewrite of the project. I started by redoing the worst part of the project from scratch. Of course I had in mind a possible (small) interface between this new part and the rest. Then I inserted this part into the old project. I didn't refactor the old code to create the necessary interface, but just replaced it. Then I took small steps from there, rewriting the old code.
I have to say that this took about half a year and there was no development of the old code base beside bugfixes during that time.
edit:
The size stayed at about 40k LOC, but the new application contained many more features and presumably fewer bugs in its initial version than the 8-year-old software. One reason for the rewrite was also that we needed the new features, and introducing them inside the old code was nearly impossible.
The software was for an embedded system, a label printer.
Another point that I should add is that in theory the project was C++. But it wasn't OO at all, it could have been C. The new version was object oriented.

OK, so for the most part rewriting the API of production code is a bad idea as a start. Two things need to happen.
One, you need to actually have your team decide to do a code freeze on the current production version of this file.
Two, you need to take this production version and create a branch that manages the builds using preprocessing directives to split up the big file. Splitting the compilation using JUST preprocessor directives (#ifdefs, #includes, #endifs) is easier than recoding the API. It's definitely easier for your SLAs and ongoing support.
Here you could simply cut out functions that relate to a particular subsystem within the class and put them in a file say mainloop_foostuff.cpp and include it in mainloop.cpp at the right location.
OR
A more time consuming but robust way would be to devise an internal dependencies structure with double-indirection in how things get included. This will allow you to split things up and still take care of co-dependencies. Note that this approach requires positional coding and therefore should be coupled with appropriate comments.
This approach would include components that get used based on which variant you are compiling.
The basic structure is that your mainclass.cpp will include a new file called MainClassComponents.cpp after a block of statements like the following:
#if VARIANT == 1
# define Uses_Component_1
# define Uses_Component_2
#elif VARIANT == 2
# define Uses_Component_1
# define Uses_Component_3
# define Uses_Component_6
...
#endif
#include "MainClassComponents.cpp"
The primary structure of the MainClassComponents.cpp file would be there to work out dependencies within the sub components like this:
#ifndef _MainClassComponents_cpp
#define _MainClassComponents_cpp
/* dependencies declarations */
#if defined(Uses_Component_1) /* same macro names as the VARIANT block above */
#define _REQUIRES_COMPONENT_1
#define _REQUIRES_COMPONENT_3 /* you also need component 3 for component 1 */
#endif
#if defined(Uses_Component_2)
#define _REQUIRES_COMPONENT_2
#define _REQUIRES_COMPONENT_15 /* you also need component 15 for this component */
#endif
/* later on in the file */
#ifdef _REQUIRES_COMPONENT_1
#include "component_1.cpp"
#endif
#ifdef _REQUIRES_COMPONENT_2
#include "component_2.cpp"
#endif
#ifdef _REQUIRES_COMPONENT_3
#include "component_3.cpp"
#endif
#endif /* _MainClassComponents_cpp */
And now for each component you create a component_xx.cpp file.
Of course I am using numbers here, but you should use names that make sense for your code.
Using preprocessor allows you to split things up without having to worry about API changes which is a nightmare in production.
Once you have production settled you can then actually work on redesign.

Well I understand your pain :) I've been in a few such projects as well and it's not pretty. There is no easy answer for this.
One approach that may work for you is to start adding safe guards in all functions, that is, checking arguments, pre/post-conditions in methods, then eventually adding unit tests all in order to capture the current functionality of the sources. Once you have this you are better equipped to re-factor the code because you will have asserts and errors popping up alerting you if you have forgotten something.
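To make the safeguard idea concrete, here is a minimal sketch. The function joinFields is made up for illustration (it is not from the original code base); the point is only that the asserts write down what the legacy code is believed to assume, so a later refactoring trips them if it changes behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical legacy routine: joins fields into one record line.
// The asserts document what the code is believed to assume today.
std::string joinFields(const std::vector<std::string>& fields, char sep) {
    // Precondition: callers are assumed never to pass an empty field list.
    assert(!fields.empty() && "joinFields: expected at least one field");

    std::string out;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i != 0) out += sep;
        out += fields[i];
    }

    // Postcondition: every field is present, with one separator between each pair.
    std::size_t expected = fields.size() - 1;
    for (const auto& f : fields) expected += f.size();
    assert(out.size() == expected && "joinFields: unexpected output length");
    return out;
}
```

Once checks like these are in place, the same expectations can be moved into unit tests one function at a time.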
Sometimes, though, refactoring may bring more pain than benefit. Then it may be better to leave the original project in a pseudo-maintenance state, start from scratch, and incrementally add the functionality from the beast.

You should not be concerned with reducing the file size, but rather with reducing the class size. It comes down to almost the same thing, but makes you look at the problem from a different angle (as @Brian Rasmussen suggests, your class seems to have too many responsibilities).

What you have is a classic example of a known design antipattern called the Blob. Take some time to read the article I point to here; you may find something useful. Besides, if this project is as big as it looks, you should consider some up-front design to keep it from growing into code that you can't control.

This isn't an answer to the big problem, but a theoretical solution to a specific piece of it:
Figure out where you want to split the big file into subfiles. Put comments in some special format at each of those points.
Write a fairly trivial script that will break the file apart into subfiles at those points. (Perhaps the special comments have embedded filenames that the script can use as instructions for how to split it.) It should preserve the comments as part of the splitting.
Run the script. Delete the original file.
When you need to merge from a branch, first recreate the big file by concatenating the pieces back together, do the merge, and then re-split it.
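A rough sketch of the split and re-join steps, working on in-memory text rather than real files. The marker format ("//@@file " followed by a file name) and the default name "preamble.cpp" are assumptions made up for this example. Each marker line stays inside its piece, so joining the pieces reproduces the original text, provided the big file ends with a newline.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical marker format: a line starting with "//@@file " plus the piece's file name.
const std::string kMarker = "//@@file ";

using Piece = std::pair<std::string, std::string>;  // (file name, contents)

// Splits the big file's text into pieces. Each marker line is kept as the
// first line of its piece, so the split loses no comments.
std::vector<Piece> splitAtMarkers(const std::string& bigFile) {
    std::vector<Piece> pieces;
    std::istringstream in(bigFile);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, kMarker.size(), kMarker) == 0) {
            // The embedded file name tells us where this piece goes.
            pieces.emplace_back(line.substr(kMarker.size()), "");
        } else if (pieces.empty()) {
            // Text before the first marker goes into a default piece (assumed name).
            pieces.emplace_back("preamble.cpp", "");
        }
        pieces.back().second += line + "\n";
    }
    return pieces;
}

// Recreates the big file by concatenating the pieces in their original order.
std::string joinPieces(const std::vector<Piece>& pieces) {
    std::string out;
    for (const auto& p : pieces) out += p.second;
    return out;
}
```

A real script would also write each piece to disk and record the piece order somewhere, since the join step depends on it.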
Also, if you want to preserve the SCC file history, I expect the best way to do that is to tell your source control system that the individual piece files are copies of the original. Then it will preserve the history of the sections that were kept in that file, although of course it will also record that large parts were "deleted".

One way to split it without too much danger would be to take a historic look at all the line changes. Are there certain functions that are more stable than others? Hot spots of change if you will.
If a line hasn't been changed in a few years you can probably move it to another file without too much worry. I'd take a look at the source annotated with the last revision that touched a given line and see if there are any functions you could pull out.

Wow, sounds great. I think explaining to your boss that you need a lot of time to refactor the beast is worth a try. If he doesn't agree, quitting is an option.
Anyway, what I suggest is basically throwing out all the implementation and regrouping it into new modules, let's call those "global services". The "main module" would only forward to those services and ANY new code you write will use them instead of the "main module". This should be feasible in a reasonable amount of time (because it's mostly copy and paste), you don't break existing code and you can do it one maintenance version at a time. And if you still have any time left, you can spend it refactoring all old depending modules to also use the global services.
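As a sketch of what "the main module only forwards" could look like: PricingService and calculatePrice are invented names, and the arithmetic is a placeholder. The old entry point keeps its signature, so existing callers compile unchanged, while new code talks to the service directly.

```cpp
// Hypothetical "global service" that now owns logic lifted out of the main module.
class PricingService {
public:
    int priceInCents(int quantity) const { return quantity * 250; }
};

// The old main module keeps its public interface but no longer holds the logic.
class MainModule {
public:
    explicit MainModule(PricingService& pricing) : pricing_(pricing) {}

    // Old entry point, old signature; it only forwards to the service.
    int calculatePrice(int quantity) const { return pricing_.priceInCents(quantity); }

private:
    PricingService& pricing_;
};
```

Because the forwarding method is a one-liner, moving the implementation is mostly copy and paste, which is what keeps the risk per maintenance version low.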

Do not ever touch this file and the code again!
Treat it like something you are stuck with. Start writing adapters for the functionality encoded there.
Write new code in different units and talk only to adapters which encapsulate the functionality of the monster.
... if even one of the above is not possible, quit the job and get yourself a new one.
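For what it's worth, an adapter in this sense can be very small. In the sketch below, legacyLookupName is a stand-in for some routine inside the monster (its name, signature and behaviour are assumptions); NameLookup is the only code that knows about raw buffers and status codes, and new units talk to NameLookup alone.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Stand-in for a routine inside the monster (hypothetical): C-style buffer and status code.
int legacyLookupName(int id, char* buffer, int bufferSize) {
    if (id < 0 || bufferSize <= 0) return -1;  // non-zero means failure
    std::snprintf(buffer, static_cast<std::size_t>(bufferSize), "item-%d", id);
    return 0;
}

// Adapter: the only place that knows about buffers and status codes.
class NameLookup {
public:
    // Returns true and fills 'name' on success; leaves 'name' untouched on failure.
    bool tryGetName(int id, std::string& name) const {
        char buffer[64];
        if (legacyLookupName(id, buffer, static_cast<int>(sizeof buffer)) != 0) return false;
        name = buffer;
        return true;
    }
};
```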

My sympathies - in my previous job I encountered a similar situation with a file that was several times larger than the one you have to deal with. The solution was:
1. Write code to exhaustively test the function in the program in question. Sounds like you won't already have this in hand...
2. Identify some code that can be abstracted out into a helper/utilities class. Need not be big, just something that is not truly part of your 'main' class.
3. Refactor the code identified in 2. into a separate class.
4. Rerun your tests to ensure nothing got broken.
5. When you have time, goto 2. and repeat as required to make the code manageable.
The classes you build in step 3. iterations will likely grow to absorb more code that is appropriate to their newly-clear function.
I could also add:
0: buy Michael Feathers' book on working with legacy code
Unfortunately this type of work is all too common, but my experience is that there is great value in being able to make working but horrid code incrementally less horrid while keeping it working.

Consider ways to rewrite the entire application in a more sensible way. Maybe rewrite a small section of it as a prototype to see if your idea is feasible.
If you've identified a workable solution, refactor the application accordingly.
If all attempts to produce a more rational architecture fail, then at least you know the solution is probably in redefining the program's functionality.

My 0.05 eurocents:
Re-design the whole mess, split it into subsystems taking into account the technical and business requirements (=many parallel maintenance tracks with potentially different codebase for each, there is obviously a need for high modifiability, etc.).
When splitting into subsystems, analyze the places which have most changed and separate those from the unchanging parts. This should show you the trouble-spots. Separate the most changing parts to their own modules (e.g. dll) in such a way that the module API can be kept intact and you don't need to break BC all the time. This way you can deploy different versions of the module for different maintenance branches, if needed, while having the core unchanged.
The redesign will likely need to be a separate project, trying to do it to a moving target will not work.
As for the source code history, my opinion: forget it for the new code. But keep the history somewhere so you can check it, if needed. I bet you won't need it that much after the beginning.
You most likely need to get management buy-in for this project. You can perhaps argue for it on the grounds of faster development, fewer bugs, easier maintenance and less overall chaos. Something along the lines of "Proactively enable the future-proofness and maintenance viability of our critical software assets" :)
This is how I'd start to tackle the problem at least.

Start by adding comments to it, noting where functions are called from and whether you can move things around. This can get things moving. You really need to assess how brittle the code base is. Then move common bits of functionality together. Small changes at a time.

Another book you may find interesting/helpful is Refactoring.

Something I find useful to do (and I'm doing it now although not at the scale you face), is to extract methods as classes (method object refactoring). The methods that differ across your different versions will become different classes which can be injected into a common base to provide the different behaviour you need.
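A small sketch of that idea, with invented names (LabelFormatter, JobHandler): the step that differs between versions becomes its own class behind an interface, and the common code receives whichever variant it should use.

```cpp
#include <memory>
#include <string>
#include <utility>

// Interface for the step that differs between product versions (hypothetical).
class LabelFormatter {
public:
    virtual ~LabelFormatter() = default;
    virtual std::string format(const std::string& text) const = 0;
};

// Behaviour of one version, extracted from what used to be a method.
class PlainFormatter : public LabelFormatter {
public:
    std::string format(const std::string& text) const override { return text; }
};

// Behaviour of another version.
class BracketFormatter : public LabelFormatter {
public:
    std::string format(const std::string& text) const override { return "[" + text + "]"; }
};

// Common logic shared by all versions; the differing behaviour is injected.
class JobHandler {
public:
    explicit JobHandler(std::unique_ptr<LabelFormatter> formatter)
        : formatter_(std::move(formatter)) {}

    std::string handle(const std::string& job) const {
        return "job:" + formatter_->format(job);
    }

private:
    std::unique_ptr<LabelFormatter> formatter_;
};
```

Each maintenance version then only decides which formatter to construct; the shared JobHandler code stays identical across versions.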

I found this sentence to be the most interesting part of your post:
> The file is used and actively changed in several (> 10) maintenance versions of our product and so it is really hard to refactor it
First, I would recommend that you use a source control system that supports branching for developing these 10+ maintenance versions.
Second, I would create ten branches (one for each of your maintenance versions).
I can feel you cringing already! But either your source control isn't working for your situation because of a lack of features, or it's not being used correctly.
Now to the branch you work on - refactor it as you see fit, safe in the knowledge that you'll not upset the other nine branches of your product.
I would be a bit concerned that you have so much in your main() function.
In any project I write, I use main() only to perform initialization of core objects, like a simulation or application object. Those classes are where the real work should go on.
I would also initialize an application logging object in main for use globally throughout the program.
Finally, in main I also add leak detection code in preprocessor blocks that ensure it's only enabled in DEBUG builds. This is all I would add to main(). Main() should be short!
You say that
> The file basically contains the "main class" (main internal work dispatching and coordination) of our program
It sounds like these two tasks could be split into two separate objects - a co-ordinator and a work dispatcher.
When you split these up, you may mess up your "SCC workflow", but it sounds like adhering stringently to your SCC workflow is causing software maintenance problems. Ditch it now and don't look back, because as soon as you fix it, you'll begin to sleep easy.
If you're not able to make the decision, fight tooth and nail with your manager for it - your application needs to be refactored - and badly by the sounds of it! Don't take no for an answer!

As you've described it, the main issue is diffing pre-split vs post-split, merging in bug fixes etc.. Tool around it. It won't take that long to hardcode a script in Perl, Ruby, etc. to rip out most of the noise from diffing pre-split against a concatenation of post-split. Do whatever's easiest in terms of handling noise:
remove certain lines pre/during concatenation (e.g. include guards)
remove other stuff from the diff output if necessary
You could even make it so whenever there's a checkin, the concatenation runs and you've got something prepared to diff against the single-file versions.

"The file basically contains the "main class" (main internal work dispatching and coordination) of our program, so every time a feature is added, it also affects this file and every time it grows."
If that big SWITCH (which I think there is) becomes the main maintenance problem, you could refactor it to use a dictionary and the Command pattern, moving all the switch logic out of the existing code and into a loader that populates the map, i.e.:
// declaration
std::map<ID, ICommand*> dispatchTable;
...
// populating using some loader
dispatchTable[id] = concreteCommand;
...
// using
dispatchTable[id]->Execute();
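Here is a slightly fuller sketch of the same idea that compiles on its own. ID is assumed to be an int, and LogCommand and CommandDispatcher are invented for the example. One detail worth noting: dispatchTable[id] on a std::map inserts a null entry for an unknown id, so the sketch uses find() and reports unknown ids instead of dereferencing a null pointer.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using ID = int;  // assumption: the snippet above leaves the ID type unspecified

class ICommand {
public:
    virtual ~ICommand() = default;
    virtual void Execute() = 0;
};

// Hypothetical concrete command that just records that it ran.
class LogCommand : public ICommand {
public:
    LogCommand(std::string name, std::vector<std::string>& log)
        : name_(std::move(name)), log_(log) {}
    void Execute() override { log_.push_back(name_); }

private:
    std::string name_;
    std::vector<std::string>& log_;
};

class CommandDispatcher {
public:
    // Called by the loader; replaces one 'case' of the old switch.
    void Register(ID id, std::unique_ptr<ICommand> cmd) { table_[id] = std::move(cmd); }

    // Returns false for unknown ids instead of dereferencing a null pointer.
    bool Dispatch(ID id) {
        auto it = table_.find(id);
        if (it == table_.end()) return false;
        it->second->Execute();
        return true;
    }

private:
    std::map<ID, std::unique_ptr<ICommand>> table_;
};
```

Adding a feature then means registering one more command in the loader rather than editing the dispatching code.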

I think the easiest way to track the history of source when splitting a file would be something like this:
Make copies of the original source code, using whatever history-preserving copy commands your SCM system provides. You'll probably need to submit at this point, but there's no need yet to tell your build system about the new files, so that should be ok.
Delete code from these copies. That should not break the history for the lines you keep.

I think what I would do in this situation is bite the bullet and:
Figure out how I wanted to split the file up (based on the current development version)
Put an administrative lock on the file ("Nobody touch mainmodule.cpp after 5pm Friday!!!")
Spend your long weekend applying that change to the >10 maintenance versions (from oldest to newest), up to and including the current version.
Delete mainmodule.cpp from all supported versions of the software. It's a new Age - there is no more mainmodule.cpp.
Convince Management that you shouldn't be supporting more than one maintenance version of the software (at least without a big $$$ support contract). If each of your customers has their own unique version.... yeeeeeshhhh. I'd be adding compiler directives rather than trying to maintain 10+ forks.
Tracking old changes to the file is simply solved by your first check-in comment saying something like "split from mainmodule.cpp". If you need to go back to something recent, most people will remember the change; if it's 2 years from now, the comment will tell them where to look. Of course, how valuable will it be to go back more than 2 years to look at who changed the code and why?

Do very long methods always need refactoring?

I face a situation where we have many very long methods, 1000 lines or more.
To give you some more detail, we have a list of incoming high-level commands, and each one results in a longer (sometimes huge) list of lower-level commands. There's a factory creating an instance of a class for each incoming command. Each class has a process method, where all the lower-level commands are generated and added in sequence. As I said, these sequences of commands and their parameters quite often cause the process methods to reach thousands of lines.
There are a lot of repetitions. Many command patterns are shared between different commands, but the code is repeated over and over. That leads me to think refactoring would be a very good idea.
On the other hand, the specs we have come in exactly the same form as the current code: a very long list of commands for each incoming one. When I've tried some refactoring, I've started to feel uncomfortable with the specs. I miss the obvious analogy between the specs and the code, and lose time digging into newly created common classes.
So here is the question: in general, do you think such very long methods always need refactoring, or would they be acceptable in a case like this?
(unfortunately refactoring the specs is not an option)
edit:
I have removed every reference to "generate" because it was actually confusing. It's not auto-generated code.
class InCmd001 {
public:
    OutMsg process( InMsg& inMsg ) {
        OutMsg outMsg = OutMsg::Create();
        OutCmd001 outCmd001 = OutCmd001::Create();
        outCmd001.SetA( param.getA() );
        outCmd001.SetB( inMsg.getB() );
        outMsg.addCmd( outCmd001 );
        OutCmd016 outCmd016 = OutCmd016::Create();
        outCmd016.SetF( param.getF() );
        outMsg.addCmd( outCmd016 );
        OutCmd007 outCmd007 = OutCmd007::Create();
        outCmd007.SetR( inMsg.getR() );
        outMsg.addCmd( outCmd007 );
        // ......
        return outMsg;
    }
};
Here is an example of one incoming command class (written by hand in pseudo-C++).

Code never needs refactoring. The code either works, or it doesn't. And if it works, the code doesn't need anything.
The need for refactoring comes from you, the programmer. The person reading, writing, maintaining and extending the code.
If you have trouble understanding the code, it needs to be refactored. If you would be more productive by cleaning up and refactoring the code, it needs to be refactored.
In general, I'd say it's a good idea for your own sake to refactor 1000+ line functions. But you're not doing it because the code needs it. You're doing it because that makes it easier for you to understand the code, test its correctness, and add new functionality.
On the other hand, if the code is automatically generated by another tool, you'll never need to read it or edit it. So what'd be the point in refactoring it?

I understand exactly where you're coming from, and can see exactly why you've structured your code the way it is, but it needs to change.
The uncertainty you feel when you attempt to refactor can be ameliorated by writing unit tests. If you've tests specific to each spec, then the code for each spec can be refactored until you're blue in the face, and you can have confidence in it.
A second option, is it possible to automatically generate your code from a data structure?
If you've a core suite of classes that do the donkey work and edge cases, you can auto-generate the repetitive 1000 line methods as often as you wish.
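To illustrate the generation idea, here is a toy generator. Everything in it is an assumption: the Step table layout is invented, and for brevity each output command gets exactly one setter, which the real specs clearly do not guarantee. It emits the body lines of a process() method in the style of the question's example from a table that mirrors the spec row by row.

```cpp
#include <string>
#include <vector>

// One row of the hypothetical spec table: which output command to emit,
// which setter to call, and where the value comes from ("param" or "inMsg").
struct Step {
    std::string outCmd;  // e.g. "OutCmd001"
    std::string field;   // e.g. "A"
    std::string source;  // e.g. "param"
};

// Emits the body lines of a process() method for the given steps.
std::string generateProcessBody(const std::vector<Step>& steps) {
    std::string code;
    for (const Step& s : steps) {
        const std::string var = "cmd" + s.outCmd;
        code += s.outCmd + " " + var + " = " + s.outCmd + "::Create();\n";
        code += var + ".Set" + s.field + "( " + s.source + ".get" + s.field + "() );\n";
        code += "outMsg.addCmd( " + var + " );\n";
    }
    return code;
}
```

The table keeps the one-to-one correspondence with the spec that the question wants to preserve, while the repetitive C++ becomes an output rather than something maintained by hand.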
However, there are exceptions to every rule.
If the methods are a literal interpretation of the spec (very little additional logic), and the specs change infrequently, and the "common" portions (i.e. bits that happen to be the same right now) of the specs change at different times, and you're not going to be asked to get a 10x performance gain out of the code anytime soon, then (and only then) . . . you may be better off with what you have.
. . . but on the whole, refactor.

Yes, always. 1000 lines is at least 10x longer than any function should ever be, and I'm tempted to say 100x, except that when dealing with input parsing and validation it can become natural to write functions with 20 or so lines.
Edit: Just re-read your question and I'm not clear on one point - are you talking about machine generated code that no-one has to touch? In which case I would leave things as they are.

Refactoring is not the same as writing from scratch. While you should never write code like this, before you refactor it, you need to consider the costs of refactoring in terms of time spent, the associated risks in terms of breaking code that already works, and the net benefits in terms of future time saved. Refactor only if the net benefits outweigh the associated costs and risks.
Sometimes wrapping and rewriting can be a safer and more cost effective solution, even if it appears expensive at first glance.

Long methods need refactoring if they are maintained (and thus need to be understood) by humans.

As a rule of thumb, code for humans first. I don't agree with the common idea that functions need to be short. I think what you need to aim at is when a human reads your code they grok it quickly.
To this effect it's a good idea to simplify things as much as possible--but not more than that. It's a good idea to delegate roughly one task for each function. There is no rule as for what "roughly one task" means: you'll have to use your own judgement for that. But do recognize that a function split into too many other functions itself reduces readability. Think about the human being who reads your function for the first time: they would have to follow one function call after another, constantly context-switching and maintaining a stack in their mind. This is a task for machines, not for humans.
Find the balance.
Here, you see how important naming things is. You will see it is not that easy to choose names for variables and functions, it takes time, but on the other hand it can save a lot of confusion on the human reader's side. Again, find the balance between saving your time and the time of the friendly humans who will follow you.
As for repetition, it's a bad idea. It's something that needs to be fixed, just like a memory leak. It's a ticking bomb.
As others have said before me, changing code can be expensive. You need to do the thinking as for whether it will pay off to spend all this time and effort, facing the risks of change, for a better code. You will possibly lose lots of time and make yourself one headache after another now, in order to possibly save lots of time and headache later.

Take a look at the related question How many lines of code is too many?. There are quite a few tidbits of wisdom throughout the answers there.
To repost a quote (although I'll attempt to comment on it a little more here)... A while back, I read this passage from Ovid's journal:
> I recently wrote some code for Class::Sniff which would detect "long methods" and report them as a code smell. I even wrote a blog post about how I did this (quelle surprise, eh?). That's when Ben Tilly asked an embarrassingly obvious question: how do I know that long methods are a code smell?
> I threw out the usual justifications, but he wouldn't let up. He wanted information and he cited the excellent book Code Complete as a counter-argument. I got down my copy of this book and started reading "How Long Should A Routine Be" (page 175, second edition). The author, Steve McConnell, argues that routines should not be longer than 200 lines. Holy crud! That's waaaaaay to long. If a routine is longer than about 20 or 30 lines, I reckon it's time to break it up.
> Regrettably, McConnell has the cheek to cite six separate studies, all of which found that longer routines were not only not correlated with a greater defect rate, but were also often cheaper to develop and easier to comprehend. As a result, the latest version of Class::Sniff on github now documents that longer routines may not be a code smell after all. Ben was right. I was wrong.
(The rest of the post, on TDD, is worth reading as well.)
Coming from the "shorter methods are better" camp, this gave me a lot to think about.
Previously my large methods were generally limited to "I need inlining here, and the compiler is being uncooperative", or "for one reason or another the giant switch block really does run faster than the dispatch table", or "this stuff is only called exactly in sequence and I really really don't want function call overhead here". All relatively rare cases.
In your situation, though, I'd have a large bias toward not touching things: refactoring carries some inherent risk, and it may currently outweigh the reward. (Disclaimer: I'm slightly paranoid; I'm usually the guy who ends up fixing the crashes.)
Consider spending your efforts on tests, asserts, or documentation that can strengthen the existing code and tilt the risk/reward scale before any attempt to refactor: invariant checks, bound function analysis, and pre/postcondition tests; any other useful concepts from DBC; maybe even a parallel implementation in another language (maybe something message oriented like Erlang would give you a better perspective, given your code sample) or even some sort of formal logical representation of the spec you're trying to follow if you have some time to burn.
Any of these kinds of efforts generally have a few results, even if you don't get to refactor the code: you learn something, you increase your (and your organization's) understanding of and ability to use the code and specifications, you might find a few holes that really do need to be filled now, and you become more confident in your ability to make a change with less chance of disastrous consequences.
As you gain a better understanding of the problem domain, you may find that there are different ways to refactor you hadn't thought of previously.
This isn't to say "thou shalt have a full-coverage test suite, and DBC asserts, and a formal logical spec". It's just that you are in a typically imperfect situation, and diversifying a bit -- looking for novel ways to approach the problems you find (maintainability? fuzzy spec? ease of learning the system?) -- may give you a small bit of forward progress and some increased confidence, after which you can take larger steps.
So think less from the "too many lines is a problem" perspective and more from the "this might be a code smell, what problems is it going to cause for us, and is there anything easy and/or rewarding we can do about it?" perspective.
Leaving it cooking on the backburner for a bit -- coming back and revisiting it as time and coincidence allows (e.g. "I'm working near the code today, maybe I'll wander over and see if I can't document the assumptions a bit better...") may produce good results. Then again, getting royally ticked off and deciding something must be done about the situation is also effective.
Have I managed to be wishy-washy enough here? My point, I think, is that the code smells, the patterns/antipatterns, the best practices, etc -- they're there to serve you. Experiment to get used to them, and then take what makes sense for your current situation, and leave the rest.

I think you first need to "refactor" the specs. If there are repetitions in the spec, it too will become easier to read once it makes use of some "basic building blocks".
Edit: As long as you cannot refactor the specs, I wouldn't change the code.
Coding style guides are all made for easier code maintenance, but in your special case the ease of maintenance is achieved by following the spec.
Some people here asked if the code is generated. In my opinion it does not matter: If the code follows the spec "line by line" it makes no difference if the code is generated or hand-written.

A thousand lines of code is nothing. We have functions that are 6 to 12 thousand lines long. Of course those functions are so big that things literally get lost in there, and no tool can help us look at even high-level abstractions of them. The code is now unfortunately incomprehensible.
My opinion of functions that are that big is that they were not written by brilliant programmers but by incompetent hacks who shouldn't be left anywhere near a computer - but should be fired and left flipping burgers at McDonald's. Such code wreaks havoc by leaving behind features that cannot be added to or improved upon (too bad for the customer). The code is so brittle that it cannot be modified by anyone - even the original authors.
And yes, those methods should be refactored, or thrown away.

Do you ever have to read or maintain the generated code?
If yes, then I'd think some refactoring might be in order.
If no, then the higher-level language is really the language you're working with -- the C++ is just an intermediate representation on the way to the compiler -- and refactoring might not be necessary.

Looks to me that you've implemented a separate language within your application - have you considered going that way?

It has been my understanding that it's recommended that any method over 100 lines of code be refactored.

I think some rules may be a little different in this era when code is most commonly viewed in an IDE. If the code does not contain exploitable repetition, such that there are 1,000 lines which are going to be referenced once each, and which share a significant number of variables in a clear fashion, dividing the code into 100-line routines each of which is called once may not be that much of an improvement over having a well-formatted 1,000-line module which includes #region tags or the equivalent to allow outline-style viewing.
My philosophy is that certain layouts of code generally imply certain things. To my mind, when a piece of code is placed into its own routine, that suggests that the code will be usable in more than one context (exception: callback handlers and the like in languages which don't support anonymous methods). If code segment #1 leaves an object in an obscure state which is only usable by code segment #2, and code segment #2 is only usable on a data object which is left in the state created by #1, then absent some compelling reason to put the segments in different routines, they should appear in the same routine. If a program puts objects through a chain of obscure states extending for many hundreds of lines of code, it might be good to rework the design of the code to subdivide the operation into smaller pieces which have more "natural" pre- and post- conditions, but absent some compelling reason to do so, I would not favor splitting up the code without changing the design.

For further reading, I highly recommend the long, insightful, entertaining, and sometimes bitter discussion of this topic over on the Portland Pattern Repository.

I've seen cases where it is not the case (for example, creating an Excel spreadsheet in .NET often requires a lot of lines of code for the formatting of the sheet), but most of the time, the best thing would indeed be to refactor it.
I personally try to make a function small enough so it all appears on my screen (without affecting the readability of course).

1000 lines? They definitely need to be refactored. Also note that, for example, the default maximum number of executable statements is 30 in Checkstyle, a well-known coding standard checker.

If you refactor, when you refactor, add some comments to explain what the heck it's doing.
If it had comments, it would be much less likely a candidate for refactoring, because it would already be easier to read and follow for someone starting from scratch.

> Then here the question: in general, do you think such very long methods would always need refactoring,
If you ask in general, we will say yes.
> or in a similar case it would be acceptable? (unfortunately refactoring the specs is not an option)
Sometimes they are acceptable, but that is very unusual. I will give you a pair of examples:
There are some 8-bit microcontrollers, the Microchip PIC family, that have only a fixed 8-level stack, so you can't nest more than 8 calls. Care must be taken to avoid "stack overflow", so in this special case having many small (nested) functions is not the best way to go.
Another example is when optimizing code at a very low level, where you have to take into account the cost of jumps and context saving. Use it with care.
EDIT:
Even with generated code, you might need to refactor the way it is generated, for example to save memory or energy, or to make the output human-readable or prettier.

There has been very good general advice; here is a practical recommendation for your sample:
Common patterns can be isolated in plain feeder methods:
void AddSimpleTransform(OutMsg & msg, InMsg const & inMsg,
int rotateBy, int foldBy, int gonkBy = 0)
{
// create & add up to three messages
}
You might even improve that by making this a member of OutMsg, and using a fluent interface, such that you can write
OutMsg msg;
msg.AddSimpleTransform(inMsg, 12, 17)
.Staple("print")
.AddArtificialRust(0.02);
which can be an additional improvement under some circumstances.
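A compilable sketch of that fluent style follows. The InMsg and OutMsg types here are minimal stand-ins, not the question's real classes, and the commands are recorded as plain strings just so the effect is visible. The key point is that each feeder returns *this, which is what makes the chaining work.

```cpp
#include <string>
#include <vector>

// Minimal stand-in for the question's input type (hypothetical; the real one carries more data).
struct InMsg {
    int r = 0;
};

class OutMsg {
public:
    // Feeder for one recurring pattern: appends up to three commands, then
    // returns *this so that further feeders can be chained.
    OutMsg& AddSimpleTransform(const InMsg& inMsg, int rotateBy, int foldBy, int gonkBy = 0) {
        cmds_.push_back("rotate:" + std::to_string(rotateBy) + ":r=" + std::to_string(inMsg.r));
        cmds_.push_back("fold:" + std::to_string(foldBy));
        if (gonkBy != 0) cmds_.push_back("gonk:" + std::to_string(gonkBy));
        return *this;
    }

    OutMsg& Staple(const std::string& what) {
        cmds_.push_back("staple:" + what);
        return *this;
    }

    const std::vector<std::string>& commands() const { return cmds_; }

private:
    std::vector<std::string> cmds_;
};
```

A process method written this way still reads top to bottom in spec order, but each shared pattern is spelled out only once.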

Any advice for a developer given the task of enhancing & refactoring a business critical application?

Recently I inherited a business critical project at work to "enhance". The code has been worked on and passed through many hands over the past five years. Consultants and full-time employees who are no longer with the company have butchered this very delicate and overly sensitive application. Most of us have to deal with legacy code or this type of project... it's part of being a developer... but...
There are zero unit tests and zero system tests. Logic is inter-mingled (and sometimes duplicated for no reason) between stored procedures, views (yes, I said views) and code. Documentation? Yeah, right.
I am scared. Yes, very scared to make even the most minimal "tweak" or refactor. One little mishap, and there would be major income loss and potential legal issues for my employer.
So, any advice? My first thought would be to begin writing assertions/unit tests against the existing code. However, that can only go so far because there is a lot of logic embedded in stored procedures. (I know it's possible to test stored procedures, but historically it's much more difficult compared to unit testing source code logic).
Another or additional approach would be to compare the database state before and after the application has performed a function, make some code changes, then compare the database state again.
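The before/after comparison can be sketched roughly as follows. This assumes you can dump the relevant tables into a key/value snapshot; the Snapshot type, the "table:key" naming and DiffSnapshots are all made up for illustration. The idea would be to record the diff the untouched code produces for an operation, then check that the modified code produces the same diff.

```cpp
#include <map>
#include <string>
#include <vector>

// A snapshot maps a row key (for example "table:primary_key") to a serialized row.
using Snapshot = std::map<std::string, std::string>;

// Returns one line per difference between two snapshots:
// changed/removed rows first (in key order), then added rows.
std::vector<std::string> DiffSnapshots(const Snapshot& before, const Snapshot& after) {
    std::vector<std::string> changes;
    for (const auto& row : before) {
        auto it = after.find(row.first);
        if (it == after.end()) {
            changes.push_back("removed " + row.first);
        } else if (it->second != row.second) {
            changes.push_back("changed " + row.first);
        }
    }
    for (const auto& row : after) {
        if (before.find(row.first) == before.end()) {
            changes.push_back("added " + row.first);
        }
    }
    return changes;
}
```

Comparing the list of changes rather than the whole database keeps the check readable when the tables are large.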

I just rewrote thousands of lines of the most complex subsystem of an enterprise filesystem to make it multi-threaded, so all of this comes from experience. If the rewrite is justified (it is if the rewrite is being done to significantly enhance capabilities, or if the existing code is getting in the way of adding more enhancements), then here are some pointers:
1. First of all, you need to be confident in your own abilities to do this. That comes only if you have enough prior experience with the technologies involved.
2. Communicate, communicate, communicate. Let all stakeholders involved know that this is a mess, that it is risky, that it cannot be done in a hurry, and that it will need to be done piecemeal - attack one area at a time.
3. Understand the system inside out. Document every nuance, trick and hack. Document the overall design. Ask any old-timers about historical reasons for the existence of any code you cannot justify. These are the mines you don't want to step on - you might think those are useless pieces of code and then regret getting rid of them later.
4. Unit test. Run the system through any test suite that already exists; if none exists, first write tests for the existing code.
5. Spew debugging code all over the place during the rewrite - asserts, logging, console prints (you should have the ability to turn them on and off, as well as specify different levels of output, i.e. control verbosity). This is a must in my experience, and helps tremendously during a rewrite.
6. When going through the code, make a list of all things that need to be done - things you need to find out, things you need to write tests for, things you need to ask questions about, notes to remind you how to refactor some piece of code, anything that can affect your rewrite... you cannot afford to forget anything! I do this using Outlook Tasks (just make sure whatever you use is always in front of you - this is the first app I open as soon as I sit down at the desk). If I get interrupted, I write down anything that I have been thinking about and hints about where to continue after coming back to the task.
7. Try to avoid hacks in your rewrite (that's one of the reasons you are rewriting it). Think about the tough problems you encounter. Discuss them with other people and bounce your ideas off them (nothing beats this), and put in clean solutions. Look at all the tasks you put into the todo list - make a 10,000-foot picture of the existing design, then decide what the new rewrite should look like (in terms of modules, sub-modules, how they fit together etc.).
8. Tackle the toughest problems before any other. That'll save you from running into problems you cannot solve near the end of the tunnel, and save you from taking any steps backward. Of course, you need to know what the toughest problems will be - so again, better document everything first during your forays into existing code.

Get a very firm list of requirements.
Make sure you have implicit requirements as well as explicit ones - i.e. what programs it has to work with, and how.
Write all scenarios and use cases for how it is currently being used.
Write a lot of unit tests.
Write a lot of integration tests to test the integration of the program with existing programs it has to work with.
Talk to everyone who uses the program to find out more implicit requirements.
Test, test, test changes before moving into production.
CYA :)

Two things, beyond @Sudhanshu's great list (and, to some extent, disagreeing with his #8):
First, be aware that untested code is buggy code - what you are starting with almost certainly does not work correctly, for any definition of "correct" other than "works just like the unmodified code". That is, be prepared to find unexpected behavior in the system, to ask experts in the system about that behavior, and for them to conclude that it's not working the way it should. Prepare them for it too - warn them that without tests or other documentation, there's no reason to think it works the way they think it's working.
Next: Refactor The Low-Hanging Fruit. Take it easy, take it slow, and be very careful. Notice something easy in the code - duplication, say - and test the hell out of whatever methods contain the duplication, then eliminate it. Lather, rinse, repeat. Don't write tests for everything before making changes, but write tests for whatever you're changing. This way, it stays releasable at every stage and you are continuously adding value, continuously improving the code base.
I said "two things", but I guess I'll add a third: Manage expectations. Let your customer know how scared you are of this task; let them know how bad what they've got is. Let them know how slow progress will be, and let them know you'll keep them informed of that progress (and, of course, do it). Your customer may think s/he's asking for "just a little fix" - and the functionality may indeed change only a little - but that doesn't mean it's not going to be a lot of work and a lot of time. You understand that; your customer needs to, too.

I've had this problem before and I asked around (before the days of Stack Overflow), and this book has always been recommended to me: http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052

Ask yourself this: what are you trying to achieve? What is your mission? How much time do you have? What is the measurement for success? What risks are there? How do you mitigate and deal with them?
Don't touch anything unless you know what it is you're trying to achieve.
The code might be "bad", but what does that mean? The code works, right? So if you rewrite the code so that it does the same thing, you'll have spent a lot of time rewriting something, introducing bugs along the way, just so the code does the same thing. To what end?
The simplest thing you can do is document what the system does. And I don't mean write mind-numbing Word documents no one will ever read. I mean writing tests on key functionality, refactoring the code if necessary to allow such tests to be written.

You said you are scared to touch the code because of legal issues and income loss, and that there is zero documentation. So do you understand the code? The first thing you should do is document it and make sure you understand it before you even think about refactoring. Once you have done that and identified the problem areas, make a list of your refactoring proposals in order of maximum benefit with minimum changes, and attack it incrementally. Refactoring makes additional sense if the expected lifespan of the code is long, new features will be added, or bug fixes are numerous. As for testing the database state - I worked on a project recently where that is exactly what we did, with success.

Is it possible to get a separation of the DB and non-DB parts, so that a DBA can take on the challenge of the stored procedures and databases themselves freeing you up to work on the other parts of the system? This also presumes that there is a DBA who can step up and take that part of the application.
If that isn't possible, then I'd suggest seeing how big the codebase is and whether it is possible to get some assistance so it isn't all on you. While this could be seen as side-stepping responsibility, the point is that things usually shouldn't be in just one person's hands, as people can disappear at times.
Good luck!

Testing approach for algorithms with complex outputs

How do you test the result of a program that is basically a black box? For example, one year ago I had to write a B-tree as homework and I really struggled with testing its correctness. What strategies do you use in such scenarios? Visualization? Robust input-->result sets of testing data? What do you do when it is hard to get such data because the only way to get it is from your properly working program?
EDIT: I think that my question was misunderstood. There was no problem with understanding how a B-tree works. That is trivial. But writing robust tests for validating its proper functionality is not so trivial. I think that this school problem is similar to many practical REAL world scenarios and test cases. And sometimes understanding the domain is quite different from delivering a working and correct program...
EDIT2: And yes, with a B-tree it is possible to validate proper behavior with pen and paper. But this is really dirty and not fun :) It does not work well for problems that require a huge amount of data for their validation...

I'm not sure these answers really capture the problem at hand. A B-tree's input and output aren't any different from those of any other dictionary---but the algorithm performs better, if it's implemented correctly. It's only really got two functions to test (add, and find) so theoretically, "black-box" testing of this single component should be fine. Designing for testability isn't the issue, since no matter how you do it the whole algorithm will be one component.
So the question is: when you have to implement subtle algorithms, the kinds with complicated output that you can't always understand in your head so well, how do you test them? I think there are three different strategies you can use:
1. Black-box test basic functionality. For the B-tree case, this means things like what cwash suggested, and also things like making sure that when you add an item, you can then find it, etc.
2. Test certain invariants that your algorithm should maintain (the B-tree should be balanced, values within nodes should be sorted, etc.)
3. A few small "pencil-and-paper" tests may be necessary: work the algorithm out by hand and check that it matches what your code does. But the big-data tests can all be of type 2. These hand-worked tests can also be brittle, so unless you need to be really sure about your algorithm, you may want to avoid them.
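The invariant-checking strategy (sorted keys, balanced tree) might look something like this. The Node layout is a made-up minimal one, and the check deliberately ignores minimum-occupancy rules, so it is a sketch of the idea and not a complete B-tree validator. It also assumes child pointers are never null.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical minimal node layout; a real implementation will differ.
struct Node {
    std::vector<int> keys;
    std::vector<std::unique_ptr<Node>> children;  // empty for a leaf
};

// Returns the common leaf depth below n, or -1 if an invariant is violated:
//  - keys in a node are in non-decreasing order,
//  - every key lies strictly between the bounds inherited from the parent,
//  - an inner node has exactly keys.size() + 1 children,
//  - all leaves sit at the same depth.
// lo/hi may be null, meaning "no bound on that side".
int CheckSubtree(const Node& n, const int* lo, const int* hi) {
    if (!std::is_sorted(n.keys.begin(), n.keys.end())) return -1;
    for (int k : n.keys) {
        if ((lo && k <= *lo) || (hi && k >= *hi)) return -1;
    }
    if (n.children.empty()) return 0;  // leaf
    if (n.children.size() != n.keys.size() + 1) return -1;

    int depth = -2;  // -2 means "no child inspected yet"
    for (std::size_t i = 0; i < n.children.size(); ++i) {
        const int* childLo = (i == 0) ? lo : &n.keys[i - 1];
        const int* childHi = (i == n.keys.size()) ? hi : &n.keys[i];
        int d = CheckSubtree(*n.children[i], childLo, childHi);
        if (d < 0) return -1;
        if (depth == -2) {
            depth = d;
        } else if (d != depth) {
            return -1;  // leaves at different depths: not balanced
        }
    }
    return depth + 1;
}

bool SatisfiesInvariants(const Node& root) {
    return CheckSubtree(root, nullptr, nullptr) >= 0;
}
```

A check like this can be run after every insert in a large randomized test, which gives big-data coverage without needing hand-computed expected trees.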

If you do not grasp the problem at hand, how can you develop a solution to it? My suggestion would be to understand the domain enough to be able to work out the problem on paper and ensure that your program matches.

Consult with an expert on the subject.
I know if I have a convoluted procedure I'm trying to fix, I have no idea what the output should be after my changes, so I need to consult a fellow developer with more knowledge of the business need, and they are able to verify what I've done is correct.

I would focus on constructing test cases that exercise the functionality of your B-tree algorithm. I haven't looked at it for years, but I'm fairly sure you'll be able to find a documented sequence of steps to insert a set of values in a specific order, then validate that the leaf nodes are as they should be. If you construct your testing along these lines, you should be able to prove your implementation is correct.

The key is to know there is a balance between testing something to death and doing tests that adequately cover what should be covered. Edge cases, e.g. null inputs, or checking that inputs are numeric by testing an alphabet character or a punctuation character, are likely most of the tests you'd need. To complement this, there may be one or two common cases to show the program can handle a non-edge case as well. Covering all valid input in most programs is overkill and would result in an overwhelmingly large number of tests.
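For instance, the edge cases listed above for a numeric check could be exercised against something like the following. IsAllDigits is a hypothetical validator written only for this illustration.

```cpp
#include <cctype>

// Hypothetical validator: true only for a non-empty string of ASCII digits.
bool IsAllDigits(const char* text) {
    if (text == nullptr || *text == '\0') {
        return false;  // null and empty input are rejected
    }
    for (const char* p = text; *p != '\0'; ++p) {
        // Cast to unsigned char first; std::isdigit is undefined for negative values.
        if (!std::isdigit(static_cast<unsigned char>(*p))) {
            return false;
        }
    }
    return true;
}
```

The edge-case tests would then be the null pointer, the empty string, a letter such as "a", a punctuation character such as "!", and a mixed value such as "12a", plus one ordinary case like "42".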

I think the answer to the question you're asking boils down to designing for testability. Often you get a testable design for free when you test-drive the development of the solution. But let's face it, when you're implementing a highly mathematical algorithm, this just doesn't fall out.
To make sure you have a testable design, you need to understand what a seam is. Then you need to know a few rules of thumb, such as avoiding statics, using polymorphism, and properly decomposing problems and separating concerns.
Watch "The Clean Code Talks -- Unit Testing" by Misko Hevery, I think it will help you wrap your head around it.

Try looking at it from a requirements point of view, rather than an implementation point of view. Before you write code, you must understand exactly what you want it to do.
Testing and requirements should be a matching pair. If you're having trouble defining tests, maybe it's because the requirements are not well-defined. That in turn implies that you may have bugs that aren't so much implementation bugs as "lack of clear requirements" bugs. The code writer in that case would be working to a mental list of what he/she thinks the requirements are, but can't be sure, and they're not written down for independent understanding and verification.
I've struggled with software where the requirements weren't clear, because the customer couldn't even tell us what they wanted. But when we delivered to them, they sure could tell us then what they didn't like about it! A big part of software engineering is getting the requirements right before the coding begins. This is true on the high-level (overall product, with requirement input from customer) and also the smaller level (modules, individual functions, where requirements are internally defined by software team or individuals). It is still true to some degree I think for iterative development, although the high-level requirements are more fluid.

@Bystrik Jurina,
I often get involved in projects which involve conversions between disparate data formats. Most answers have focused on testing a B-tree or similar algorithm, but it seems that you're looking for a more general answer.
Most of my work is based on the command line. It may sound like a contradiction, but one of the first tools I use is visualization. I'll write some methods to write out my data structures in a format that's easy to consume. This can (and usually does) include something that's visually clear. But often it also means something that I could easily parse with a smaller test program, or even import into Excel.
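As a small illustration of writing data out in an easy-to-consume format, here is a sketch that dumps records as CSV, which a spreadsheet or a tiny test program can read. The Record fields are invented, and the code assumes the text field contains no commas or quotes, so no CSV escaping is done.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type standing in for whatever the program processes.
struct Record {
    int id;
    std::string kind;
    int amountCents;
};

// Writes a header line followed by one comma-separated line per record.
// Assumption: `kind` never contains commas, quotes or newlines.
std::string ToCsv(const std::vector<Record>& records) {
    std::ostringstream out;
    out << "id,kind,amount_cents\n";
    for (const Record& r : records) {
        out << r.id << ',' << r.kind << ',' << r.amountCents << '\n';
    }
    return out.str();
}
```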
I'll start by focusing on the basic outline, and write a program that does the bare minimum of what I need to accomplish. If it's a multi-step process, this might mean implementing one step at a time and validating the results of each step before moving on. Or writing something that works only in specific cases, and then expanding the set of cases where it's expected to work. At first you can validate that the code works in the limited set of cases, such as for known input data. As the project moves forward, you can start logging warnings for cases you might not have tested, or for unexpected types of input data. This has drawbacks, but is a nice approach when you're dealing with a known set of input data.
Validation techniques can include formal test cases, or informal programs that work to challenge your assumptions. It could mean writing a basic driver program to exercise the "core" routines. A good example would be to add a record to a database, then read it back and compare the original object against the one loaded from the database.
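The "write it, read it back, compare" check could be sketched like this. FakeStore is a hypothetical in-memory stand-in for the database that stores each record as a text row, and Claim is an invented record type; a real test would go through the actual persistence code.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical record type.
struct Claim {
    int id;
    std::string payer;
    int amountCents;
};

bool operator==(const Claim& a, const Claim& b) {
    return a.id == b.id && a.payer == b.payer && a.amountCents == b.amountCents;
}

// In-memory stand-in for the database: each claim becomes "payer|amount".
class FakeStore {
public:
    void Save(const Claim& c) {
        std::ostringstream row;
        row << c.payer << '|' << c.amountCents;
        rows_[c.id] = row.str();
    }

    bool Load(int id, Claim* out) const {
        auto it = rows_.find(id);
        if (it == rows_.end()) return false;
        const std::string& row = it->second;
        // Split on the LAST '|' so a payer name containing '|' still round-trips.
        std::size_t bar = row.rfind('|');
        if (bar == std::string::npos) return false;
        out->id = id;
        out->payer = row.substr(0, bar);
        out->amountCents = std::stoi(row.substr(bar + 1));
        return true;
    }

private:
    std::map<int, std::string> rows_;
};

// Saves the record, loads it back, and compares it against the original.
bool RoundTrips(FakeStore& store, const Claim& original) {
    store.Save(original);
    Claim loaded{};
    return store.Load(original.id, &loaded) && loaded == original;
}
```

Feeding awkward values through RoundTrips (a payer name containing the separator, a negative amount) is a cheap way to challenge assumptions about the storage format.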
If you have trouble wrapping your head around the way a program functions, think about what it needs to accomplish. It might be easier to write code that tests the way different inputs produce different outputs. Producing visualizations is a good help, because the act of deciding how to display the data can make you think about different conditions and focus in on the most critical parts of your data structures.
Often I've found that building a visualization brings me to admit that the way the data is being stored just isn't very clear. For a B-tree, the representation isn't very flexible. But for other cases, you may be using parallel arrays when a nested tree of objects would be more natural.
