Do very long methods always need refactoring? - c++

I face a situation where we have many very long methods, 1000 lines or more.
To give you some more detail: we have a list of incoming high-level commands, and each one results in a longer (sometimes huge) list of lower-level commands. There's a factory creating an instance of a class for each incoming command. Each class has a process method, where all the lower-level commands are added in sequence. As I said, these sequences of commands and their parameters quite often push the process methods to thousands of lines.
There are a lot of repetitions. Many command patterns are shared between different commands, but the code is repeated over and over. That leads me to think refactoring would be a very good idea.
On the other hand, the specs we have come in exactly the same form as the current code: a very long list of commands for each incoming one. When I've tried some refactoring, I've started to feel uncomfortable with the specs: I lose the obvious correspondence between the specs and the code, and waste time digging into the newly created common classes.
So here is the question: in general, do you think such very long methods always need refactoring, or would they be acceptable in a case like this?
(unfortunately refactoring the specs is not an option)
edit:
I have removed every reference to "generate" because it was confusing. This is not auto-generated code.
class InCmd001 {
    OutMsg process( InMsg& inMsg ) {
        OutMsg outMsg = OutMsg::Create();

        OutCmd001 outCmd001 = OutCmd001::Create();
        outCmd001.SetA( param.getA() );
        outCmd001.SetB( inMsg.getB() );
        outMsg.addCmd( outCmd001 );

        OutCmd016 outCmd016 = OutCmd016::Create();
        outCmd016.SetF( param.getF() );
        outMsg.addCmd( outCmd016 );

        OutCmd007 outCmd007 = OutCmd007::Create();
        outCmd007.SetR( inMsg.getR() );
        outMsg.addCmd( outCmd007 );

        // ......
        return outMsg;
    }
};
Here is an example of one incoming command class (manually written in pseudo C++).

Code never needs refactoring. The code either works, or it doesn't. And if it works, the code doesn't need anything.
The need for refactoring comes from you, the programmer. The person reading, writing, maintaining and extending the code.
If you have trouble understanding the code, it needs to be refactored. If you would be more productive by cleaning up and refactoring the code, it needs to be refactored.
In general, I'd say it's a good idea for your own sake to refactor 1000+ line functions. But you're not doing it because the code needs it. You're doing it because that makes it easier for you to understand the code, test its correctness, and add new functionality.
On the other hand, if the code is automatically generated by another tool, you'll never need to read it or edit it. So what'd be the point in refactoring it?

I understand exactly where you're coming from, and can see exactly why you've structured your code the way it is, but it needs to change.
The uncertainty you feel when you attempt to refactor can be ameliorated by writing unit tests. If you have tests specific to each spec, then the code for each spec can be refactored until you're blue in the face, and you can have confidence in it.
A second option: is it possible to automatically generate your code from a data structure?
If you have a core suite of classes that do the donkey work and handle the edge cases, you can auto-generate the repetitive 1000-line methods as often as you wish.
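For illustration, here is a minimal sketch of that data-driven idea, reusing the hypothetical OutMsg/OutCmd/InMsg types from the question (the table contents and helper names are made up, not the actual API):

#include <functional>
#include <vector>

using CmdBuilder = std::function<void(OutMsg&, InMsg&)>;

// The "spec" for InCmd001 expressed as data: one entry per lower-level command,
// instead of one hand-written block per command in a 1000-line method.
const std::vector<CmdBuilder> inCmd001Spec = {
    [](OutMsg& out, InMsg& in) {
        OutCmd001 cmd = OutCmd001::Create();
        cmd.SetB( in.getB() );
        out.addCmd( cmd );
    },
    [](OutMsg& out, InMsg& in) {
        OutCmd007 cmd = OutCmd007::Create();
        cmd.SetR( in.getR() );
        out.addCmd( cmd );
    },
    // ... one entry per line of the spec, possibly generated from a table or file
};

OutMsg process( InMsg& inMsg ) {
    OutMsg outMsg = OutMsg::Create();
    for (const auto& build : inCmd001Spec)   // replay the spec in order
        build(outMsg, inMsg);
    return outMsg;
}

The sequence keeps the same reading order as the spec, while the mechanics of creating and adding commands live in one place.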
However, there are exceptions to every rule.
If the methods are a literal interpretation of the spec (very little additional logic), and the specs change infrequently, and the "common" portions (i.e. bits that happen to be the same right now) of the specs change at different times, and you're not going to be asked to get a 10x performance gain out of the code anytime soon, then (and only then) . . . you may be better off with what you have.
. . . but on the whole, refactor.

Yes, always. 1000 lines is at least 10x longer than any function should ever be, and I'm tempted to say 100x, except that when dealing with input parsing and validation it can become natural to write functions with 20 or so lines.
Edit: Just re-read your question and I'm not clear on one point - are you talking about machine generated code that no-one has to touch? In which case I would leave things as they are.

Refactoring is not the same as writing from scratch. While you should never write code like this, before you refactor it you need to consider the costs of refactoring in terms of time spent, the associated risks in terms of breaking code that already works, and the net benefits in terms of future time saved. Refactor only if the net benefits outweigh the associated costs and risks.
Sometimes wrapping and rewriting can be a safer and more cost effective solution, even if it appears expensive at first glance.

Long methods need refactoring if they are maintained (and thus need to be understood) by humans.

As a rule of thumb, code for humans first. I don't agree with the common idea that functions need to be short. I think what you need to aim at is when a human reads your code they grok it quickly.
To this effect it's a good idea to simplify things as much as possible -- but not more than that. It's a good idea to delegate roughly one task to each function. There is no rule as to what "roughly one task" means: you'll have to use your own judgement for that. But do recognize that a function split into too many other functions itself reduces readability. Think about the human being who reads your function for the first time: they would have to follow one function call after another, constantly context-switching and maintaining a stack in their mind. This is a task for machines, not for humans.
Find the balance.
Here you see how important naming is. You will find it is not that easy to choose names for variables and functions; it takes time, but it can save the human reader a lot of confusion. Again, find the balance between saving your own time and the time of the friendly humans who will follow you.
As for repetition, it's a bad idea. It's something that needs to be fixed, just like a memory leak. It's a ticking bomb.
As others have said before me, changing code can be expensive. You need to think about whether it will pay off to spend all this time and effort, and face the risks of change, for better code. You may well lose a lot of time and give yourself one headache after another now, in order to possibly save a lot of time and headache later.

Take a look at the related question How many lines of code is too many?. There are quite a few tidbits of wisdom throughout the answers there.
To repost a quote (although I'll attempt to comment on it a little more here)... A while back, I read this passage from Ovid's journal:
I recently wrote some code for Class::Sniff which would detect "long methods" and report them as a code smell. I even wrote a blog post about how I did this (quelle surprise, eh?). That's when Ben Tilly asked an embarrassingly obvious question: how do I know that long methods are a code smell?

I threw out the usual justifications, but he wouldn't let up. He wanted information and he cited the excellent book Code Complete as a counter-argument. I got down my copy of this book and started reading "How Long Should A Routine Be" (page 175, second edition). The author, Steve McConnell, argues that routines should not be longer than 200 lines. Holy crud! That's waaaaaay too long. If a routine is longer than about 20 or 30 lines, I reckon it's time to break it up.

Regrettably, McConnell has the cheek to cite six separate studies, all of which found that longer routines were not only not correlated with a greater defect rate, but were also often cheaper to develop and easier to comprehend. As a result, the latest version of Class::Sniff on github now documents that longer routines may not be a code smell after all. Ben was right. I was wrong.
(The rest of the post, on TDD, is worth reading as well.)
Coming from the "shorter methods are better" camp, this gave me a lot to think about.
Previously my large methods were generally limited to "I need inlining here, and the compiler is being uncooperative", or "for one reason or another the giant switch block really does run faster than the dispatch table", or "this stuff is only called exactly in sequence and I really really don't want function call overhead here". All relatively rare cases.
In your situation, though, I'd have a large bias toward not touching things: refactoring carries some inherent risk, and it may currently outweigh the reward. (Disclaimer: I'm slightly paranoid; I'm usually the guy who ends up fixing the crashes.)
Consider spending your efforts on tests, asserts, or documentation that can strengthen the existing code and tilt the risk/reward scale before any attempt to refactor: invariant checks, bound function analysis, and pre/postcondition tests; any other useful concepts from DBC; maybe even a parallel implementation in another language (maybe something message oriented like Erlang would give you a better perspective, given your code sample) or even some sort of formal logical representation of the spec you're trying to follow if you have some time to burn.
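As a concrete (and purely illustrative) example of such a cheap check, a postcondition test around the existing process method might look something like this -- cmdCount, containsCmd and OutCmd001::TypeId are hypothetical names, not the question's actual API:

#include <cassert>

// Hypothetical postcondition check: pin down what the spec promises about the
// result of InCmd001::process, without touching the method itself.
void checkInCmd001Postconditions( InMsg& inMsg ) {
    InCmd001 handler;
    OutMsg outMsg = handler.process( inMsg );
    assert( outMsg.cmdCount() >= 3 );                   // at least the three commands shown
    assert( outMsg.containsCmd( OutCmd001::TypeId ) );  // hypothetical accessor
}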
Any of these kinds of efforts generally have a few results, even if you don't get to refactor the code: you learn something, you increase your (and your organization's) understanding of and ability to use the code and specifications, you might find a few holes that really do need to be filled now, and you become more confident in your ability to make a change with less chance of disastrous consequences.
As you gain a better understanding of the problem domain, you may find that there are different ways to refactor you hadn't thought of previously.
This isn't to say "thou shalt have a full-coverage test suite, and DBC asserts, and a formal logical spec". It's just that you are in a typically imperfect situation, and diversifying a bit -- looking for novel ways to approach the problems you find (maintainability? fuzzy spec? ease of learning the system?) -- may give you a small bit of forward progress and some increased confidence, after which you can take larger steps.
So think less from the "too many lines is a problem" perspective and more from the "this might be a code smell; what problems is it going to cause for us, and is there anything easy and/or rewarding we can do about it?" perspective.
Leaving it cooking on the backburner for a bit -- coming back and revisiting it as time and coincidence allows (e.g. "I'm working near the code today, maybe I'll wander over and see if I can't document the assumptions a bit better...") may produce good results. Then again, getting royally ticked off and deciding something must be done about the situation is also effective.
Have I managed to be wishy-washy enough here? My point, I think, is that the code smells, the patterns/antipatterns, the best practices, etc -- they're there to serve you. Experiment to get used to them, and then take what makes sense for your current situation, and leave the rest.

I think you first need to "refactor" the specs. If there are repetitions in the spec, it will also become easier to read if it makes use of some "basic building blocks".
Edit: As long as you cannot refactor the specs, I wouldn't change the code.
Coding style guides are all made for easier code maintenance, but in your special case the ease of maintenance is achieved by following the spec.
Some people here asked if the code is generated. In my opinion it does not matter: If the code follows the spec "line by line" it makes no difference if the code is generated or hand-written.

A thousand lines of code is nothing. We have functions that are 6 to 12 thousand lines long. Of course those functions are so big that things literally get lost in there, and no tool can help us even look at high-level abstractions of them. The code is now, unfortunately, incomprehensible.
My opinion of functions that are that big is that they were not written by brilliant programmers but by incompetent hacks who shouldn't be left anywhere near a computer - but should be fired and left flipping burgers at McDonald's. Such code wreaks havoc by leaving behind features that cannot be added to or improved upon. (Too bad for the customer.) The code is so brittle that it cannot be modified by anyone - even the original authors.
And yes, those methods should be refactored, or thrown away.

Do you ever have to read or maintain the generated code?
If yes, then I'd think some refactoring might be in order.
If no, then the higher-level language is really the language you're working with -- the C++ is just an intermediate representation on the way to the compiler -- and refactoring might not be necessary.

It looks to me like you've implemented a separate language within your application - have you considered making that explicit and going that way?

It has been my understanding that it's recommended that any method over 100 lines of code be refactored.

I think some rules may be a little different in this era, when code is most commonly viewed in an IDE. If the code does not contain exploitable repetition, such that there are 1,000 lines which are going to be referenced once each, and which share a significant number of variables in a clear fashion, dividing the code into 100-line routines each of which is called once may not be that much of an improvement over having a well-formatted 1,000-line module which includes #region tags or the equivalent to allow outline-style viewing.
My philosophy is that certain layouts of code generally imply certain things. To my mind, when a piece of code is placed into its own routine, that suggests that the code will be usable in more than one context (exception: callback handlers and the like in languages which don't support anonymous methods). If code segment #1 leaves an object in an obscure state which is only usable by code segment #2, and code segment #2 is only usable on a data object which is left in the state created by #1, then absent some compelling reason to put the segments in different routines, they should appear in the same routine. If a program puts objects through a chain of obscure states extending for many hundreds of lines of code, it might be good to rework the design of the code to subdivide the operation into smaller pieces which have more "natural" pre- and post- conditions, but absent some compelling reason to do so, I would not favor splitting up the code without changing the design.

For further reading, I highly recommend the long, insightful, entertaining, and sometimes bitter discussion of this topic over on the Portland Pattern Repository.

I've seen cases where it is not the case (for example, creating an Excel spreadsheet in .Net often requires a lot of lines of code for the formatting of the sheet), but most of the time the best thing would indeed be to refactor it.
I personally try to make a function small enough so it all appears on my screen (without affecting the readability of course).

1000 lines? They definitely need to be refactored. Also note that, for example, the default maximum number of executable statements is 30 in Checkstyle, a well-known coding standard checker.

If you refactor, when you refactor, add some comments to explain what the heck it's doing.
If it had comments, it would be much less likely a candidate for refactoring, because it would already be easier to read and follow for someone starting from scratch.

Then here the question: in general, do you think such very long methods would always need refactoring,

If you ask in general, we will say yes.

or in a similar case it would be acceptable? (unfortunately refactoring the specs is not an option)

Sometimes it is acceptable, but that is very unusual. I will give you a couple of examples:
Some 8-bit microcontrollers, such as the Microchip PIC, have only a fixed 8-level hardware stack, so you can't nest more than 8 calls and care must be taken to avoid a "stack overflow". In this special case, having many small (nested) functions is not the best way to go.
Another example is when optimizing code at a very low level, where you have to take into account the cost of jumps and of saving context. Use this with care.
EDIT:
Even in generated code, you might need to refactor the way it is generated - for example to save memory or energy, to make the output human readable, for elegance, who knows, etc.

There has been very good general advice already; here is a practical recommendation for your sample:
Common patterns can be isolated in plain feeder methods:
void AddSimpleTransform( OutMsg & msg, InMsg const & inMsg,
                         int rotateBy, int foldBy, int gonkBy = 0 )
{
    // create & add up to three messages
}
You might even improve that by making this a member of OutMsg, and using a fluent interface, such that you can write
OutMsg msg;
msg.AddSimpleTransform( inMsg, 12, 17 )
   .Staple( "print" )
   .AddArtificialRust( 0.02 );
which can be an additional improvement under some circumstances.
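For completeness, a rough sketch of what such fluent members could look like (assuming the types from the question; the helper names follow the example above and the bodies are elided):

class OutMsg {
public:
    // Each helper adds one or more lower-level commands and returns *this,
    // which is what makes the chained call style above possible.
    OutMsg& AddSimpleTransform( InMsg const & inMsg, int rotateBy, int foldBy, int gonkBy = 0 ) {
        // create & add up to three commands, as in the free function above
        return *this;
    }
    OutMsg& Staple( const char* what )         { /* ... */ return *this; }
    OutMsg& AddArtificialRust( double amount ) { /* ... */ return *this; }
    // ... existing OutMsg members
};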

Related

Is it really better to have an unnecessary function call instead of using else?

So I had a discussion with a colleague today. He strongly suggested that I change some code from
if (condition) {
    function->setValue(true);
}
else {
    function->setValue(false);
}
to
function->setValue(false);
if (condition) {
    function->setValue(true);
}
in order to avoid the 'else'. I disagreed, because - while it might improve readability to some degree - in the case of the if-condition being true, we have 1 absolutely unnecessary function call.
What do you guys think?
Meh.
To do this just to avoid the else is silly (at the very least there should be a deeper rationale). There's typically no extra branching cost to it, especially after the optimizer goes through it.
Code compactness can sometimes be a desirable aesthetic, especially if more time is spent skimming and searching through code than reading it line by line. There can be legitimate reasons to favor terser code, but there are always pros and cons. Even then, compactness should not be about cramming logic into fewer lines of code so much as keeping the logic straightforward.
Correctness here might be easier to achieve with one or the other. The point was made in a comment that you might not know the side effects associated with calling setValue(false), though I would suggest that's kind of moot. Functions should have minimal side effects, they should all be documented at the interface/usage level if they aren't totally obvious, and if we don't know exactly what they are, we should be spending more time looking up their documentation prior to calling them (and their side effects should not be changing once firm dependencies are established to them).
Given that, it may sometimes be easier to achieve correctness and maintain it with a solution that starts out initializing states to some default value, and using a form of code that opts in to overwrite it in specific branches of code. From that standpoint, what your colleague suggested may be valid as a way to avoid tripping over that code in the future. Then again, for a simple if/else pair of branches, it's hardly a big deal.
Don't worry about the cost of the extra most-likely-constant-time function call either way in this kind of knee-deep micro-level implementation case, especially with no super tight performance-critical loop around this code (and even then, still prefer to worry about that at least a little bit in hindsight after profiling).
I think there are far better things to think about than this kind of coding style, like testing procedure. Reliable code tends to need less revisiting, and has the freedom to be written in a wider variety of ways without causing disputes. Testing is what establishes reliability. The biggest disputes about coding style tend to follow teams where there's more toe-stepping and more debugging of the same bodies of code over and over and over from disparate people due to lack of reliability, modularity, excessive coupling, etc. It's a symptom of a problem but not necessarily the root cause.

Red, green, refactor - why refactor?

I am trying to learn TDD and unit testing concepts and I have seen the mantra: "red, green, refactor." I am curious about why should you refactor your code after the tests pass?
This makes no sense to me, because if the tests pass, then why are you messing with the code? I also see TDD mantras like "only write enough code to make the test pass."
The only reason I could come up with, is if to make the test pass with green, you just sloppily write any old code. You just hack together a solution to get a passing test. Then obviously the code is a mess, so you can clean it up.
EDIT:
I found this link on another stackoverflow post which I think confirms the only reason I came up with, that the original code to 'pass' the test can be very simple, even hardcoded: http://blog.extracheese.org/2009/11/how_i_started_tdd.html
Usually the first working version of the code - even if not a mess - still can be improved. So you improve it, making it cleaner, more readable, removing duplication, finding better variable/method names etc. This is refactoring. And since you have the tests, you can refactor safely, because the tests will show if you have inadvertently broken something.
Note that usually you are not writing code from scratch, but modifying/extending existing code to add/change functionality. And the existing code may not be ready to accommodate the new functionality seamlessly. So the first implementation of the new functionality may look awkward or inconvenient, or you may see that it is difficult to extend further. So you improve the design to incorporate all existing functionality in the simplest, cleanest possible way while still passing all the tests.
Your question is a rehash of the age old "if it works, don't fix it". However, as Martin Fowler explains in Refactoring, code can be broken in many different ways. Even if it passes all the tests, it can be hard to understand, thus hard to extend and maintain. Moreover, if it looks sloppy, future programmers will take even less care to keep it tidy, so it will deteriorate ever quicker, and eventually degrades into a complete unmaintainable mess. To prevent this, we refactor to always keep the code clean and tidy as much as possible. If we (or our predecessors) have already let it become messy, refactoring is a huge effort with no obvious immediate benefit for management and stakeholders; thus they can hardly be convinced to support a large scale refactoring in practice. Therefore we refactor in small, even trivial steps, after every code change.
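To make the cycle concrete, here is a deliberately trivial, hypothetical illustration (not related to any code above):

#include <cassert>

// Red: write the tests in main() first; they fail to compile because add() doesn't exist.
// Green: write the simplest code that passes -- even a hardcoded `return 5;` would do
//        until a second test forces the general version below.
int add(int a, int b) { return a + b; }

int main() {
    assert(add(2, 3) == 5);
    assert(add(-1, 1) == 0);
    // Refactor: with passing tests in place, rename, extract, or restructure freely;
    // re-running the tests shows immediately whether behavior changed.
}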
I have seen the mantra: "red, green, refactor."
It's not a 'mantra'; it's a routine.
I also see TDD mantras like "only write enough code to make the test pass."
That's a guideline.
Now, to your question:
The only reason I could come up with, is if to make the test pass with green, you just sloppily write any old code. You just hack together a solution to get a passing test. Then obviously the code is a mess, so you can clean it up.
You're almost there. The key is in the 'Design' part of TDD. You're not only coding, you're still designing your solution. That means the exact API might not be set in stone yet, and your tests might not reflect the final design (because it's not done yet). While coding "only enough to pass the test", you will hit issues that might change your mind and guide the design. Only after you have some working code are you able to improve it.
Also, the refactor step involves the whole code, not only what you've just written to pass the last test. As the coding advances, you have more and more complex interactions between all parts of your code, the best time to refactor it is as soon as it's working.
Precisely because of this very early refactoring step, you shouldn't worry about the quality of the first iteration. It's just a proof of concept that helps in the design.
It's hard to see how the OP's skepticism isn't justified. TDD's workflow is rooted in the avoidance of premature design decisions: it imposes a significant cost on, if not outright precludes, 'seat of the pants' coding that could quickly devolve into an ill-advised YAGNI safari.[1]
The mechanism for this deferral of premature design is the 'smallest possible test'/'smallest possible code' workflow, which is designed to stave off the temptation to 'fix' a perceived shortcoming or requirement before it would ordinarily need to be addressed or even encountered; i.e., presumably the shortcoming would (or ought to) be addressed in some future test case mapped directly to an acceptance criterion that in turn captures a particular business objective.
Furthermore, tests in TDD are supposed to a) help clarify design requirements, b) surface problems with a design[2], and c) serve as project assets that capture and document the effort applied to a particular story, so substituting a self-directed refactoring effort for a properly composed test not only precludes any insight the test might provide but also denies management and project planners information on the true cost of implementing a particular feature.[3]
Accordingly, I would suggest that a new test case, purpose built for introducing an additional requirement into the design, is the proper way to address any perceived shortcoming beyond a stylistic change to the current code under test, and the 'Refactor' phase, however well-intentioned, flies in the face of this philosophy, and is in fact an invitation to do the very sort of premature, YAGNI design safaris that TDD is supposed to prevent. I believe that Robert Martin's version of the 3 rules is consistent with this interpretation. [4 - A blatant appeal to authority]
[1] The previously cited http://blog.extracheese.org/2009/11/how_i_started_tdd.html elegantly demonstrates the value of deferring design decisions until the last possible moment. (Although perhaps the Fibonacci sequence is a somewhat artificial example).
[2] See https://blog.thecodewhisperer.com/permalink/how-a-smell-in-the-tests-points-to-a-risk-in-the-design
[3] Adding a "tech" or "spike" story (smell or not) to the backlog would be the appropriate method for ensuring that formal processes are followed and development effort is documented and justified... and if you can't convince the Product Owner to add it, then you shouldn't be throwing time at it.
[4] http://www.butunclebob.com/ArticleS.UncleBob.TheThreeRulesOfTdd
Because you should never refactor non-working code. If you do, then you won't know whether the errors were originally in there or were introduced by your refactoring. If the tests all pass before refactoring and then fail afterwards, you know the change you made broke something.
They don't mean to write any sloppy old code to pass a test. There is a difference between minimal and sloppy. A zen garden is minimal, but not sloppy.
However, the minimal changes you made here and there, might, in retrospect, be better combined into some other procedure that is called by both of them. After getting both tests working separately is the time to refactor. It's easier to refactor than to try and guess an architecture that's going to minimally cover all the test cases.
You make the code behave correctly first, then factor it well. If you do it the other way around you run the risk of making a mess/duplication/code smells while fixing it.
It's usually easier to restructure working code into well factored code than it is to try and design well factored code upfront.
The reason for refactoring working code is for maintenance. You want to remove duplication for reasons such as only having to fix something in one place, and also knowing that when you fix something somewhere you haven't missed the same bug in the similar code elsewhere. You want to rename vars, methods, classes if their meaning has changed from what you originally intended.
Overall, writing working code is non-trivial, and writing well factored code is non-trivial. If you are trying to do both at once you may do neither to your full potential, so giving full attention to one first and then the other is useful.
Iterative, Evolutionary Refactoring is a good approach, but first...
Some things that should not go unsaid...
To build on top of some high-level notes above, you should understand some important concepts from Complex Systems Theory. The key concepts revolve around a system's environmental structure, how a system grows, how it behaves, and how its components interact.
Sensitive Dependence Upon Initial Conditions (Chaos Theory):
A system's behavior will be amplified toward its most influential tendency -- meaning, if you have many Broken Windows which influence how a developer will write the next module or interact with an existing one, then this developer is more likely to break another window. It's even tempting to break a window just because it's the only one not broken.
Entropy:
There are many, many definitions of entropy out there; one that I find fitting for Software Engineering is: the amount of energy in a system which cannot be used for additional work. This is why reusability is crucial. Entropy is found mostly in terms of duplicate logic and comprehensibility. Furthermore, this ties closely back to the Butterfly Effect (Sensitive Dependence Upon Initial Conditions) and Broken Windows -- the more duplicate logic there is, the more copy-paste there is for additional implementations, and the more than 1X work per implementation it takes to maintain it all.
Variable Amplification and Dampening (Emergence Theory and Network Theory):
Breaking away from a bad design is a good implementation, though it seems all hell breaks loose when it happens the first few times. This is why it is sensible to have an Architecture which can support many adaptations. As your system heads toward entropy, you need a way for modules to interact with each other correctly -- this is where Interfaces come in. Your modules cannot interact correctly unless they've agreed to a consistent contract. Without this, you'll see your system immediately start adapting to poor implementations -- and whichever wheel is the squeakiest will get the oil; the other modules will become a headache. So, not only do bad implementations cause more bad implementations, they also create undesirable behavior at the system's scale -- causing your system, at large, to adapt to varying implementations and amplifying entropy at the highest scale. When this happens, all you can do is keep patching and hope that one change will not conflict with these adaptations -- causing emergent, unpredictable bugs.
The key to all this is to envelop your modules into their own, discrete subsystems, and provide a Defined Architecture which can allow them to communicate -- such as a Mediator. This brings a collection of (Decoupled) behavior into a Bottom-Up System which can then focus its complexity into a component designed exactly for it.
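In code, that "defined architecture" can be as small as a mediator interface that every module talks to instead of talking to the other modules directly (a sketch with invented names, not a prescription):

#include <string>

struct Message { std::string topic; std::string payload; };

// The contract every module agrees to: publish through the mediator, never
// call another module directly.
class IMediator {
public:
    virtual ~IMediator() = default;
    virtual void publish(const Message& msg) = 0;
};

class ReportModule {
public:
    explicit ReportModule(IMediator& m) : mediator(m) {}
    void onDataReady() { mediator.publish({"report.ready", "..."}); }  // no direct coupling
private:
    IMediator& mediator;
};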
With this type of architectural approach, you shouldn't have significant pain on the 3rd term of "Red, Green, Refactor". The question is, how can your scrum master measure this in terms of benefit to the user & stakeholders?
You should not take the "only write enough code to make the test pass" mantra too literally.
Remember, your application isn't ready just because all your tests pass. You clearly would like to refactor your code after the tests pass to make sure the code is readable and well architected. The tests are there to help you refactor, so refactoring is a big part of TDD.
First, thanks for taking a look into Test Driven Development. It is an awesome technique that can be applied to many coding situations that can help you develop some great code while also giving you confidence in what the code can and can't do.
If you look at the subtitle on the cover of Martin Fowler's book "Refactoring", it also answers your question - "Improving the Design of Existing Code".
Refactorings are transformations to your code that should not alter the program's behavior.
By refactoring, you can make the program easier to maintain now, and 6 months from now, and it can also make the code easier for the next developer to understand.

Is optimizing a class for a unit test good practice, or is it premature?

I've seen (and searched for) a lot of questions on StackOverflow about premature optimization - word on the street is, it is the root of all evil. :P I confess that I'm often guilty of this; I don't really optimize for speed at the cost of code legibility, but I will rewrite my code in logical manners using datatypes and methods that seem more appropriate for the task (e.g. in Actionscript 3, using a typed Vector instead of an untyped Array for iteration) and if I can make my code more elegant, I will do so. This generally helps me understand my code, and I generally know why I'm making these changes.
At any rate, I was thinking today - in OOP, we promote encapsulation, attempting to hide the implementation and promote the interface, so that classes are loosely coupled. The idea is to make something that works without having to know what's going on internally - the black box idea.
As such, here's my question - is it wise to attempt to do deep optimization of code at the class level, since OOP promotes modularity? Or does this fall into the category of premature optimization? I'm thinking that, if you use a language that easily supports unit testing, you can test, benchmark, and optimize the class because it in itself is a module that takes input and generates output. But, as a single guy writing code, I don't know if it's wiser to wait until a project is fully finished to begin optimization.
For reference: I've never worked in a team before, so something that's obvious to developers who have this experience might be foreign to me.
Hope this question is appropriate for StackOverflow - I didn't find another one that directly answered my query.
Thanks!
Edit: Thinking about the question, I realize that "profiling" may have been the correct term instead of "unit test"; unit-testing checks that the module works as it should, while profiling checks performance. Additionally, a part of the question I should have asked before - does profiling individual modules after you've created them not reduce time profiling after the application is complete?
My question stems from the game development I'm trying to do - I have to create modules, such as a graphics engine, that should perform optimally (whether they will is a different story :D ). In an application where performance was less important, I probably wouldn't worry about this.
I don't really optimize for speed at the cost of code legibility, but I will rewrite my code in logical manners using datatypes and methods that seem more appropriate for the task [...] and if I can make my code more elegant, I will do so. This generally helps me understand my code
This isn't really optimization, rather refactoring for cleaner code and better design*. As such, it is a Good Thing, and it should indeed be practiced continuously, in small increments. Uncle Bob Martin (in his book Clean Code) popularized the Boy Scout Rule, adapted to software development: Leave the code cleaner than you found it.
So to answer your title question rephrased, yes, refactoring code to make it unit testable is a good practice. One "extreme" of this is Test Driven Development, where one writes the test first, then adds the code which makes the test pass. This way the code is created unit testable from the very beginning.
*Not to be nitpicky, just it is useful to clarify common terminology and make sure that we use the same terms in the same meaning.
True, optimization I believe should be left as a final task (although it's good to be cognizant of where you might need to go back and optimize while writing your first draft). That's not to say that you shouldn't refactor things iteratively in order to maintain order and cleanliness in the code. It is to say that if something currently serves the purpose and isn't botching a requirement of the application, then the requirements should be addressed first, as ultimately they are what you're responsible for delivering (unless the requirements include specifics on maximum request times or something along those lines). I agree with Korin's methodology as well: build for function first; if time permits, optimize to your heart's content (or to the theoretical limit, whichever comes first).
The reason that premature optimization is a bad thing is this: it can take a lot of time and you don't know in advance where the best use of your time is likely to be.
For example, you could spend a lot of time optimizing a class, only to find that the bottleneck in your application is network latency or similar factor that is far more expensive in terms of execution time. Because at the outset you don't have a complete picture, premature optimization leads to a less than optimal use of your time. In this case, you'd probably have preferred to fix the latency issue than optimize class code.
I strongly believe that you should never reduce your code readability and good design because of performance optimizations.
If you are writing code where performance is critical it may be OK to lower the style and clarity of your code, but this does not hold true for the average enterprise application. Hardware evolves quickly and gets cheaper everyday. In the end you are writing code that is going to be read by other developers, so you'd better do a good job at it!
It's always beautiful to read code that has been carefully crafted, where every path has a test that helps you understand how it should be used. I don't really care if it is 50 ms slower than the spaghetti alternative which does lots of crazy stuff.
Yes you should skip optimizing for the unit test. Optimization when required usually makes the code more complex. Aim for simplicity. If you optimize for the unit test you may actually de-optimize for production.
If performance is really bad in the unit test, you may need to look at your design. Test in the application to see if performance is equally bad before optimizing.
EDIT: De-optimization is likely to occur when the data being handled varies in size. This is most likely with classes that work with sets of data. One approach may scale linearly but start out slow, compared to another that scales geometrically but starts out fast. If the unit test uses a small set of data, then the geometric solution may be chosen for the unit test. When production hits the class with a large set of data, performance tanks.
Sorting algorithms are a classic case for this kind of behavior and resulting de-optimizations. Many other algorithms have similar characteristics.
EDIT2: My most successful optimization was the sort routine for a report where data was stored on disk in a memory-mapped file. The sort times were reasonable with moderate data sizes which did not require disk access. With larger data sets it could take days to process the data. Initial timings of the report showed: data selection, 3 minutes; data sorting, 3 days; and reporting, 3 minutes. Investigation of the sort showed that it was a fully unoptimized bubble sort (n-1 full passes for a data set of size n), roughly O(n²). Changing the sorting algorithm reduced the sort time for this report to 3 minutes. I would not have expected a unit test to cover this case, and the original code was as simple (fast) as you could get for small sets. The replacement was much more complex and slower for very small sets, but handled large sets faster with a more linear curve, O(n log n). (Note: no optimization was attempted until we had metrics.)
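To put rough, illustrative numbers on that anecdote: for a data set of a million records, an O(n²) sort is on the order of 10^12 comparisons, while an O(n log n) sort needs roughly 2 x 10^7 - tens of thousands of times less work - which is consistent with the days-to-minutes improvement described above. For the few dozen records a typical unit test uses, the difference is invisible, which is exactly why a unit test would never have flagged it.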
In practice, I aim for a ten-fold improvement of a routine which takes at least 50% of the module run-time. Achieving this level of optimization for a routine using 55% of the run-time will save 50% of the total run-time.

Any advice for a developer given the task of enhancing & refactoring a business critical application?

Recently I inherited a business critical project at work to "enhance". The code has been worked on and passed through many hands over the past five years. Consultants and full-time employees who are no longer with the company have butchered this very delicate and overly sensitive application. Most of us have to deal with legacy code or this type of project... its part of being a developer... but...
There are zero unit tests and zero system tests. Logic is inter-mingled (and sometimes duplicated for no reason) between stored procedures, views (yes, I said views) and code. Documentation? Yeah, right.
I am scared. Yes, very scared to make even the most minimal tweak or refactor. One little mishap, and there would be major income loss and potential legal issues for my employer.
So, any advice? My first thought would be to begin writing assertions/unit tests against the existing code. However, that can only go so far because there is a lot of logic embedded in stored procedures. (I know it's possible to test stored procedures, but historically it's much more difficult compared to unit testing source code logic.)
Another or additional approach would be to compare the database state before and after the application has performed a function, make some code changes, then do database state compare.
I just rewrote thousands of lines of the most complex subsystem of an enterprise filesystem to make it multi-threaded, so all of this comes from experience. If the rewrite is justified (it is if the rewrite is being done to significantly enhance capabilities, or if existing code is coming in the way of putting in more enhancements), then here are the pointers:
You need to be confident in your own abilities first of all to do this. That comes only if you have enough prior experience with the technologies involved.
Communicate, communicate, communicate. Let all involved stake-holders know, this is a mess, this is risky, this cannot be done in a hurry, this will need to be done piece-meal - attack one area at a time.
Understand the system inside out. Document every nuance, trick and hack. Document the overall design. Ask any old-timers about historical reasons for the existence of any code you cannot justify. These are the mines you don't want to step on - you might think those are useless pieces of code and then regret later after getting rid of them.
Unit test. Work the system through any test-suite which already exists, otherwise first write the tests for existing code, if they don't exist.
Spew debugging code all over the place during the rewrite - asserts, logging, console prints (you should have the ability to turn them on and off, as well as to specify different levels of output, i.e. control verbosity). This is a must in my experience, and helps tremendously during a rewrite.
When going through the code, make a list of all things that need to be done - things you need to find out, things you need to write tests for, things you need to ask questions about, notes to remind you how to refactor some piece of code, anything that can affect your rewrite... you cannot afford to forget anything! I do this using Outlook Tasks (just make sure whatever you use is always in front of you - this is the first app I open as soon as I sit down on the desk). If I get interrupted, I write down anything that I have been thinking about and hints about where to continue after coming back to the task.
Try avoiding hacks in your rewrite (that's one of the reasons you are rewriting it). Think about tough problems you encounter. Discuss them with other people and bounce off your ideas against them (nothing beats this), and put in clean solutions. Look at all the tasks you put into the todo list - make a 10,000 feet picture of existing design, then decide how the new rewrite would look like (in terms of modules, sub-modules, how they fit together etc.).
Tackle the toughest problems before any other. That'll save you from running into problems you cannot solve near the end of tunnel, and save you from taking any steps backward. Of course, you need to know what the toughest problems will be - so again, better document everything first during your forays into existing code.
Get a very firm list of requirements.
Make sure you have implicit requirements as well as explicit ones - i.e. what programs it has to work with, and how.
Write all scenarios and use cases for how it is currently being used.
Write a lot of unit tests.
Write a lot of integration tests to test the integration of the program with existing programs it has to work with.
Talk to everyone who uses the program to find out more implicit requirements.
Test, test, test changes before moving into production.
CYA :)
Two things, beyond @Sudhanshu's great list (and, to some extent, disagreeing with his #8):
First, be aware that untested code is buggy code - what you are starting with almost certainly does not work correctly, for any definition of "correct" other than "works just like the unmodified code". That is, be prepared to find unexpected behavior in the system, to ask experts in the system about that behavior, and for them to conclude that it's not working the way it should. Prepare them for it, too - warn them that without tests or other documentation, there's no reason to think it works the way they think it's working.
Next: Refactor The Low-Hanging Fruit Take it easy, take it slow, take it very careful. Notice something easy in the code - duplication, say - and test the hell out of whatever methods contain the duplication, then eliminate it. Lather, rinse, repeat. Don't write tests for everything before making changes, but write tests for whatever you're changing. This way, it stays releasable at every stage and you are continuously adding value, continuously improving the code base.
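A characterization test (in the Working Effectively with Legacy Code sense) can be as blunt as pinning down whatever the code does today; everything in this sketch is hypothetical:

#include <cassert>
#include <string>

// Hypothetical existing function: one of the duplicated pieces we want to clean up.
std::string legacyFormatInvoice(int id);   // declared somewhere in the legacy code base

void characterizeInvoiceFormatting() {
    // The expected values are simply whatever the code produced the first time this
    // test was run: they document current behavior, not the (undocumented) spec.
    assert(legacyFormatInvoice(42) == "INV-0042//OPEN");
    assert(legacyFormatInvoice(0)  == "INV-0000//OPEN");
}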
I said "two things", but I guess I'll add a third: Manage expectations. Let your customer know how scared you are of this task; let them know how bad what they've got is. Let them know how slow progress will be, and let them know you'll keep them informed of that progress (and, of course, do it). Your customer may think s/he's asking for "just a little fix" - and the functionality may indeed change only a little - but that doesn't mean it's not going to be a lot of work and a lot of time. You understand that; your customer needs to, too.
I've had this problem before and I've asked around (before the days of stack overflow) and this book has always been recommended to me. http://www.amazon.com/Working-Effectively-Legacy-Michael-Feathers/dp/0131177052
Ask yourself this: what are you trying to achieve? What is your mission? How much time do you have? What is the measurement for success? What risks are there? How do you mitigate and deal with them?
Don't touch anything unless you know what it is you're trying to achieve.
The code might be "bad", but what does that mean? The code works, right? So if you rewrite the code so it does the same thing, you'll have spent a lot of time - and introduced bugs along the way - so that the code does the same thing? To what end?
The simplest thing you can do is document what the system does. And I don't mean write mind-numbing Word documents no one will ever read. I mean writing tests on key functionality, refactoring the code if necessary to allow such tests to be written.
You said you are scared to touch the code because of legal issues, potential income loss, and the total lack of documentation. So do you understand the code? The first thing you should do is document it and make sure you understand it before you even think about refactoring. Once you have done that and identified the problem areas, make a list of your refactoring proposals in order of maximum benefit with minimum changes, and attack it incrementally. Refactoring makes additional sense if: the expected lifespan of the code will be long, new features will be added, and bug fixes are numerous. As for testing the database state - I worked on a project recently where that is exactly what we did, with success.
Is it possible to get a separation of the DB and non-DB parts, so that a DBA can take on the challenge of the stored procedures and databases themselves freeing you up to work on the other parts of the system? This also presumes that there is a DBA who can step up and take that part of the application.
If that isn't possible, then I'd make the suggestion of seeing how big is the codebase and if it is possible to get some assistance so it isn't all on you. While this could be seen as side-stepping responsibility, the point would be that things shouldn't be in just one person's hands usually as they can disappear at times.
Good luck!

Overcoming bad habit of "fixing it later"

When I start writing code from scratch, I have a bad habit of quickly writing everything in one function, the whole time thinking "I'll make it more modular later". Then when later comes along, I have a working product and any attempts to fix it would mean creating functions and having to figure out what I need to pass.
It gets worse because it becomes extremely difficult to redesign classes when your project is almost done. For example, I usually do some planning before I start writing code; then, when my project is done, I realize I could have made the classes more modular and/or I could have used inheritance. Basically, I don't think I do enough planning, and I don't get more than one level of abstraction.
So in the end, I'm stuck with a program with a large main function, one class and a few helper functions. Needless to say, it is not very reusable.
Has anybody had the same problem and have any tips to overcome this? One thing I had in mind was to write the main function in pseudocode (without much detail, but enough to see what objects and functions it needs). Essentially a top-down approach.
Is this a good idea? Any other suggestions?
"First we make our habits, then they make us."
This seems to apply for both good and bad habits. Sounds like a bad one has taken hold of you.
Practice being more modular up front until it's "just the way I do things."
Yes, the solution is easy, although it takes time to get used to it.
Never claim there will be a "later", where you sit down and just do refactoring. Instead, continue adding functionality to your code (or tests) and during this phase perform small, incremental refactorings. The "later" will basically be "always", but hidden in the phase where you are actually doing something new every time.
I find the TDD Red-Green-Refactor discipline works wonders.
My rule of thumb is that anything longer than 20 LoC should be clean. IME every project stands on a few "just-a-proof-of-concept"s that were never intended to end up in production code. Since this seems inevitable though, even 20 lines of proof-of-concept code should be clear, because they might end up being one of the foundations of a big project.
My approach is top-down. I write
while ( obj = get_next_obj(data) ) {
    wibble(obj);
    fumble(obj);
    process( filter(obj) );
}
and only start to write all these functions later. (Usually they are inline and go into the unnamed namespace. Sometimes they turn out to be one-liners and then I might eliminate them later.)
This way I also avoid having to comment the algorithms: the function names are explanation enough.
You pretty much identified the issue. Not having enough planning.
Spend some time analyzing the solution you're going to develop, break it down into pieces of functionality, identify how it would be best to implement them and try to separate the layers of the application (UI, business logic, data access layer, etc).
Think in terms of OOP and refactor as early as it makes sense. It's a lot cheaper than doing it after everything is built.
Write the main function minimally, with almost nothing in it. In most GUI programs, SDL games, OpenGL programs, or anything with any kind of user interface at all, the main function should be nothing more than an event-eating loop. It has to be, or there will always be long stretches of time where the application seems unresponsive, and the operating system considers shutting it down because it's not responding to messages.
Once you get your main loop, quickly lock it down, to be modified only for bug fixes, not new functionality. This may just end up displacing the problem to another function, but having a monolithic function is rather difficult to do in an event-based application anyway. You'll always need a million little event handlers.
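As an illustration only (every name here is invented, not tied to any particular framework), a stripped-down event-eating main can be as simple as a queue of handlers:

#include <functional>
#include <queue>

int main() {
    std::queue<std::function<void()>> events;   // filled by timers, input, the window system...
    bool running = true;

    // Something else would normally queue the work; we fake two events for the sketch.
    events.push([]  { /* handle a click, a timer tick, ... */ });
    events.push([&] { running = false; });      // the "quit" event

    // main() is just the event-eating loop: pop, dispatch, repeat.
    while (running && !events.empty()) {
        auto handler = std::move(events.front());
        events.pop();
        handler();                               // all real work lives in handlers
    }
    return 0;
}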
Maybe you have a monolithic class. I've done that. Mainly the way to deal with it is to try and keep a mental or physical map of dependencies, and note where there are... let's say, perforations: fissures where a group of functions doesn't depend on any state or variables shared with other functions in the class. There you can spin that cluster of functions off into a new class. If it's really a huge class, and really tangled up, I'd call that a code smell. Think about redesigning such a thing to be less huge and interdependent.
Another thing you can do is as you're coding, note that when a function grows to a size where it no longer fits on a single screen, it's probably too big, and at that point start thinking about how to break it down into multiple smaller functions.
Refactoring is a lot less scary if you have good tools to do it. I see you tagged your question as "C++" but the same goes for any language. Get an IDE where extracting and renaming methods, extracting variables, etc. is easy to do, and then learn how to use that IDE effectively. Then the "small, incremental refactorings" that Stefano Borini mentions will be less daunting.
Your approach isn't necessarily bad -- earlier more modular design might end up as over-engineering.
You do need to refactor -- this is a fact of life. The question is when? Too late, and the refactoring is too big a task and too risk-prone. Too early, and it might be over-engineering. And, as time goes on, you will need to refactor again .. and again. This is just part of the natural life-cycle of software.
The trick is to refactor soon, but not too soon. And frequently, but not too frequently. How soon and how frequently? That's why it's an art and not a science :)