For good unit testing you should have a lot of classes.
This will mean you also have a lot of test classes, but you shouldn't actually have to spend more time maintaining them, as they will be simpler and fewer of them will be hit per change compared to a smaller number of classes with more functionality per class.
Is there any real performance hit to a running application that has a lot of classes (probably with DI) compared to one with fewer, fatter classes?
EDIT
Let me clarify.
I'm not talking about the algorithms; I was wondering whether, in general, a lot of classes is ever a performance hit.
I'm thinking of one public method per class, with almost all private methods extracted into classes of their own.
This is very extreme, but hypothetically, what would the problem be?
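To make the hypothetical concrete, here is a minimal C++ sketch of that "one public method per class" extreme, with the former helper injected through the constructor. All names are invented for illustration; this is not code from any real project.

#include <string>

// Formerly a private helper method; now a class of its own.
class NameFormatter {
public:
    std::string format(const std::string& first, const std::string& last) const {
        return last + ", " + first;
    }
};

// The original class keeps one public method and receives
// its former helper as an injected dependency.
class GreetingService {
public:
    explicit GreetingService(const NameFormatter& formatter)
        : formatter_(formatter) {}

    std::string greet(const std::string& first, const std::string& last) const {
        return "Hello, " + formatter_.format(first, last);
    }

private:
    const NameFormatter& formatter_;
};

int main() {
    NameFormatter formatter;
    GreetingService service(formatter);
    return service.greet("Ada", "Lovelace").empty() ? 1 : 0;
}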
What you feel right now is called "premature optimization": the fear of, and urge to fix, a problem that might never ever happen.
Loading classes in general isn't slow because all the related code has been optimized to death since everyone uses it.
Also if class loading takes a few seconds per test run (= the whole run, not per test), that doesn't matter since your tests usually run much longer (many orders of magnitude more if you do it wrong).
That leaves the last and worst problem: Maintenance. Each line of code written is one that needs to be at least ignored when you maintain the code. Even ignoring code takes effort. So less code is always better.
But there is a lower limit here; you just need a minimum amount of code to encode all the features you need. Also, trying to reach the absolute minimum for the required features takes a lot of effort, so most code only gets (somewhat) close.
Unfortunately, that still isn't the whole picture: If you have well written code, then you won't have to look at it - it will just work. So the maintenance cost of well written code is much lower than, say, compact, fast but incomprehensible code.
That means that code is cheaper when it's easy to understand. Which means splitting it up into independent bits will make it cheaper. That will, because of the flaws of OOP, lead to a class explosion.
Unfortunately, our brainpower is limited. This means that if a feature is spread over too many places, it will be harder to grasp the whole picture. This gets worse when the code isn't cut in exactly the right place (and it most often isn't), plus the fact that no one can write perfect code over a long period of time (so the quality varies).
Conclusion: You can have too many or too few classes. There is no way to tell from the outside. It depends on how good you/your team are, how complex your problem is, how big the project is, etc. A good indicator is the number of open bugs in your bug database and experience.
Edit: You're asking if there is ever a performance hit. That's just plain the wrong question. There is always a performance hit, it's just negligible in context, and even when it's not, performance is almost always cheaper to solve using hardware rather than software, even now that clock speeds have plateaued. In those few cases where it is not cheaper to buy faster silicon, classes are not going to be your limiting factor, and the rest of this answer applies.
In my experience, most things that need performance enhancements are bound by things other than unit granularity. Using a faster data storage mechanism or index on your data, or a better in-memory collection architecture, or using a dictionary and loop instead of a nested loop, all make large jumps in performance possible. Once you have these taken care of, if you are still concerned about performance, you are left with a small number of options, most of which require very careful hand-optimization of code in an extremely performant language, which typically means getting someone who is a top-tier coder to do the core of an algorithm.
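To illustrate the "dictionary and loop instead of a nested loop" point, here is a hedged C++ sketch (the data and function names are made up for the example): an O(n*m) nested scan replaced by a hash lookup.

#include <string>
#include <unordered_set>
#include <vector>

// O(n*m): for each order, scan the whole list of active customer ids.
int countActiveSlow(const std::vector<std::string>& orders,
                    const std::vector<std::string>& activeIds) {
    int count = 0;
    for (const auto& order : orders)
        for (const auto& id : activeIds)
            if (order == id) { ++count; break; }
    return count;
}

// O(n + m): build a hash set once; each lookup is then ~O(1).
int countActiveFast(const std::vector<std::string>& orders,
                    const std::vector<std::string>& activeIds) {
    std::unordered_set<std::string> active(activeIds.begin(), activeIds.end());
    int count = 0;
    for (const auto& order : orders)
        if (active.count(order)) ++count;
    return count;
}

int main() {
    std::vector<std::string> orders = {"a1", "b2", "a1"};
    std::vector<std::string> active = {"a1"};
    return countActiveSlow(orders, active) == countActiveFast(orders, active) ? 0 : 1;
}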
There is also the simple principle to remember: Beware Premature Optimization.
Is there any real performance hit to a running application that has a lot of classes (probably with DI) compared to one with fewer, fatter classes?
High performance comes from implementing efficient algorithms. Efficient algorithms are often harder to code, and thus require better testing. Using many classes to ensure good testing makes it easier to code efficient algorithms, and thus provides better performance.
I've seen (and searched for) a lot of questions on StackOverflow about premature optimization - word on the street is, it is the root of all evil. :P I confess that I'm often guilty of this; I don't really optimize for speed at the cost of code legibility, but I will rewrite my code in logical manners using datatypes and methods that seem more appropriate for the task (e.g. in Actionscript 3, using a typed Vector instead of an untyped Array for iteration) and if I can make my code more elegant, I will do so. This generally helps me understand my code, and I generally know why I'm making these changes.
At any rate, I was thinking today - in OOP, we promote encapsulation, attempting to hide the implementation and promote the interface, so that classes are loosely coupled. The idea is to make something that works without having to know what's going on internally - the black box idea.
As such, here's my question - is it wise to attempt to do deep optimization of code at the class level, since OOP promotes modularity? Or does this fall into the category of premature optimization? I'm thinking that, if you use a language that easily supports unit testing, you can test, benchmark, and optimize the class because it in itself is a module that takes input and generates output. But, as a single guy writing code, I don't know if it's wiser to wait until a project is fully finished to begin optimization.
For reference: I've never worked in a team before, so something that's obvious to developers who have this experience might be foreign to me.
Hope this question is appropriate for StackOverflow - I didn't find another one that directly answered my query.
Thanks!
Edit: Thinking about the question, I realize that "profiling" may have been the correct term instead of "unit test"; unit testing checks that the module works as it should, while profiling checks performance. Additionally, a part of the question I should have asked before: doesn't profiling individual modules as you create them reduce the time spent profiling after the application is complete?
My question stems from the game development I'm trying to do - I have to create modules, such as a graphics engine, that should perform optimally (whether they will is a different story :D ). In an application where performance was less important, I probably wouldn't worry about this.
I don't really optimize for speed at the cost of code legibility, but I will rewrite my code in logical manners using datatypes and methods that seem more appropriate for the task [...] and if I can make my code more elegant, I will do so. This generally helps me understand my code
This isn't really optimization, rather refactoring for cleaner code and better design*. As such, it is a Good Thing, and it should indeed be practiced continuously, in small increments. Uncle Bob Martin (in his book Clean Code) popularized the Boy Scout Rule, adapted to software development: Leave the code cleaner than you found it.
So to answer your title question rephrased, yes, refactoring code to make it unit testable is a good practice. One "extreme" of this is Test Driven Development, where one writes the test first, then adds the code which makes the test pass. This way the code is created unit testable from the very beginning.
*Not to be nitpicky; it is just useful to clarify common terminology and make sure that we use the same terms with the same meaning.
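As a small illustration of the test-first rhythm mentioned above, here is a hedged C++ sketch using plain asserts instead of a test framework; the Stack class is invented for the example. The test in main() is written first and drives the interface; the class is then added to make it pass.

#include <cassert>

// Step 2: the minimal implementation that makes the test pass.
// (No bounds checking; this is a sketch, not production code.)
class Stack {
public:
    bool empty() const { return size_ == 0; }
    void push(int v)   { data_[size_++] = v; }
    int  pop()         { return data_[--size_]; }
private:
    int data_[16];
    int size_ = 0;
};

// Step 1: written first; it defines what the class must do.
int main() {
    Stack s;
    assert(s.empty());
    s.push(42);
    assert(!s.empty());
    assert(s.pop() == 42);
    assert(s.empty());
}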
True, optimization I believe should be left as a final task (although it's good to be cognizant, while writing your first draft, of where you might need to go back and optimize). That's not to say that you shouldn't refactor iteratively in order to maintain order and cleanliness in the code. It is to say that if something currently serves its purpose and isn't botching a requirement of the application, then the requirements should be addressed first, as ultimately they are what you're responsible for delivering (unless the requirements include specifics on maximum request times or something along those lines). I agree with Korin's methodology as well: build for function first; if time permits, optimize to your heart's content (or to the theoretical limit, whichever comes first).
The reason that premature optimization is a bad thing is this: it can take a lot of time and you don't know in advance where the best use of your time is likely to be.
For example, you could spend a lot of time optimizing a class, only to find that the bottleneck in your application is network latency or a similar factor that is far more expensive in terms of execution time. Because at the outset you don't have a complete picture, premature optimization leads to a less than optimal use of your time. In this case, you'd probably have preferred to fix the latency issue rather than optimize class code.
I strongly believe that you should never reduce your code readability and good design because of performance optimizations.
If you are writing code where performance is critical it may be OK to lower the style and clarity of your code, but this does not hold true for the average enterprise application. Hardware evolves quickly and gets cheaper everyday. In the end you are writing code that is going to be read by other developers, so you'd better do a good job at it!
It's always beautiful to read code that has been carefully crafted, where every path has a test that helps you understand how it should be used. I don't really care if it is 50 ms slower than the spaghetti alternative which does lots of crazy stuff.
Yes you should skip optimizing for the unit test. Optimization when required usually makes the code more complex. Aim for simplicity. If you optimize for the unit test you may actually de-optimize for production.
If performance is really bad in the unit test, you may need to look at your design. Test in the application to see if performance is equally bad before optimizing.
EDIT: De-optimization is likely to occur when the size of the data being handled varies. It is most likely with classes that work on sets of data. One solution's response may be linear but initially slow, compared to another's that is geometric but initially fast. If the unit test uses a small data set, then the geometric solution may be chosen to satisfy the unit test. When production hits the class with a large data set, performance tanks.
Sorting algorithms are a classic case for this kind of behavior and resulting de-optimizations. Many other algorithms have similar characteristics.
EDIT2: My most successful optimization was the sort routine for a report where data was stored on disk in a memory-mapped file. The sort times were reasonable with moderate data sizes which did not require disk access. With larger data sets it could take days to process the data. Initial timings of the report showed: data selection 3 minutes, data sorting 3 days, and reporting 3 minutes. Investigation of the sort showed that it was a fully unoptimized bubble sort (n-1 full passes for a data set of size n), roughly O(n^2). Changing the sorting algorithm reduced the sort time for this report to 3 minutes. I would not have expected a unit test to cover this case, and the original code was as simple (and fast) as you could get for small sets. The replacement was much more complex and slower for very small sets, but handled large sets faster with a more linear curve, O(n log n). (Note: no optimization was attempted until we had metrics.)
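For flavor, a hedged sketch of the shape of that change (not the original code, which sorted records in a memory-mapped file): the O(n^2) bubble sort versus delegating to an O(n log n) library sort.

#include <algorithm>
#include <cstddef>
#include <vector>

// Roughly what the original did: n-1 full passes, O(n^2).
void bubbleSort(std::vector<int>& v) {
    for (std::size_t pass = 1; pass < v.size(); ++pass)
        for (std::size_t i = 0; i + 1 < v.size(); ++i)
            if (v[i] > v[i + 1]) std::swap(v[i], v[i + 1]);
}

// The replacement idea: an O(n log n) sort. More machinery,
// slower for tiny inputs, but the curve stays flat for large n.
void fastSort(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
}

int main() {
    std::vector<int> a = {3, 1, 2}, b = a;
    bubbleSort(a);
    fastSort(b);
    return a == b ? 0 : 1;
}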
In practice, I aim for a ten-fold improvement of a routine which takes at least 50% of the module run-time. Achieving this level of optimization for a routine using 55% of the run-time will save roughly 50% of the total run-time: the new run-time is 0.45 + 0.55/10 = 0.505 of the original.
I am a C++ programmer, and I am looking forward to learning and mastering OO design. I have done a lot of searching, and as we all know there is loads of material (books, tutorials, etc.) on how to achieve a good OO design. Of course, I do understand that a good design is something that can only come with loads of experience, individual talent, brilliance, or in fact even mere luck (exaggeration!).
But surely it all starts with a solid beginning and some strong basics. Can someone help me out by pointing to the right material on how to start this quest of learning design, right from the stage of identifying objects and classes up to the stage of using design patterns?
Having said that, I am a programmer but I have not had any experience in designing. Can you please help me out in this transition from a programmer to a designer?
Any tips, suggestions, or advice will be helpful.
[Edit] Thanks for the links and answers; I need to get myself into them :) As I mentioned before, I am a C++ programmer and I do understand the basic OO concepts as such (inheritance, abstraction, polymorphism), and having written code in C++ I understand a few of the design patterns as well. What I don't understand is the basic thought process with which one should approach a requirement: the nitty-gritty of how to approach it and decide which classes should be made, and how to define or settle on the relationships they should have among themselves. Knowing the concepts (to some extent) but not knowing how to apply them is the problem I seem to have :( Any suggestions about that?
1. (very) Simple, but not simplistic, design (simple enough design, if you prefer): K.I.S.S.
2. Prefer flat hierarchies; avoid deep hierarchies.
3. Separation of concerns is essential.
4. Consider "paradigms" other than OO when it doesn't seem simple or elegant enough.
More generally: D.R.Y. and Y.A.G.N.I. help you achieve 1.
There is no secret. It's sweat, not magic.
It's not about doing one thing right. It's balancing many things that must not go wrong. Sometimes, they work in sync, sometimes, they work against each other. Design is only one group of these aspects. The best design doesn't help if the project fails (e.g. because it never ships).
The first rule I'd put forward is:
1. There are no absolutes
Follows directly from the "many things to balance" point. D.R.Y., Y.A.G.N.I., etc. are guidelines; strictly following them cannot guarantee good design, and if followed to the letter they may make your project fail.
Example: D.R.Y. is one of the most fundamental principles, yet studies show that the complexity of small code snippets increases by a factor of 3 or more when they get isolated, due to pre/post condition checking, error handling, and generalization to multiple related cases. So the principle needs to be weakened to "D.R.Y. (at least, not too much)" - when to and when not to is the hard part.
The second rule is not a very common one:
2. An interface must be simpler than the implementation
Sounds too trivial to be catchy. Yet there's much to say about it:
The premise of OO was to manage program sizes that could not be managed with structured programming anymore. The primary mechanism is to encapsulate complexity: we can hide complexity behind a simpler interface, and then forget about that complexity.
Interface complexity involves the documentation, error handling specifications, performance guarantees (or their absence), etc. This means that e.g. reducing the interface declaration by introducing special cases isn't a reduction in complexity - just a shuffle.
3-N. Here's where I put most of the other mentions, which have already been explained very well.
Separation of Concerns, K.I.S.S, SOLID principle, D.R.Y., roughly in that order.
How to build software according to these guidelines?
The above guidelines help in evaluating a piece of code. Unfortunately, there's no recipe for how to get there. "Experienced" means that you have a good feel for how to structure your software, and some decisions just feel bad. Maybe all the principles are just rationalizations after the fact.
The general path is to break a system down into responsibilities until the individual pieces are manageable.
There are formal processes for that, but these just work around the fact that what makes a good, isolated component is a subjective decision. But in the end, that's what we get paid for.
If you have a rough idea of the whole system, it isn't wrong to start with one of these pieces as a seed, and grow them into a "core". Top-down and bottom-up aren't antipodes.
Practice, practice, practice. Build a small program, make it run, change requirements, get it to run again. The "changing requirements" part you don't need to train a lot, we have customers for that.
Post-project reviews - try to get used to them even for your personal projects. After a project is sealed and done, evaluate what was good and what was bad. Treat the source code as thrown away, i.e. don't see that session as "what should be fixed?"
Conway's Law says that "a system reflects the structure of the organization that built it." That applies to most complex software I've seen, and formal studies seem to confirm it. We can derive a few bits of information from that:
If structure is important, so are the people you work with.
Or maybe structure isn't that important. There is not one right structure (just many wrong ones to avoid).
I'm going to quote Marcus Baker talking about how to achieve good OO design in a forum post here: http://www.sitepoint.com/forums/showpost.php?p=4671598&postcount=24
1) Take one thread of a use case.
2) Implement it any old how.
3) Take another thread.
4) Implement it any old how.
5) Look for commonality.
6) Factor the code so that commonality is collected into functions. Aim for clarity of code. No globals, pass everything.
7) Any block of code that is unclear, group into a function as well.
8) Implement another thread any old how, but use your existing functions if they are instant drop-ins.
9) Once working, factor again to remove duplication. By now you may find you are passing similar lumps of stuff around from function to function. To remove duplication, move these into objects.
10) Implement another thread once your code is perfect.
11) Refactor to avoid duplication until bored.
Now the OO bit...
12) By now some candidate higher roles should be emerging. Group those functions into roles by class.
13) Refactor again with the aim of clarity. No class bigger than a couple of pages of code, no method longer than 5 lines. No inheritance unless the variation is just a few lines of code.
From this point on you can cycle for a bit...
14) Implement a thread of use case any old how.
15) Refactor as above. Refactoring includes renaming objects and classes as their meanings evolve.
16) Repeat until bored.
Now the patterns stuff!
Once you have a couple of dozen classes and quite a bit of functionality up and running, you may notice some classes have very similar code, but no obvious split (no, don't use inheritance). At this point consult the patterns books for options on removing the duplication. Hint: you probably want "Strategy".
The following repeats...
17) Implement another thread of a use case any old how.
18) Refactor methods down to five lines or less, classes down to 2 pages or less (preferably a lot less), look for patterns to remove higher level duplication if it really makes the code cleaner.
19) Repeat until your top level constructors either have lots of parameters, or you find yourself using "new" a lot to create objects inside other objects (this is bad).
Now we need to clean up the dependencies. Any class that does work should not use "new" to create things inside itself. Pass those sub-objects in from outside. Classes which do no mechanical work are allowed to use the "new" operator. They just assemble stuff - we'll call them factories. A factory is a role in itself. A class should have just one role, thus factories should be separate classes.
20) Factor out the factories.
Now we repeat again...
21) Implement another thread of a use case any old how.
22) Refactor methods down to five lines or less, classes down to 2 pages or less (preferably a lot less), look for patterns to remove higher level duplication if it really makes the code cleaner, make sure you use separate classes for factories.
23) Repeat until your top level classes have an excessive number of parameters (say 8+).
You've probably finished by now. If not, look up the dependency injection pattern...
24) Create (only) your top level classes with a dependency injector.
Then you can repeat again...
25) Implement another thread of a use case any old how.
26) Refactor methods down to five lines or less, classes down to 2 pages or less (preferably a lot less), look for patterns to remove higher level duplication if it really makes the code cleaner, make sure you use separate classes for factories, pass the top level dependencies (including the factories) via DI.
27) Repeat.
At any stage in this heuristic you will probably want to take a look at test driven development. At the very least it will stop regressions while you refactor.
Obviously, this is a pretty simple process, and the information contained therein shouldn't be applied to every situation, but I feel like Marcus gets it right, especially with regards to the process one should use to design OO code. After a while, you'll start doing it naturally, it'll just become second nature. But while learning to do so, this is a great set of steps to follow.
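As a hedged illustration of the factory and dependency injection steps above (roughly steps 19-24), here is a minimal C++ sketch; every name below is invented for the example, not taken from Marcus's post.

#include <memory>
#include <string>

// A class that does work: it never calls "new" itself;
// everything it needs is passed in from outside.
class ReportWriter {
public:
    explicit ReportWriter(std::string header) : header_(std::move(header)) {}
    std::string write(const std::string& body) const {
        return header_ + "\n" + body;
    }
private:
    std::string header_;
};

// A factory: its single role is assembly, so "new" lives here.
class ReportWriterFactory {
public:
    std::unique_ptr<ReportWriter> create() const {
        return std::make_unique<ReportWriter>("=== Report ===");
    }
};

int main() {
    // The top level acts as a crude dependency injector:
    // it builds the factory and passes dependencies downward.
    ReportWriterFactory factory;
    auto writer = factory.create();
    return writer->write("body text").empty() ? 1 : 0;
}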
As you said, there is nothing like experience. You can read every existing book on the planet about this, and you'll still not be as good as if you practice.
Understanding the theory is good, but in my humble opinion, there is nothing like experience. I think the best way to learn and understand things completely is to apply them in some project(s).
There you'll face difficulties, you'll learn to solve them, sometimes perhaps with a bad solution : but still you'll learn. And if at any time something bothers you and you can't find how to solve it nicely, we'll be here on SO to help you ! :)
I can recommend the book "Head First Design Patterns" (search Amazon). It is a good starting point before seriously diving into the Gang of Four's bible, and it shows design principles and the most commonly used patterns.
In a nutshell : Code, criticize, look for a well-known solution, implement it, and back to first step till you're (more or less) satisfied.
As often, the answer to this kind of question is: it depends. And in this case, it depends on how you learn things. I'll tell you what works for me, for I face the very problem you describe, but it won't work for everybody and I would not say it's "the perfect answer".
I begin by coding something, not too simple, not too complex. Then I look at the code and think: all right, what is wrong? For that, you can use the first three "SOLID" principles:
Single Responsibility (are all your classes serving a unique purpose?)
Open/Closed (if you want to add a service, your classes can be extended with inheritance, but there is no need to alter the basic functions of your current classes; a small sketch follows this list).
Liskov Substitution (all right, this one I can't explain simply, and I'd advise reading about it).
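Here is a minimal, hedged C++ sketch of the Open/Closed idea; the shape names are invented for the example. Adding a new service means adding a class, not editing the existing ones.

#include <memory>
#include <vector>

// Closed for modification: callers depend only on this interface.
class Shape {
public:
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// Open for extension: a new kind of shape is a new class.
class Square : public Shape {
public:
    explicit Square(double s) : s_(s) {}
    double area() const override { return s_ * s_; }
private:
    double s_;
};

class Circle : public Shape {
public:
    explicit Circle(double r) : r_(r) {}
    double area() const override { return 3.14159265358979 * r_ * r_; }
private:
    double r_;
};

// This function never changes when a new Shape is added.
double totalArea(const std::vector<std::unique_ptr<Shape>>& shapes) {
    double total = 0;
    for (const auto& s : shapes) total += s->area();
    return total;
}

int main() {
    std::vector<std::unique_ptr<Shape>> shapes;
    shapes.push_back(std::make_unique<Square>(2.0));
    shapes.push_back(std::make_unique<Circle>(1.0));
    return totalArea(shapes) > 0 ? 0 : 1;
}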
Don't try to master those and understand everything about them. Just use them as guideline to criticize your code. Think chiefly about "what if I want to do this now ?". "What if I work for a client, and he wants to add this or that ?". If your code is perfectly adaptable to any situation (which is almost impossible), you might have reached a very good design.
If it's not, consider a solution. What would you do? Try to come up with an idea. Then read about design patterns and find one that could answer your problem. See if it matches your idea - often it's the idea you had, but better expressed and developed. Now try to implement it. It's going to take time, you'll often fail, it's frustrating, and that's normal.
Design is about experience, but experience is acquired by criticizing your own code. That's how you'll understand design, not as a cool thing to know, but as the basis for solid code. It's not enough to know "all right, good code has this and that". It's much better to have experienced why, to have failed and seen what was wrong. The trouble with design patterns is that they are very abstract. My method is a way (probably not the only one nor the best) to make them less abstract to you.
No-solo-work. Good designs are seldom created by a single person only. Talk to your colleagues. Discuss your design with others. And learn.
Don't be too smart. A complex hierarchy with 10 levels of inheritance is seldom a good design. Make sure that you can clearly explain how your design works. If you can't explain the basic principles in 5 minutes, your design is probably too complex.
Learn tricks from the masters: Alexandrescu, Meyers, Sutter, GoF.
Prefer extensibility over perfection. Source code written today will be insufficient in 3 years time. If you write your code to be perfect now, but inextensible, you will have a problem later. If you write your code to be extensible (but not perfect), you will still be able to adapt it later.
The core concepts in my mind are:
Encapsulation - keep as much of your object hidden from both the prying eyes and the sticky fingers of the outside world.
Abstraction - hide as much of the inner workings of your object from the simple minds of the code that needs to use it.
All the other concepts, such as inheritance, polymorphism and design patterns, are about incorporating the two concepts above while still having objects that can solve real-world problems.
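A tiny, hedged C++ sketch of both ideas together (the account class is invented for the example): the data is encapsulated, and callers see only an abstract operation, not the bookkeeping behind it.

class Account {
public:
    // Abstraction: callers ask for a deposit; they never learn
    // how the balance is stored or that a count is kept.
    void deposit(long cents) {
        if (cents > 0) {
            balanceCents_ += cents;
            ++depositCount_;
        }
    }
    long balanceCents() const { return balanceCents_; }

private:
    // Encapsulation: no prying eyes or sticky fingers reach these.
    long balanceCents_ = 0;
    int  depositCount_ = 0;
};

int main() {
    Account a;
    a.deposit(150);
    return a.balanceCents() == 150 ? 0 : 1;
}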
I face a situation where we have many very long methods, 1000 lines or more.
To give you some more detail: we have a list of incoming high-level commands, and each one results in a longer (sometimes huge) list of lower-level commands. There's a factory creating an instance of a class for each incoming command. Each class has a process method, where all the lower-level commands are generated and added in sequence. As I said, these sequences of commands and their parameters quite often cause the process methods to reach thousands of lines.
There are a lot of repetitions. Many command patterns are shared between different commands, but the code is repeated over and over. That leads me to think refactoring would be a very good idea.
On the other hand, the specs we receive come in exactly the same form as the current code: a very long list of commands for each incoming one. When I've tried some refactoring, I've started to feel uncomfortable with the specs. I lose the obvious correspondence between the specs and the code, and lose time digging into newly created common classes.
So here is the question: in general, do you think such very long methods always need refactoring, or would it be acceptable in a case like this?
(unfortunately refactoring the specs is not an option)
edit:
I have removed every reference to "generate" because it was confusing. This is not auto-generated code.
class InCmd001 {
    OutMsg process( InMsg& inMsg ) {
        OutMsg outMsg = OutMsg::Create();

        OutCmd001 outCmd001 = OutCmd001::Create();
        outCmd001.SetA( param.getA() );   // 'param' comes from the surrounding command context
        outCmd001.SetB( inMsg.getB() );
        outMsg.addCmd( outCmd001 );

        OutCmd016 outCmd016 = OutCmd016::Create();
        outCmd016.SetF( param.getF() );
        outMsg.addCmd( outCmd016 );

        OutCmd007 outCmd007 = OutCmd007::Create();
        outCmd007.SetR( inMsg.getR() );
        outMsg.addCmd( outCmd007 );

        // ...... (and so on, for thousands of lines)
        return outMsg;
    }
};
Here is an example of one incoming command class (manually written in pseudo-C++).
Code never needs refactoring. The code either works, or it doesn't. And if it works, the code doesn't need anything.
The need for refactoring comes from you, the programmer. The person reading, writing, maintaining and extending the code.
If you have trouble understanding the code, it needs to be refactored. If you would be more productive by cleaning up and refactoring the code, it needs to be refactored.
In general, I'd say it's a good idea for your own sake to refactor 1000+ line functions. But you're not doing it because the code needs it. You're doing it because that makes it easier for you to understand the code, test its correctness, and add new functionality.
On the other hand, if the code is automatically generated by another tool, you'll never need to read it or edit it. So what'd be the point in refactoring it?
I understand exactly where you're coming from, and can see exactly why you've structured your code the way it is, but it needs to change.
The uncertainty you feel when you attempt to refactor can be ameliorated by writing unit tests. If you have tests specific to each spec, then the code for each spec can be refactored until you're blue in the face, and you can have confidence in it.
A second option, is it possible to automatically generate your code from a data structure?
If you've a core suite of classes that do the donkey work and edge cases, you can auto-generate the repetitive 1000 line methods as often as you wish.
However, there are exceptions to every rule.
If the methods are a literal interpretation of the spec (very little additional logic), and the specs change infrequently, and the "common" portions (i.e. bits that happen to be the same right now) of the specs change at different times, and you're not going to be asked to get a 10x performance gain out of the code anytime soon, then (and only then) . . . you may be better off with what you have.
. . . but on the whole, refactor.
Yes, always. 1000 lines is at least 10x longer than any function should ever be, and I'm tempted to say 100x, except that when dealing with input parsing and validation it can become natural to write functions with 20 or so lines.
Edit: Just re-read your question and I'm not clear on one point - are you talking about machine generated code that no-one has to touch? In which case I would leave things as they are.
Refactoring is not the same as rewriting from scratch. While you should never write code like this, before you refactor it you need to weigh the costs of refactoring in terms of time spent, the associated risks of breaking code that already works, and the net benefits in terms of future time saved. Refactor only if the net benefits outweigh the costs and risks.
Sometimes wrapping and rewriting can be a safer and more cost effective solution, even if it appears expensive at first glance.
Long methods need refactoring if they are maintained (and thus need to be understood) by humans.
As a rule of thumb, code for humans first. I don't agree with the common idea that functions need to be short. I think what you need to aim at is when a human reads your code they grok it quickly.
To this effect it's a good idea to simplify things as much as possible--but not more than that. It's a good idea to delegate roughly one task for each function. There is no rule as for what "roughly one task" means: you'll have to use your own judgement for that. But do recognize that a function split into too many other functions itself reduces readability. Think about the human being who reads your function for the first time: they would have to follow one function call after another, constantly context-switching and maintaining a stack in their mind. This is a task for machines, not for humans.
Find the balance.
Here, you see how important naming things is. You will see it is not that easy to choose names for variables and functions, it takes time, but on the other hand it can save a lot of confusion on the human reader's side. Again, find the balance between saving your time and the time of the friendly humans who will follow you.
As for repetition, it's a bad idea. It's something that needs to be fixed, just like a memory leak. It's a ticking bomb.
As others have said before me, changing code can be expensive. You need to do the thinking as for whether it will pay off to spend all this time and effort, facing the risks of change, for a better code. You will possibly lose lots of time and make yourself one headache after another now, in order to possibly save lots of time and headache later.
Take a look at the related question How many lines of code is too many?. There are quite a few tidbits of wisdom throughout the answers there.
To repost a quote (although I'll attempt to comment on it a little more here)... A while back, I read this passage from Ovid's journal:
I recently wrote some code for Class::Sniff which would detect "long methods" and report them as a code smell. I even wrote a blog post about how I did this (quelle surprise, eh?). That's when Ben Tilly asked an embarrassingly obvious question: how do I know that long methods are a code smell?
I threw out the usual justifications, but he wouldn't let up. He wanted information and he cited the excellent book Code Complete as a counter-argument. I got down my copy of this book and started reading "How Long Should A Routine Be" (page 175, second edition). The author, Steve McConnell, argues that routines should not be longer than 200 lines. Holy crud! That's waaaaaay too long. If a routine is longer than about 20 or 30 lines, I reckon it's time to break it up.
Regrettably, McConnell has the cheek to cite six separate studies, all of which found that longer routines were not only not correlated with a greater defect rate, but were also often cheaper to develop and easier to comprehend. As a result, the latest version of Class::Sniff on github now documents that longer routines may not be a code smell after all. Ben was right. I was wrong.
(The rest of the post, on TDD, is worth reading as well.)
Coming from the "shorter methods are better" camp, this gave me a lot to think about.
Previously my large methods were generally limited to "I need inlining here, and the compiler is being uncooperative", or "for one reason or another the giant switch block really does run faster than the dispatch table", or "this stuff is only called exactly in sequence and I really really don't want function call overhead here". All relatively rare cases.
In your situation, though, I'd have a large bias toward not touching things: refactoring carries some inherent risk, and it may currently outweigh the reward. (Disclaimer: I'm slightly paranoid; I'm usually the guy who ends up fixing the crashes.)
Consider spending your efforts on tests, asserts, or documentation that can strengthen the existing code and tilt the risk/reward scale before any attempt to refactor: invariant checks, bound function analysis, and pre/postcondition tests; any other useful concepts from DBC; maybe even a parallel implementation in another language (maybe something message oriented like Erlang would give you a better perspective, given your code sample) or even some sort of formal logical representation of the spec you're trying to follow if you have some time to burn.
Any of these kinds of efforts generally have a few results, even if you don't get to refactor the code: you learn something, you increase your (and your organization's) understanding of and ability to use the code and specifications, you might find a few holes that really do need to be filled now, and you become more confident in your ability to make a change with less chance of disastrous consequences.
As you gain a better understanding of the problem domain, you may find that there are different ways to refactor that you hadn't thought of previously.
This isn't to say "thou shalt have a full-coverage test suite, and DBC asserts, and a formal logical spec". It's just that you are in a typically imperfect situation, and diversifying a bit -- looking for novel ways to approach the problems you find (maintainability? fuzzy spec? ease of learning the system?) -- may give you a small bit of forward progress and some increased confidence, after which you can take larger steps.
So think less from the "too many lines is a problem" perspective and more from the "this might be a code smell, what problems is it going to cause for us, and is there anything easy and/or rewarding we can do about it?"
Leaving it cooking on the backburner for a bit -- coming back and revisiting it as time and coincidence allows (e.g. "I'm working near the code today, maybe I'll wander over and see if I can't document the assumptions a bit better...") may produce good results. Then again, getting royally ticked off and deciding something must be done about the situation is also effective.
Have I managed to be wishy-washy enough here? My point, I think, is that the code smells, the patterns/antipatterns, the best practices, etc -- they're there to serve you. Experiment to get used to them, and then take what makes sense for your current situation, and leave the rest.
I think you first need to "refactor" the specs. If there are repetitions in the spec it also will become easier to read, if it makes use of some "basic building blocks".
Edit: As long as you cannot refactor the specs, I wouldn't change the code.
Coding style guides are all made for easier code maintenance, but in your special case the ease of maintenance is achieved by following the spec.
Some people here asked if the code is generated. In my opinion it does not matter: If the code follows the spec "line by line" it makes no difference if the code is generated or hand-written.
1000 lines of code is nothing. We have functions that are 6 to 12 thousand lines long. Of course those functions are so big that things literally get lost in there, and no tool can help us even look at high-level abstractions of them. The code is now, unfortunately, incomprehensible.
My opinion of functions that big is that they were not written by brilliant programmers but by incompetent hacks who shouldn't be left anywhere near a computer - they should be fired and left flipping burgers at McDonald's. Such code wreaks havoc by leaving behind features that cannot be added to or improved upon (too bad for the customer). The code is so brittle that it cannot be modified by anyone - not even the original authors.
And yes, those methods should be refactored, or thrown away.
Do you ever have to read or maintain the generated code?
If yes, then I'd think some refactoring might be in order.
If no, then the higher-level language is really the language you're working with -- the C++ is just an intermediate representation on the way to the compiler -- and refactoring might not be necessary.
Looks to me that you've implemented a separate language within your application - have you considered going that way?
It has been my understanding that it's recommended that any method over 100 lines of code be refactored.
I think some rules may be a little different in this era, when code is most commonly viewed in an IDE. If the code does not contain exploitable repetition, such that there are 1,000 lines which are going to be referenced once each and which share a significant number of variables in a clear fashion, dividing the code into 100-line routines each of which is called once may not be much of an improvement over a well-formatted 1,000-line module which includes #region tags or the equivalent to allow outline-style viewing.
My philosophy is that certain layouts of code generally imply certain things. To my mind, when a piece of code is placed into its own routine, that suggests that the code will be usable in more than one context (exception: callback handlers and the like in languages which don't support anonymous methods). If code segment #1 leaves an object in an obscure state which is only usable by code segment #2, and code segment #2 is only usable on a data object which is left in the state created by #1, then absent some compelling reason to put the segments in different routines, they should appear in the same routine. If a program puts objects through a chain of obscure states extending for many hundreds of lines of code, it might be good to rework the design of the code to subdivide the operation into smaller pieces which have more "natural" pre- and post- conditions, but absent some compelling reason to do so, I would not favor splitting up the code without changing the design.
For further reading, I highly recommend the long, insightful, entertaining, and sometimes bitter discussion of this topic over on the Portland Pattern Repository.
I've seen cases where it is not the case (for example, creating an Excel spreadsheet in .NET often requires a lot of lines of code for the formatting of the sheet), but most of the time the best thing would indeed be to refactor it.
I personally try to make a function small enough so it all appears on my screen (without affecting the readability of course).
1000 lines? Definitely they need to be refactored. Also note that, for example, the default maximum number of executable statements per method is 30 in Checkstyle, a well-known coding standard checker.
If you refactor, when you refactor, add some comments to explain what the heck it's doing.
If it had comments, it would be much less likely a candidate for refactoring, because it would already be easier to read and follow for someone starting from scratch.
"Then here the question: in general, do you think such very long methods would always need refactoring?"
If you ask in general, we will say yes.
"or in a similar case would it be acceptable? (unfortunately refactoring the specs is not an option)"
Sometimes it is acceptable, but that is very unusual. I will give you a pair of examples:
There are some 8-bit microcontrollers, the Microchip PIC family, that have only a fixed 8-level call stack, so you can't nest more than 8 calls. Care must be taken to avoid stack overflow, so in this special case having many small, nested functions is not the best way to go.
Another example is when optimizing code at a very low level, where you have to take into account the cost of jumps and context saving. Use this with care.
EDIT:
Even with generated code, you may need to refactor the way it is generated - for example, for memory savings, energy savings, human-readable output, beauty, who knows, etc.
There has been very good general advice already; here is a practical recommendation for your sample:
Common patterns can be isolated into plain feeder methods:
void AddSimpleTransform( OutMsg& msg, InMsg const& inMsg,
                         int rotateBy, int foldBy, int gonkBy = 0 )
{
    // create & add up to three messages
}
You might even improve that by making this a member of OutMsg, and using a fluent interface, such that you can write
OutMsg msg;
msg.AddSimpleTransform( inMsg, 12, 17 )
   .Staple( "print" )
   .AddArtificialRust( 0.02 );
which can be an additional improvement under circumstances.
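For the chaining above to work, each fluent member must return a reference to the object itself. A minimal, hedged sketch of that mechanism (the method names come from the example above; the bodies are left empty on purpose):

class InMsg;  // defined elsewhere in the real code base

class OutMsg {
public:
    // Each call does its work, then returns *this so the calls chain.
    OutMsg& AddSimpleTransform( InMsg const&, int rotateBy,
                                int foldBy, int gonkBy = 0 ) {
        // create & add the corresponding OutCmd objects here
        return *this;
    }
    OutMsg& Staple( const char* )       { /* ... */ return *this; }
    OutMsg& AddArtificialRust( double ) { /* ... */ return *this; }
};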
How do you test the result of a program that is basically a black box? For example, one year ago I had to write a B-tree as homework, and I really struggled with testing its correctness. What strategies do you use in such scenarios? Visualization? Robust input-to-result sets of testing data? What do you do when it is hard to get such data because the only way to get it is your properly working program?
EDIT: I think my question was misunderstood. There was no problem with understanding how a B-tree works. That is trivial. But writing robust tests to validate its proper functionality is not so trivial. I think this school problem is similar to many practical, real-world scenarios and test cases. And sometimes understanding the domain is quite different from delivering a working and correct program...
EDIT2: And yes, with a B-tree it is possible to validate proper behavior with pen and paper. But this is really dirty and not fun :) It does not work well with problems that require a huge amount of data for their validation...
I'm not sure these answers really capture the problem at hand. A B-tree's input and output aren't any different from those of any other dictionary - but the algorithm performs better, if it's implemented correctly. It's only really got two functions to test (add, and find) so theoretically, "black-box" testing of this single component should be fine. Designing for testability isn't the issue, since no matter how you do it the whole algorithm will be one component.
So the question is: when you have to implement subtle algorithms, the kinds with complicated output that you can't always understand in your head so well, how do you test them? I think there are three different strategies you can use:
Black-box test basic functionality. For the B-tree case, this is things like cwash suggested, and also, things like making sure that when you add an item, you can then find it, etc.
Test certain invariants that your algorithm should maintain (the B-tree should be balanced, values within nodes should be sorted, etc.); a sketch of such a check follows this list.
A few, small "pencil-and-paper" tests may be necessary -- work the algorithm out by hand and check that it matches what your code does. But the big-data tests can all be of type 2. These can also be brittle, so unless you need to be really sure about your algorithm, you may want to avoid them.
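To make strategy 2 concrete, here is a hedged C++ sketch of an invariant checker, meant to be called from tests after every batch of inserts. The Node type is a deliberately simplified stand-in; a real B-tree node would also carry per-node key-count limits, which could be asserted the same way.

#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for a B-tree node, for illustration only.
struct Node {
    std::vector<int>   keys;      // keys stored in this node
    std::vector<Node*> children;  // empty for leaf nodes
};

// Invariant 1: keys inside every node are strictly sorted.
// Invariant 2: all leaves sit at the same depth (the tree is balanced).
// Returns the leaf depth; asserts on any violation.
int checkInvariants(const Node* node, int depth = 0) {
    for (std::size_t i = 1; i < node->keys.size(); ++i)
        assert(node->keys[i - 1] < node->keys[i]);

    if (node->children.empty())
        return depth;  // a leaf reports its depth

    int leafDepth = checkInvariants(node->children[0], depth + 1);
    for (std::size_t i = 1; i < node->children.size(); ++i)
        assert(checkInvariants(node->children[i], depth + 1) == leafDepth);
    return leafDepth;
}

int main() {
    Node leaf1{{1, 2}, {}};
    Node leaf2{{7, 9}, {}};
    Node root{{5}, {&leaf1, &leaf2}};
    checkInvariants(&root);
}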
If you do not grasp the problem at hand, how can you develop a solution to it? My suggestion would be to understand the domain enough to be able to work out the problem on paper and ensure that your program matches.
Consult with an expert on the subject.
I know if I have a convoluted procedure I'm trying to fix, I have no idea what the output should be after my changes, so I need to consult a fellow developer with more knowledge of the business need, and they are able to verify what I've done is correct.
I would focus on constructing test cases that exercise the functionality of your B-tree algorithm. I haven't looked at it for years, but I'm fairly sure you'll be able to find a documented sequence of steps to insert a set of values in a specific order, then validate that the leaf nodes are as they should be. If you construct your testing along these lines, you should be able to prove your implementation is correct.
The key is to know that there is a balance between testing something to death and writing tests that adequately cover what should be covered. Edge cases (e.g. null inputs, or checking that inputs are numeric by testing an alphabetic or punctuation character) are likely most of the tests you'd need. To complement these, there may be one or two common cases to handle, to show the program can handle a non-edge case as well. Covering all valid input in most programs is overkill and would result in an overwhelmingly large number of tests.
I think the answer to the question you're asking boils down to designing for testability. Often you get a testable design for free when you test-drive the development of the solution. But let's face it, when you're implementing a highly mathematical algorithm, this just doesn't fall out.
To make sure you have a testable design, you need to understand what a seam is. Then you need to know a few rules of thumb, such as avoiding statics, using polymorphism, and properly decomposing problems and separating concerns.
Watch "The Clean Code Talks -- Unit Testing" by Misko Hevery, I think it will help you wrap your head around it.
Try looking at it from a requirements point of view, rather than an implementation point of view. Before you write code, you must understand exactly what you want it to do.
Testing and requirements should be a matching pair. If you're having trouble defining tests, maybe it's because the requirements are not well-defined. That in turn implies that you may have bugs that aren't so much implementation bugs, but "lack of clear requirements" bugs. The code writer in that case would be working to a mental list of requirements that he/she thinks is requirements, but can't be sure, and they're not written down for independent understanding and verification.
I've struggled with software where the requirements weren't clear, because the customer couldn't even tell us what they wanted. But when we delivered to them, they sure could tell us then what they didn't like about it! A big part of software engineering is getting the requirements right before the coding begins. This is true on the high-level (overall product, with requirement input from customer) and also the smaller level (modules, individual functions, where requirements are internally defined by software team or individuals). It is still true to some degree I think for iterative development, although the high-level requirements are more fluid.
@Bystrik Jurina,
I often get involved in projects which involve conversions between disparate data formats. Most answers have focused on testing a B-tree or similar algorithm, but it seems that you're looking for a more general answer.
Most of my work is based on the command line. It may sound like a contradiction, but one of the first tools I use is visualization. I'll write some methods to write out my data structures in a format that's easy to consume. This can (and usually does) include something that's visually clear. But often it also means something that I could easily parse with a smaller test program, or even import into Excel.
I'll start by focusing on the basic outline, and write a program that does the bare minimum of what I need to accomplish. If it's a multi-step process, this might mean implementing one step at a time and validating the results of each step before moving on. Or it might mean writing something that works only in specific cases, and then expanding the set of cases where it's expected to work. At first you can validate that the code works for a limited set of cases, such as known input data. As the project moves forward, you can start logging warnings for cases you might not have tested, or for unexpected types of input data. This has drawbacks, but is a nice approach when you're dealing with a known set of input data.
Validation techniques can include formal test cases, or informal programs that work to challenge your assumptions. It could mean writing a basic driver program to exercise the "core" routines. A good example would be to add a record to a database, then read it back and compare the original object against the one loaded from the database.
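A hedged sketch of that round-trip idea, with an in-memory store standing in for the database (all names are invented for the example):

#include <cassert>
#include <map>
#include <string>

struct Record {
    std::string name;
    int age;
    bool operator==(const Record& o) const {
        return name == o.name && age == o.age;
    }
};

// Stand-in for a real database, just enough for the round-trip test.
class FakeDb {
public:
    void save(int id, const Record& r) { rows_[id] = r; }
    Record load(int id) const { return rows_.at(id); }
private:
    std::map<int, Record> rows_;
};

int main() {
    FakeDb db;
    Record original{"Ada", 36};
    db.save(1, original);
    // Round trip: what comes back must equal what went in.
    assert(db.load(1) == original);
}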
If you have trouble wrapping your head around the way a program functions, think about what it needs to accomplish. It might be easier to write code that tests how different inputs produce different outputs. Producing visualizations is a good help, because the act of deciding how to display the data can make you think about different conditions and focus in on the most critical parts of your data structures.
Often I've found that building a visualization brings me to admit that the way the data is being stored just isn't very clear. For a B-tree, the representation isn't very flexible. But for other cases, you may be using parallel arrays when a nested tree of objects would be more natural.