How to manage special cases and heuristics - c++

I often have code based on a specific well defined algorithm. This gets well commented and seems proper. For most data sets, the algorithm works great.
But then the edge cases, the special cases, and the heuristics get added to solve particular problems with particular sets of data. As the number of special cases grows, the comments get more and more hazy. I fear going back to this code in a year or so and trying to remember why each particular special case or heuristic was added.
I sometimes wish there was a way to embed or link graphics in the source code, so I could say effectively, "in the graph of this data set, this particular feature here was causing the routine to trigger incorrectly, so that's why this piece of code was added".
What are some best-practices to handle situations like this?
Special cases seem to be always required to handle these unusual/edge cases. How can they be managed to keep the code relatively readable and understandable?
Consider an example dealing with feature recognition from photos (not exactly what I'm working on, but the analogy seems apt). When I find a particular picture for which the general algorithm fails and a special case is needed, I record that information as best I can in a comment (or, as someone suggested below, a descriptive function name). But what is often missing is a permanent link to the particular data file that exhibits the behavior in question. While my comment should describe the issue, and would probably say "see file foo.jpg for an example of this behavior", this file is never in the source tree and can easily get lost.
In cases like this, do people add data files to the source tree for reference?

Martin Fowler said in his refactoring book that when you feel the need to add a comment to your code, first see if you can encapsulate that code into a method and give the method a name that would replace the comment.
So, as an abstract example, you could create methods named:
private bool ConditionXAndYHaveOccurred(object param)
{
    // code to check for conditions x and y
    return result;
}

private object ApplySolutionForEdgeCaseWhenXAndYHappen(object param)
{
    // modify param to solve for edge case
    return param;
}
Then you can write code like
if (ConditionXAndYHaveOccurred(myObject))
{
    myObject = ApplySolutionForEdgeCaseWhenXAndYHappen(myObject);
}
Not a hard and fast concrete example, but it would help with readability in a year or two.
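Since the question is tagged C++, the same idea might look roughly like this. This is only a sketch: the FeatureData type, the thresholds, and the function names are hypothetical stand-ins for whatever the real algorithm uses.
// Hypothetical data carried through the recognition pipeline.
struct FeatureData { double contrast = 0.0; double edgeDensity = 0.0; };

// True for the over-exposed, low-contrast images that made the general
// algorithm mis-trigger (the comment or KB entry names the offending photo).
static bool IsOverexposedLowContrastImage(const FeatureData& d)
{
    return d.contrast < 0.1 && d.edgeDensity < 0.05;
}

// Damp the response for that special case instead of scattering the tweak
// through the main algorithm.
static FeatureData ApplyOverexposureHeuristic(FeatureData d)
{
    d.edgeDensity *= 0.5;
    return d;
}

// In the main routine:
//     if (IsOverexposedLowContrastImage(data))
//         data = ApplyOverexposureHeuristic(data);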

Unit testing can help here. Having tests that actually simulate the special cases can often serve as documentation of why the code does what it does. This can often be better than just describing the issue in a comment.
Not that this replaces moving the special case handling to their own functions and decent comments...
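For instance, a documenting test can be as small as this sketch in plain C++ with assert; the detector, the thresholds, and the "narrow spike" case are hypothetical.
#include <cassert>

// Hypothetical detector under test: fires on wide, tall spikes.
static bool TriggersOnSpike(double width, double height)
{
    // Special case: spikes narrower than 3 samples are sensor noise and
    // must not trigger (see the data set referenced in the comment / KB).
    if (width < 3.0) return false;
    return height > 10.0;
}

// This test *is* the documentation of the narrow-spike special case.
static void TestNarrowSpikeDoesNotTrigger()
{
    assert(!TriggersOnSpike(/*width=*/2.0, /*height=*/50.0));
}

int main()
{
    TestNarrowSpikeDoesNotTrigger();
    return 0;
}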

If you have a knowledge base or a wiki for the project, you could add the graph there, linking to it from the method (as per the Fowler quote in Matthew's answer) and also in the source control commit message for the edge-case change.
// See description at KB#2312
private object SolveXAndYEdgeCase(object param)
{
    // modify param to solve for edge case
    return param;
}
Commit Message: Solution for X and Y edge case, see description at KB#2312
It is more work, but a way to document cases more thoroughly than mere test cases or comments could. Even though one might argue that test cases should be documentation enough, you might not want to store the whole failing data set in one, for instance.
Remember, vague problems lead to vague solutions.

I'm not usually an advocate of test-driven development and similar styles that stress tests too much, but this seems to be a perfect case where a bunch of unit tests can help a lot, not so much to catch bugs from later changes as simply to document all the special cases that need to be addressed.
A few good unit tests with comments in them are in themselves the best description of the special cases. Commenting the code itself gets easier too: one can simply point to the unit tests that illustrate the problem being solved at that point in the code.

About the part of the question that says "I sometimes wish there was a way to embed or link graphics in the source code, so I could say effectively, 'in the graph of this data set, this particular feature here was causing the routine to trigger incorrectly, so that's why this piece of code was added'":
If the "graphic" that you want to embed is a graph, and if you use Doxygen, you can embed dot commands in your comment to generate a graph in the documentation:
/**
  If we have a subgraph looking like this:
  \dot
  digraph g {
      A -> B;
      A -> C;
      B -> C;
  }
  \enddot
  the usual method does not work well and we use this heuristic instead.
*/
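If the picture is an actual image of the data (a plot exported from your analysis tool, say) rather than a graph you can express in dot, Doxygen also has an \image command. A sketch, assuming a hypothetical narrow_spike.png that has been checked in and whose directory is listed in IMAGE_PATH in the Doxyfile:
/**
 * \image html narrow_spike.png "The data set that made the detector mis-trigger"
 * The clamp below exists because of the feature circled in the plot above.
 */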

Don Knuth invented literate programming to make it easy for your program documentation to include plots, graphs, charts, mathematical equations, and whatever else you need to make it understood. A literate program is a great way to explain why something is the way it is and how it got that way over time. There are many, many literate-programming tools; the "noweb" tool is one of the simplest and is shipped with some Linux distributions.
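For a flavor of it, here is a minimal sketch of noweb's syntax (the figure name and the clamp are hypothetical): documentation chunks start with @ and are ordinary TeX, so they can \includegraphics a plot of the troublesome data set; code chunks start with <<name>>= and are extracted with notangle, while noweave produces the woven document.
@ The general algorithm mis-triggers on the data set plotted in
\includegraphics{narrow-spike.pdf}; the clamp below was added for it.

<<clamp narrow spikes>>=
if (spike.width < 3 && spike.height > threshold)
    spike.height = threshold;  // motivated by the figure above
@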

Without knowing the specific nature of your problem it is not easy to give an answer, but in my own experience, hard-coded handling of special cases should be avoided. Have you thought about implementing a rules engine, or something like it, to handle special cases outside your main processing algorithm?
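Even without a full rules engine, the spirit of the idea can be sketched in a few lines of C++. Everything here is hypothetical (the Sample type, the threshold, the single rule): the special cases live in a table of predicate/fix pairs that the main algorithm walks, so each heuristic becomes one named, documented entry rather than an if buried in the core loop.
#include <functional>
#include <string>
#include <vector>

struct Sample { double width = 0.0; double height = 0.0; };

struct Rule {
    std::string why;                              // the comment lives with the rule
    std::function<bool(const Sample&)> applies;
    std::function<void(Sample&)> fix;
};

static const std::vector<Rule> specialCases = {
    { "Narrow spikes are sensor noise (see the KB entry for the offending data set)",
      [](const Sample& s) { return s.width < 3.0; },
      [](Sample& s)       { s.height = 0.0; } },
};

static void ApplySpecialCases(Sample& s)
{
    for (const auto& rule : specialCases)
        if (rule.applies(s))
            rule.fix(s);
}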

It sounds like you need more thorough documentation than just code comments. That way someone could look up the function in question in the documentation and be presented with an example picture that requires a special case.

Related

Do very long methods always need refactoring?

I face a situation where we have many very long methods, 1000 lines or more.
To give you some more detail: we have a list of incoming high-level commands, and each one results in a longer (sometimes huge) list of lower-level commands. A factory creates an instance of a class for each incoming command. Each class has a process method where all the lower-level commands are generated and added in sequence. As I said, these sequences of commands and their parameters quite often cause the process methods to reach thousands of lines.
There are a lot of repetitions. Many command patterns are shared between different commands, but the code is repeated over and over. That leads me to think refactoring would be a very good idea.
On the other hand, the specs we have come in exactly the same form as the current code: a very long list of commands for each incoming one. When I've tried some refactoring, I've started to feel uncomfortable with the specs: I lose the obvious correspondence between the specs and the code, and waste time digging into newly created common classes.
Then here is the question: in general, do you think such very long methods always need refactoring, or would it be acceptable in a similar case?
(unfortunately refactoring the specs is not an option)
edit:
I have removed every reference to "generate" because it was actually confusing. It's not auto-generated code.
class InCmd001 {
    OutMsg process( InMsg& inMsg ) {
        OutMsg outMsg = OutMsg::Create();

        OutCmd001 outCmd001 = OutCmd001::Create();
        outCmd001.SetA( param.getA() );
        outCmd001.SetB( inMsg.getB() );
        outMsg.addCmd( outCmd001 );

        OutCmd016 outCmd016 = OutCmd016::Create();
        outCmd016.SetF( param.getF() );
        outMsg.addCmd( outCmd016 );

        OutCmd007 outCmd007 = OutCmd007::Create();
        outCmd007.SetR( inMsg.getR() );
        outMsg.addCmd( outCmd007 );

        // ......
        return outMsg;
    }
};
Here is an example of one incoming command class (manually written in pseudo C++).
Code never needs refactoring. The code either works, or it doesn't. And if it works, the code doesn't need anything.
The need for refactoring comes from you, the programmer. The person reading, writing, maintaining and extending the code.
If you have trouble understanding the code, it needs to be refactored. If you would be more productive by cleaning up and refactoring the code, it needs to be refactored.
In general, I'd say it's a good idea for your own sake to refactor 1000+ line functions. But you're not doing it because the code needs it. You're doing it because that makes it easier for you to understand the code, test its correctness, and add new functionality.
On the other hand, if the code is automatically generated by another tool, you'll never need to read it or edit it. So what'd be the point in refactoring it?
I understand exactly where you're coming from, and can see exactly why you've structured your code the way it is, but it needs to change.
The uncertainty you feel when you attempt to refactor can be ameliorated by writing unit tests. If you have tests specific to each spec, then the code for each spec can be refactored until you're blue in the face, and you can have confidence in it.
A second option: is it possible to automatically generate your code from a data structure?
If you have a core suite of classes that do the donkey work and the edge cases, you can auto-generate the repetitive 1000-line methods as often as you wish; see the sketch below.
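A rough sketch of what "generate from a data structure" could look like, using hypothetical stand-in types (InMsg, OutMsg, OutCmd, and CmdSpec are simplified inventions, not the real classes): each line of the spec becomes a row in a table, and one small loop replaces the thousand-line process body.
#include <functional>
#include <vector>

struct InMsg  { int a = 0, b = 0, r = 0, f = 0; };
struct OutCmd { int id = 0; int value = 0; };
struct OutMsg { std::vector<OutCmd> cmds; };

// One row per low-level command the spec lists for this incoming command.
struct CmdSpec {
    int id;                                      // e.g. 1 for OutCmd001
    std::function<int(const InMsg&)> source;     // which input field feeds it
};

static OutMsg Process(const InMsg& in, const std::vector<CmdSpec>& spec)
{
    OutMsg out;
    for (const auto& row : spec)
        out.cmds.push_back({ row.id, row.source(in) });
    return out;
}

// The table for InCmd001 then reads almost like the spec itself:
static const std::vector<CmdSpec> inCmd001Spec = {
    { 1,  [](const InMsg& m) { return m.a; } },
    { 16, [](const InMsg& m) { return m.f; } },
    { 7,  [](const InMsg& m) { return m.r; } },
};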
However, there are exceptions to every rule.
If the methods are a literal interpretation of the spec (very little additional logic), and the specs change infrequently, and the "common" portions (i.e. bits that happen to be the same right now) of the specs change at different times, and you're not going to be asked to get a 10x performance gain out of the code anytime soon, then (and only then) . . . you may be better off with what you have.
. . . but on the whole, refactor.
Yes, always. 1000 lines is at least 10x longer than any function should ever be, and I'm tempted to say 100x, except that when dealing with input parsing and validation it can become natural to write functions with 20 or so lines.
Edit: Just re-read your question and I'm not clear on one point - are you talking about machine generated code that no-one has to touch? In which case I would leave things as they are.
Refactoring is not the same as writing from scratch. While you should never write code like this in the first place, before you refactor it you need to consider the costs of refactoring in terms of time spent, the associated risks in terms of breaking code that already works, and the net benefits in terms of future time saved. Refactor only if the net benefits outweigh the associated costs and risks.
Sometimes wrapping and rewriting can be a safer and more cost effective solution, even if it appears expensive at first glance.
Long methods need refactoring if they are maintained (and thus need to be understood) by humans.
As a rule of thumb, code for humans first. I don't agree with the common idea that functions need to be short. I think what you need to aim for is that when a human reads your code, they grok it quickly.
To this effect it's a good idea to simplify things as much as possible--but not more than that. It's a good idea to delegate roughly one task for each function. There is no rule as for what "roughly one task" means: you'll have to use your own judgement for that. But do recognize that a function split into too many other functions itself reduces readability. Think about the human being who reads your function for the first time: they would have to follow one function call after another, constantly context-switching and maintaining a stack in their mind. This is a task for machines, not for humans.
Find the balance.
Here, you see how important naming things is. You will see it is not that easy to choose names for variables and functions, it takes time, but on the other hand it can save a lot of confusion on the human reader's side. Again, find the balance between saving your time and the time of the friendly humans who will follow you.
As for repetition, it's a bad idea. It's something that needs to be fixed, just like a memory leak. It's a ticking bomb.
As others have said before me, changing code can be expensive. You need to do the thinking as for whether it will pay off to spend all this time and effort, facing the risks of change, for a better code. You will possibly lose lots of time and make yourself one headache after another now, in order to possibly save lots of time and headache later.
Take a look at the related question How many lines of code is too many?. There are quite a few tidbits of wisdom throughout the answers there.
To repost a quote (although I'll attempt to comment on it a little more here)... A while back, I read this passage from Ovid's journal:
I recently wrote some code for Class::Sniff which would detect "long methods" and report them as a code smell. I even wrote a blog post about how I did this (quelle surprise, eh?). That's when Ben Tilly asked an embarrassingly obvious question: how do I know that long methods are a code smell?
I threw out the usual justifications, but he wouldn't let up. He wanted information and he cited the excellent book Code Complete as a counter-argument. I got down my copy of this book and started reading "How Long Should A Routine Be" (page 175, second edition). The author, Steve McConnell, argues that routines should not be longer than 200 lines. Holy crud! That's waaaaaay too long. If a routine is longer than about 20 or 30 lines, I reckon it's time to break it up.
Regrettably, McConnell has the cheek to cite six separate studies, all of which found that longer routines were not only not correlated with a greater defect rate, but were also often cheaper to develop and easier to comprehend. As a result, the latest version of Class::Sniff on github now documents that longer routines may not be a code smell after all. Ben was right. I was wrong.
(The rest of the post, on TDD, is worth reading as well.)
Coming from the "shorter methods are better" camp, this gave me a lot to think about.
Previously my large methods were generally limited to "I need inlining here, and the compiler is being uncooperative", or "for one reason or another the giant switch block really does run faster than the dispatch table", or "this stuff is only called exactly in sequence and I really really don't want function call overhead here". All relatively rare cases.
In your situation, though, I'd have a large bias toward not touching things: refactoring carries some inherent risk, and it may currently outweigh the reward. (Disclaimer: I'm slightly paranoid; I'm usually the guy who ends up fixing the crashes.)
Consider spending your efforts on tests, asserts, or documentation that can strengthen the existing code and tilt the risk/reward scale before any attempt to refactor: invariant checks, bound function analysis, and pre/postcondition tests; any other useful concepts from DBC; maybe even a parallel implementation in another language (maybe something message oriented like Erlang would give you a better perspective, given your code sample) or even some sort of formal logical representation of the spec you're trying to follow if you have some time to burn.
Any of these kinds of efforts generally have a few results, even if you don't get to refactor the code: you learn something, you increase your (and your organization's) understanding of and ability to use the code and specifications, you might find a few holes that really do need to be filled now, and you become more confident in your ability to make a change with less chance of disastrous consequences.
As you gain a better understanding of the problem domain, you may find that there are different ways to refactor you hadn't thought of previously.
This isn't to say "thou shalt have a full-coverage test suite, and DBC asserts, and a formal logical spec". It's just that you are in a typically imperfect situation, and diversifying a bit -- looking for novel ways to approach the problems you find (maintainability? fuzzy spec? ease of learning the system?) -- may give you a small bit of forward progress and some increased confidence, after which you can take larger steps.
So think less from the "too many lines is a problem" perspective and more from the "this might be a code smell, what problems is it going to cause for us, and is there anything easy and/or rewarding we can do about it?"
Leaving it cooking on the backburner for a bit -- coming back and revisiting it as time and coincidence allows (e.g. "I'm working near the code today, maybe I'll wander over and see if I can't document the assumptions a bit better...") may produce good results. Then again, getting royally ticked off and deciding something must be done about the situation is also effective.
Have I managed to be wishy-washy enough here? My point, I think, is that the code smells, the patterns/antipatterns, the best practices, etc -- they're there to serve you. Experiment to get used to them, and then take what makes sense for your current situation, and leave the rest.
I think you first need to "refactor" the specs. If there are repetitions in the spec, it too will become easier to read once it makes use of some "basic building blocks".
Edit: As long as you cannot refactor the specs, I wouldn't change the code.
Coding style guides are all made for easier code maintenance, but in your special case the ease of maintenance is achieved by following the spec.
Some people here asked if the code is generated. In my opinion it does not matter: If the code follows the spec "line by line" it makes no difference if the code is generated or hand-written.
1000 lines of code is nothing. We have functions that are 6 to 12 thousand lines long. Those functions are so big that things literally get lost in there, and no tool can help us even look at high-level abstractions of them. The code is now, unfortunately, incomprehensible.
My opinion of functions that big is that they were not written by brilliant programmers but by incompetent hacks who shouldn't be left anywhere near a computer, but should be fired and left flipping burgers at McDonald's. Such code wreaks havoc by leaving behind features that cannot be added to or improved upon (too bad for the customer). The code is so brittle that it cannot be modified by anyone - even the original authors.
And yes, those methods should be refactored, or thrown away.
Do you ever have to read or maintain the generated code?
If yes, then I'd think some refactoring might be in order.
If no, then the higher-level language is really the language you're working with -- the C++ is just an intermediate representation on the way to the compiler -- and refactoring might not be necessary.
Looks to me that you've implemented a separate language within your application - have you considered going that way?
It has been my understanding that it's recommended that any method over 100 lines of code be refactored.
I think some rules may be a little different in this era, when code is most commonly viewed in an IDE. If the code does not contain exploitable repetition, such that there are 1,000 lines which are going to be referenced once each, and which share a significant number of variables in a clear fashion, dividing the code into 100-line routines each of which is called once may not be that much of an improvement over having a well-formatted 1,000-line module which includes #region tags or the equivalent to allow outline-style viewing.
My philosophy is that certain layouts of code generally imply certain things. To my mind, when a piece of code is placed into its own routine, that suggests that the code will be usable in more than one context (exception: callback handlers and the like in languages which don't support anonymous methods). If code segment #1 leaves an object in an obscure state which is only usable by code segment #2, and code segment #2 is only usable on a data object which is left in the state created by #1, then absent some compelling reason to put the segments in different routines, they should appear in the same routine. If a program puts objects through a chain of obscure states extending for many hundreds of lines of code, it might be good to rework the design of the code to subdivide the operation into smaller pieces which have more "natural" pre- and post- conditions, but absent some compelling reason to do so, I would not favor splitting up the code without changing the design.
For further reading, I highly recommend the long, insightful, entertaining, and sometimes bitter discussion of this topic over on the Portland Pattern Repository.
I've seen cases where it is not the case (for example, creating an Excel spreadsheet in .Net often requires a lot of lines of code for the formatting of the sheet), but most of the time the best thing would indeed be to refactor it.
I personally try to make a function small enough so it all appears on my screen (without affecting the readability of course).
1000 lines? Definitely they need to be refactored. Also note that, for example, the default maximum number of executable statements is 30 in Checkstyle, a well-known coding standard checker.
If you refactor, when you refactor, add some comments to explain what the heck it's doing.
If it had comments, it would be much less likely a candidate for refactoring, because it would already be easier to read and follow for someone starting from scratch.
Then here is the question: in general, do you think such very long methods always need refactoring,
If you ask in general, we will say yes.
or would it be acceptable in a similar case? (unfortunately refactoring the specs is not an option)
Sometimes it is acceptable, but that is very unusual. I will give you a couple of examples:
There are some 8-bit microcontrollers, the Microchip PIC family, that have only a fixed 8-level stack, so you can't nest more than 8 calls. Care must be taken to avoid stack overflow, so in this special case having many small (nested) functions is not the best way to go.
Another example is low-level code optimization, where you have to take into account the cost of jumps and of saving context. Use it with care.
EDIT:
Even with generated code, you may need to refactor the way it is generated, for example to save memory or energy, to produce human-readable output, for elegance, who knows.
There has been very good general advice already; here is a practical recommendation for your sample:
Common patterns can be isolated in plain feeder methods:
void AddSimpleTransform( OutMsg & msg, InMsg const & inMsg,
                         int rotateBy, int foldBy, int gonkBy = 0 )
{
    // create & add up to three messages
}
You might even improve that by making this a member of OutMsg, and using a fluent interface, such that you can write
OutMsg msg;
msg.AddSimpleTransform( inMsg, 12, 17 )
   .Staple( "print" )
   .AddArtificialRust( 0.02 );
which can be an additional improvement under some circumstances.
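For what it's worth, here is a sketch of how such a fluent interface might be wired up, with hypothetical, stripped-down types (the real OutMsg will differ): each builder method returns *this by reference so calls can be chained.
#include <string>
#include <vector>

struct InMsg { int r = 0; int f = 0; };

class OutMsg {
public:
    OutMsg& AddSimpleTransform(const InMsg& in, int rotateBy, int foldBy, int gonkBy = 0)
    {
        (void)in; (void)rotateBy; (void)foldBy; (void)gonkBy; // used in the real implementation
        log_.push_back("transform");   // create & add up to three low-level commands here
        return *this;
    }
    OutMsg& Staple(const std::string& what)  { log_.push_back("staple " + what); return *this; }
    OutMsg& AddArtificialRust(double amount) { (void)amount; log_.push_back("rust"); return *this; }
private:
    std::vector<std::string> log_;   // stand-in for the real command list
};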

How simple is 'too simple to break'? - explained

In the JUnit FAQ you can read that you shouldn't test methods that are too simple to break. While all the examples seem logical (getters and setters, delegation, etc.), I'm not sure I fully grasp the "can't break on its own" concept. When would you say that a method "can't break on its own"? Anyone care to elaborate?
I think "can't break on its own" means that the method only uses elements of its own class, and does not depend upon the behavior of any other objects/classes, or that it delegates all of its functionality to some other method or class (which presumably has its own tests).
The basic idea is that if you can see everything the method does, without needing to refer to other methods or classes, and you are pretty sure it is correct, then a test is probably not necessary.
There is not necessarily a clear line here. "Too simple to break" is in the eye of the beholder.
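As a small illustration of where that line might fall, consider a hypothetical C++ class: the accessor below is arguably too simple to break, while the method containing a decision is worth a test.
class Account {
public:
    double Balance() const { return balance_; }   // too simple to break

    bool CanWithdraw(double amount) const         // has a decision: worth testing
    {
        return amount > 0.0 && amount <= balance_;
    }
private:
    double balance_ = 0.0;
};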
Try thinking of it this way. You're not really testing methods. You're describing some behaviour and giving some examples of how to use it, so that other people (including your later self) can come and change that behaviour safely later. The examples happen to be executable.
If you think that people can change your code safely, you don't need to worry.
No matter how simple a method is, it can still be wrong. For example you might have two similarly named variables and access the wrong one. However, these errors will likely be quickly found and once these methods are written correctly, they are going to stay correct and so it is not worthwhile permanently keeping around a test for this. Rather than "too simple to break", I would recommend considering whether it is too simple to be worth keeping a permanent test.
Put it this way: you're building a wooden table.
You'll test the things that may fail. For instance, putting a jar on the table, sitting on the table, or pushing the table from one side of the room to the other. You're testing the table in ways you know it is somehow vulnerable, or at least in ways you know you'll use it.
You don't test, though, the nails, or one of its legs, because they are "too simple to break on their own".
The same goes for unit testing: you don't test getters/setters, because the only way they can fail is if the runtime environment fails. You don't test methods that just forward a message to another method, because they are too simple to break on their own; you are better off testing the referenced method.
I hope this helps.
If you are using TDD, that advice is wrong. In the TDD case your function exists because the test failed. It doesn't matter if the function is simple or not.
But if you are adding tests afterwards to some existing code, I can sort of understand the argument that you shouldn't need to test code that cannot break. But I still think that is just an excuse for not having to write tests. Also ask yourself: if that piece of code is not worthy of testing, then maybe that code is not needed at all?
I like the risk-based approach of GAMP 5, which basically means (in this context) first assessing the various possible risks of a piece of software and only defining tests for the higher-risk parts.
Although this applies to GxP environments, it can be adapted along these lines: how likely is a certain class to have erroneous methods, and how big would the impact of an error be? E.g., if a method decides whether to give a user access to a resource, you must of course test that extensively enough.
That means that in determining where to draw the "too simple to break" line, it can help to take into consideration the possible consequences of a potential flaw.

Testing approach for algorithms with complex outputs

How do you test the result of a program that is basically a black box? For example, one year ago I had to write a B-tree as homework and I really struggled with testing its correctness. What strategies do you use in such scenarios? Visualization? Robust input-to-result sets of testing data? What do you do when it is hard to get such data, because the only way to get it is from your own properly working program?
EDIT: I think that my question was misunderstood. There was no problem with understanding how a B-tree works. That is trivial. But writing robust tests for validating its proper functionality is not so trivial. I think that this school problem is similar to many practical, real-world scenarios and test cases. And sometimes understanding the domain is quite different from delivering a working and correct program...
EDIT2: And yes, with a B-tree it is possible to validate proper behavior with pen and paper. But this is really tedious and not fun :) It does not work well with problems that require huge amounts of data for their validation...
I'm not sure these answers really capture the problem at hand. A B-tree's input and output aren't any different from those of any other dictionary---but the algorithm performs better, if it's implemented correctly. It's only really got two functions to test (add, and find) so theoretically, "black-box" testing of this single component should be fine. Designing for testability isn't the issue, since no matter how you do it the whole algorithm will be one component.
So the question is: when you have to implement subtle algorithms, the kinds with complicated output that you can't always understand in your head so well, how do you test them? I think there are three different strategies you can use:
Black-box test basic functionality. For the B-tree case, this is things like cwash suggested, and also, things like making sure that when you add an item, you can then find it, etc.
Test certain invariants that your algorithm should maintain (the B-tree should be balanced, values within nodes should be sorted, etc.); see the sketch after this list.
A few, small "pencil-and-paper" tests may be necessary -- work the algorithm out by hand and check that it matches what your code does. But the big-data tests can all be of type 2. These can also be brittle, so unless you need to be really sure about your algorithm, you may want to avoid them.
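As an illustration of strategy 2, here is a sketch of an invariant check for a much-simplified node layout (the Node struct is hypothetical, and a real B-tree check would also enforce minimum/maximum key counts): it verifies that keys are sorted within each node and that every subtree's keys stay within the range implied by its parent.
#include <climits>
#include <vector>

struct Node {
    std::vector<int> keys;          // sorted within the node
    std::vector<Node*> children;    // empty for a leaf, else keys.size() + 1
};

// Returns true if every key under n lies strictly between lo and hi
// and the local invariants hold.
static bool CheckInvariants(const Node* n, long lo = LONG_MIN, long hi = LONG_MAX)
{
    if (n == nullptr) return true;
    for (size_t i = 0; i < n->keys.size(); ++i) {
        if (n->keys[i] <= lo || n->keys[i] >= hi) return false;
        if (i > 0 && n->keys[i - 1] >= n->keys[i]) return false;
    }
    if (n->children.empty()) return true;                        // leaf
    if (n->children.size() != n->keys.size() + 1) return false;  // branch shape
    for (size_t i = 0; i < n->children.size(); ++i) {
        long newLo = (i == 0) ? lo : n->keys[i - 1];
        long newHi = (i == n->keys.size()) ? hi : n->keys[i];
        if (!CheckInvariants(n->children[i], newLo, newHi)) return false;
    }
    return true;
}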
If you do not grasp the problem at hand, how can you develop a solution to it? My suggestion would be to understand the domain enough to be able to work out the problem on paper and ensure that your program matches.
Consult with an expert on the subject.
I know if I have a convoluted procedure I'm trying to fix, I have no idea what the output should be after my changes, so I need to consult a fellow developer with more knowledge of the business need, and they are able to verify what I've done is correct.
I would focus on constructing test cases that exercise the functionality of your B-tree algorithm. I haven't looked at it for years, but I'm fairly sure you'll be able to find a documented sequence of steps to insert a set of values in a specific order, then validate that the leaf nodes are as they should be. If you construct your testing along these lines, you should be able to prove your implementation is correct.
The key is to know there is a balance between testing something to death and doing tests that adequately cover what should be covered. Edge cases, e.g. null inputs or checking that inputs are numeric by testing an alphabetic character or a punctuation character, are likely most of the tests you'd need. To complement this there may be one or two common cases to handle, to show the program can handle a non-edge case as well. Covering all valid input in most programs is overkill and would result in an overwhelmingly large number of tests.
I think the answer to the question you're asking boils down to designing for testability. Often you get a testable design for free when you test-drive the development of the solution. But let's face it, when you're implementing a highly mathematical algorithm, this just doesn't fall out.
To make sure you have a testable design, you need to understand what a seam is. Then you need to know a few rules of thumb, such as avoiding statics, using polymorphism, and properly decomposing problems and separating concerns.
Watch "The Clean Code Talks -- Unit Testing" by Misko Hevery, I think it will help you wrap your head around it.
Try looking at it from a requirements point of view, rather than an implementation point of view. Before you write code, you must understand exactly what you want it to do.
Testing and requirements should be a matching pair. If you're having trouble defining tests, maybe it's because the requirements are not well-defined. That in turn implies that you may have bugs that aren't so much implementation bugs, but "lack of clear requirements" bugs. The code writer in that case would be working to a mental list of requirements that he/she thinks is requirements, but can't be sure, and they're not written down for independent understanding and verification.
I've struggled with software where the requirements weren't clear, because the customer couldn't even tell us what they wanted. But when we delivered to them, they sure could tell us then what they didn't like about it! A big part of software engineering is getting the requirements right before the coding begins. This is true on the high-level (overall product, with requirement input from customer) and also the smaller level (modules, individual functions, where requirements are internally defined by software team or individuals). It is still true to some degree I think for iterative development, although the high-level requirements are more fluid.
@Bystrik Jurina,
I often get involved in projects which involve conversions between disparate data formats. Most answers have focused on testing a B-tree or similar algorithm, but it seems that you're looking for a more general answer.
Most of my work is based on the command line. It may sound like a contradiction, but one of the first tools I use is visualization. I'll write some methods to write out my data structures in a format that's easy to consume. This can (and usually does) include something that's visually clear. But often it also means something that I could easily parse with a smaller test program, or even import into Excel.
I'll start by focusing on the basic outline, and write a program that does the bare minimum of what I need to accomplish. If it's a multi-step process, this might mean implementing one step at a time and validating the results of each step before moving on. Or writing something that works only in specific cases, and then expanding the set of cases where it's expected to work. At first you can validate that the code works in the limited set of cases, such as for known input data. As the project moves forward, you can start logging warnings for cases you might not have tested, or for unexpected types of input data. This has drawbacks, but is a nice approach when you're dealing with a known set of input data
Validation techniques can include formal test cases, or informal programs that work to challenge your assumptions. It could mean writing a basic driver program to exercise the "core" routines. A good example would be to add a record to a database, then read it back and compare the original object against the one loaded from the database.
If you have trouble wrapping your head around the way a program functions, think about what it needs to accomplish. It might be easier to write code that tests the way different inputs produce different outputs. Producing visualizations is a good help, because the act of deciding how to display the data can make you think about different conditions and focus in on the most critical parts of your data structures.
Often I've found that building a visualization brings me to admit that the way the data is being stored just isn't very clear. For a B-tree, the representation isn't very flexible. But for other cases, you may be using parallel arrays when a nested tree of objects would be more natural.

Testing When Correctness is Poorly Defined?

I generally try to use unit tests for any code that has easily defined correct behavior given some reasonably small, well-defined set of inputs. This works quite well for catching bugs, and I do it all the time in my personal library of generic functions.
However, a lot of the code I write is data mining code that basically looks for significant patterns in large datasets. Correct behavior in this case is often not well defined and depends on a lot of different inputs in ways that are not easy for a human to predict (i.e. the math can't reasonably be done by hand, which is why I'm using a computer to solve the problem in the first place). These inputs can be very complex, to the point where coming up with a reasonable test case is near impossible. Identifying the edge cases that are worth testing is extremely difficult. Sometimes the algorithm isn't even deterministic.
Usually, I do the best I can by using asserts for sanity checks and creating a small toy test case with a known pattern and informally seeing if the answer at least "looks reasonable", without it necessarily being objectively correct. Is there any better way to test these kinds of cases?
I think you just need to write unit tests based on small sets of data that will make sure your code is doing exactly what you want it to do. Whether this gives you a reasonable data-mining algorithm is a separate issue, and I don't think it is possible to settle it with unit tests. There are two "levels" of correctness of your code:
Your code is correctly implementing the given data mining algorithm (this thing you should unit-test)
The data mining algorithm you implement is "correct" - solves the business problem. This is a quite open question, it probably depends both on some parameters of your algorithm as well as on the actual data (different algorithms work for different types of data).
When facing cases like this I tend to build one or more stub data sets that reflect the proper underlying complexities of the real-life data. I often do this together with the customer, to make sure I capture the essence of the complexities.
Then I can just codify these into one or more datasets that can be used as basis for making very specific unit tests (sometimes they're more like integration tests with stub data, but I don't think that's an important distinction). So while your algorithm may have "fuzzy" results for a "generic" dataset, these algorithms almost always have a single correct answer for a specific dataset.
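A sketch of the "specific dataset, specific answer" idea, where the detector, the stub data, and the expected count are all hypothetical: the data set is small enough to reason about by hand, so the test can assert an exact answer even though the algorithm is fuzzy on real data.
#include <cassert>
#include <vector>

// Hypothetical pattern detector: counts strictly rising runs of length >= 3.
static int CountRisingRuns(const std::vector<double>& v)
{
    int runs = 0, length = 1;
    for (size_t i = 1; i < v.size(); ++i) {
        length = (v[i] > v[i - 1]) ? length + 1 : 1;
        if (length == 3) ++runs;   // count each run once, when it reaches 3
    }
    return runs;
}

int main()
{
    // Stub data set built with the customer: exactly one rising run (2, 3, 5).
    const std::vector<double> stub = { 4.0, 2.0, 3.0, 5.0, 1.0, 1.0 };
    assert(CountRisingRuns(stub) == 1);
    return 0;
}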
Well, there are a few answers.
First of all, as you mentioned, take a small case study, and do the math by hand. Since you wrote the algorithm, you know what it's supposed to do, so you can do it in a limited case.
The other one is to break down every component of your program into testable parts.
If A calls B, which calls C, which calls D, and you know that A, B, C, and D all give the right answer, then once you have tested A->B, B->C, and C->D, you can be reasonably sure that A->D is giving the correct response.
Also, if there are other programs out there that do what you are looking to do, try to acquire their datasets, or find an open-source project whose test data you could run against, and see if your application is giving similar results.
Another way to test data-mining code is to take a test set, introduce a pattern of the type you're looking for, and then test again to see if it will separate out the new pattern from the old ones.
And, the tried and true: walk through your own code by hand and see if it is doing what you meant it to do.
Really, the challenge here is this: because your application is meant to do a fuzzy, non-deterministic kind of task in a smart way, the very goal you hope to achieve is that the application becomes better than human beings at finding these patterns. That's great, powerful, and cool ... but if you pull it off, then it becomes very hard for any human beings to say, "In this case, the answer should be X."
In fact, ideally the computer would say, "Not really. I see why you think that, but consider these 4.2 terabytes of information over here. Have you read them yet? Based on those, I would argue that the answer should be Z."
And if you really succeeded in your original goal, the end user might sometimes say, "Zowie, you're right. That is a better answer. You found a pattern that is going to make us money! (or save us money, or whatever)."
If such a thing could never happen, then why are you asking the computer to detect these kinds of patterns in the first place?
So, the best thing I can think of is to let real life help you build up a list of test scenarios. If there ever was a pattern discovered in the past that did turn out to be valuable, then make a "unit test" that sees if your system discovers it when given similar data. I say "unit test" in quotes because it may be more like an integration test, but you may still choose to use NUnit or VS.Net or RSpec or whatever unit test tools you're using.
For some of these tests, you might somehow try to "mock" the 4.2 terabytes of data (you won't really mock the data, but at some higher level you'd mock some of the conclusions reached from that data). For others, maybe you have a "test database" with some data in it, from which you expect a set of patterns to be detected.
Also, if you can do it, it would be great if the system could "describe its reasoning" behind the patterns it detects. This would let the business user deliberate over the question of whether the application was right or not.
This is tricky. This sounds similar to writing tests around our text search engine. If you keep struggling, you'll figure something out:
Start with a small, simplified but reasonably representative data sample, and test basic behavior doing this
Rather than asserting that the output is exactly some answer, sometimes it's better to figure out what is important about it. For example, for our search engine, I didn't care so much about the exact order the documents were listed, as long as the three key ones were on the first page of results.
As you make a small, incremental change, figure out what the essence of it is and write a test for that. Even though the overall calculations take many inputs, individual changes to the codebase should be isolatable. For example, we found certain documents weren't being surfaced because of the presence of hyphens in some of the key words. We created tests verifying that this was behaving as we expected.
Look at tools like FitNesse, which allow you to throw a large number of datasets at a piece of code and assert things about the results. This may be easier to understand than more traditional unit tests.
I've gone back to the product owner, saying "I can't understand how this will work. How will we know if it's right?" Maybe s/he can articulate the essence of the vaguely defined problem. This has worked really well for me many times, and I've talked people out of features because they couldn't be explained.
Be creative!
Ultimately, you have to decide what your program should be doing, and then test for that.

Comments in source code [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
How to keep the source code well documented/commented? Is there a tool to generate a skeleton for comments on the Unix platform for C++?
In general, how many lines of comments is recommended for a file with around 100 lines of code?
Generally, it's best to let the code itself explain what it does, whereas the comments are there to describe why it's like that. There is no number to stick to. If your 100 lines speak for themselves, don't comment at all or just provide a summary at the beginning. If there is some knowledge involved that's beyond what the code does, explain it in a comment.
If your code is too complicated to explain itself, then that may be a reason to refactor.
This way, when you change the implementation you don't need to change the comments as well, as your comments do not duplicate the code. Since the reasons for the design seldom change it's safe to document them in comments for clarity.
Personally I think skeleton comments are a horrible, horrible idea. I understand that sometimes it's nice to save a couple of keystrokes and perhaps get argument signatures in a comment... but the resulting n+1 empty, useless comments (when the editor has added boilerplate and the coder has left it as is) are just irritating.
I do think comments are needed, at any rate -- if the only code one writes is too trivial to need explanation, chances are the code in question is useless (i.e. it could have been automated and needn't be hand-written). I tend to comment my code reasonably well because I have learnt that usually I need the comments myself first. That others can use them is just an added bonus.
In general, how many lines of comments is recommended for a file with around 100 lines of code?
Enough to make your intent clear and to explain any unfamiliar idioms used. There's no rule of thumb, because no two 100 lines of code are the same.
For example, in C#, a property can be given setters and getters like this:
public int ID { get; set; }
Now I hadn't even seen any C# until I joined StackOverflow two weeks ago, but that needs no comment even for me. Commenting that with
// accessor and setter for ID property
would just be noise. Similarly,
for( int i = m ; i < n; ++i) { // "loop from m to n" is a pointless comment
char* p = getP() ; // set p to getP, pure noise.
if( p ) // if p does not eqaul null, pure noise
int a &= 0x3; // a is bitwise or'd with 0x303, pure noise
// mask off all but two least significant bits,
//less noisy but still bad
// get remainder of a / 4, finally a useful comment
Again, any competent coder can read the code to see what it's doing. Any coder with basic experience knows that if( p ) is a common idiom for if( p != 0), which doesn't need explaining. But no one can read your intent unless you comment it.
Comment what you're trying to do, your reason for doing it, not what the code is plainly doing.
On edit: you'll note that after 11 days, no one has commented on intentional error in one of my example comments. That just underscores that that comment is pure noise.
I think this question has a lot of good relevant answers for a similar question: Self-documenting code
As for tools for creating comments, it depends on the editor you're using and the platform. Visual studio automatically creates space for comments, at least it does for C# sometimes. There are also tools that use comments to generate documentation. As for lines counts, I think that's irrelevant. Be as concise and clear as possible.
I think a good guideline is to comment every class and method with a general description of what each is for, especially if you are using an HTML documentation generation tool. Other than that, I try to keep comments to a minimum - only comment code that could potentially be confusing, or require interpretation of intent. Try to write your code in a way that doesn't require comments.
I don't think there is really a metric that you can apply to comments/lines of code, it just depends on the code.
My personal ideal is to write enough commentary so that reading just the comments explains how and why a function is intended to be used. How it works should usually come out from well-chosen variable names and a clear implementation.
One way to achieve that, at least on the comment side, is to use a tool such as Doxygen from the beginning. Start coding each new function by writing the comment describing what it is for and how it should be used.
Get Doxygen configured well, have document generation included as a build step, and read the resulting documentation.
The only comment template that might be helpful would be one that sketches in the barest beginning of the Doxygen comment block, but even that might be too much. You want the generated documentation to explain what is important without cluttering it with worthless placeholder text that will never get rewritten.
This is a subject which can be taken to extremes (like many things these days). Enforcing a strong policy can risk devaluing the exercise (i.e. comments for comments' sake) much of the time, IMHO.
Sometimes an overreaching policy makes sense (e.g. "all public functions must have comment blocks") with exceptions - why bother for generated code?
Commenting should come naturally - it should complement readable code, alongside meaningful variable, property and function names (etc.).
I don't think there is a useful or accurate measurement of X comments per Y lines of code. You will likely get a good sense of balance through peer reviews (e.g. "this code here should have a comment explaining its purpose").
I'm not sure about auto-comment tools for C/C++ but the .Net equivalent would have to be GhostDoc. Again, these tools only help define a comment structure - meaning still needs to be added by a developer, or someone who has to interpret the point of the code or design.
Commenting code is essential if you're auto-generating your documentation (we use Doxygen). Otherwise it's best to keep it to a minimum.
We use a skeleton for every method in the .cpp file.
//**************************************************************************************************
//
/// \brief
/// \details
/// \param
/// \return
/// \sa
//
//**************************************************************************************************
but this is purely due to our documentation needs
The rules I try to follow:
Write code that is self-documenting: nice and clear variable names, resisting the temptation of clever hacks, etc. This advice depends a lot on the programming language you use: it is much easier to follow with Python than with C.
Comment at the beginning to guide the reader, so that they know immediately what to expect.
Comment what is not obvious from the code. If you had trouble writing a piece of code, that may mean it deserves a comment.
The API of a library is a special case: it requires documentation (and putting that documentation in the code is often a good idea, especially with tools like Doxygen). Just do not confuse this documentation, intended for users, with the documentation that will be useful for the maintainers of the library.
Comment what cannot be in the code, such as policy requirements that explain why things are the way they are.
Comment background information, such as a reference to the scientific article which describes the clever algorithm you use, or the RFC standardizing the network protocol you implement.
Comment the hacks! Everyone is sometimes forced to use hacks or workarounds, but be nice to the future maintainer and comment them. Read up on "technical debt".
And don't comment the rest. Quantitative rules like "20 % of the lines must be comments" are plainly stupid and clearly intended only for PHBs.
I'm not aware of any tool but I feel it's always good to have some comments in the code if it is to be maintained by someone else in the future. At least, it's good to have header blocks for classes and methods detailing what the class is meant for and what the method does. But yes, it is good to keep the comments as minimal as possible.
I prefer to use comments to explain
what a class/function is intended to do,
what it is not supposed to do,
any assumptions I make that the users of the class/function should adhere to.
For users of the vi editor, the following plug-in is very helpful. We can define templates for class comments, function comments, etc.
csupport plug in
There are no good rules in terms of comment/code ratios. It totally depends on the complexity of your code.
I do follow one (and only one) rule with respect to comments (I like to stay flexible).
The code show how things are done, the comments show what is done.
Some code doesn't need comments at all, due to its obviousness: this can often be achieved by use of good variable names. Mostly, I'll comment a function, then comment major blocks within the function.
I consider this bad:
// Process list by running through the whole list,
// processing each node within the list.
//
void processList (tNode *s) {
    while (s != NULL) {   // Run until reached end of list.
        processNode (s);  // Process the node.
        s = s->nxt;       // Move to next node.
    }
}
since all you're doing there is writing the code thrice. I would prefer something like:
// Process list (or rest of list if you pass a non-start node).
//
void processList (tNode *currentNode) {
    // Run through the list, processing each node.
    while (currentNode != NULL) {
        processNode (currentNode);
        currentNode = currentNode->nextNode;
    }
}
You may argue about this, but I really believe it:
Usually, you don't have to write comments. Simple as that. The code has to be written in such a way that it explains itself; if it doesn't explain itself and you have to write comments, then something is wrong.
There are however some exceptional cases:
You have to write something that is VERY cryptic to gain performance. Here you may need to write some explanation.
You provide a library to some other group/company; it is better to document its API.
There are too many novice programmers in your organization.
I wouldn't be so rude as to say that comments are an excuse for badly programmed code, as some people above have, nor would I say you don't need them.
It also depends on your editor, how you like to see your code in it, and how you would like others to see it.
For instance, I like to create regions in C#. Regions are named collapsible areas of code, in a way commented code containers. That way, when I look at the editor, I am actually looking at pseudo-code.
#region Connect to the database
    // ....
#endregion
#region Prepare tables
    #region Cache tables ...
    #endregion
    #region Fix column names ...
    #endregion
#endregion
This kind of code is more readable than anything else I know, but of course it needs an editor supporting custom folding with names (like the Visual Studio editor, VIM...). Somebody will say that you can achieve something similar if you put the regions into procedures, but first, you can't always do that, and second, you then have to jump to the procedure to see its code. If you simply set hotkeys to open/collapse a region, you can quickly see the code in it while scrolling and reading, and generally move quickly over the hierarchy of regions.
As for line comments, it would be good to write code that documents itself, but unfortunately this can't be done in general. It of course depends on the project, its domain, and its complexity.
As a last note, I fully recommend in-code documentation via a portable, language-independent tool, for instance NaturalDocs, which can be made to work with just about any language and has a natural syntax that doesn't involve XML or any kind of special formatting (hence the name); plus, it doesn't need to be installed more than once.
And if there is someone who doesn't like comments, he can always remove them using some simple tool. I even integrated such a tool in my editor, and the comments are gone via a simple menu click. So comments can't harm the code in any way that can't be fixed very fast.
I say that generally comments are a bad smell. But inline code documentation is great. I've elaborated more on the subject over at robowiki.net:
Why Comments are Bad
I concur with everyone about self-documenting code. And I also agree about the need for special comments when it comes to documentation generation. A short comment at the top of each method/class is useful, especially if your IDE can use it for tooltips in code completion (like Visual Studio).
Another reason for comments that I don't see mentioned here is in type-unsafe languages like JavaScript or PHP. You can specify data types that way, although hungarian notation can help there as well (one of the rare cases for using it properly, I think).
Also, tools like PHPLint can use some special type-related comments to check your code for type-safety.
There are no metrics you can sensibly use for comments. You should never say x lines of code must have y comments, because then you will end up with silly useless comments that simply restate code, and these will degrade the quality of your code.
100 lines of code should have as few comments as possible.
Personally, having used them in the past, I'd not use things like Doxygen to document internal code to the extent of every function and every parameter needing tagged descriptions, because with well-factored code you have many functions, and with good names these tagged descriptions most often don't say any more than the parameter name itself.
My opinion: comments in source code are evil. Code should be self-documenting. Developers usually forget to read and update comments.
As Martin Fowler said: "if you need a comment for a block of lines, just make a new function" (not an exact quote; that's the phrase as I remember it).
It is better to keep separate documentation for utility modules, the basic principles of your project, the organization of libraries, some algorithms, and design ideas.
Almost forgot: I did use code comments once. It was an MFC/COM project, and I left links to MSDN how-to articles near non-trivial solutions/workarounds.
100 lines of source code should be understandable; if not, it should be split up or reorganized into a few functions, which will be more understandable.
Is there a tool to generate a skeleton for comments on the Unix platform for C++?
Vim has plugins for inserting Doxygen comment templates, if you really need this.
Source code should always be documented where needed. People have argued about what to document and what not to document, but I wanted to contribute one more note.
Let's say I have implemented a method that returns a/b.
As the programmer, I am a good citizen, so I will hint to the user what to expect:
/**
 * Will return 0 if b is 0, to prevent the world from exploding.
 */
float divide(float a, float b) {
    if (b == 0) return 0;
    return a/b;
}
I know, it is pretty obvious that nobody would ever create such a method. But the same issue shows up in other places where users of an API can't figure out what a function expects.