Is it possible to use TDD with image processing algorithms? - c++

Recently, I worked on a project where TDD (Test Driven Development) was used. The project was a web application developed in Java and, although unit-testing web applications may not be trivial, it was possible using mocking (we used the Mockito framework).
Now I will start a project where I will use C++ to work with image processing (mostly image segmentation) and I'm not sure whether using TDD is a good idea. The problem is that it is very hard to tell whether the result of a segmentation is right or not, and the same problem applies to many other image processing algorithms.
So, what I would like to know is whether someone here has successfully used TDD with image processing algorithms (not necessarily segmentation algorithms).

At a minimum you can use the tests for regression testing. For example, suppose you have 5 test images for a particular segmentation algorithm. You run the 5 images through the code and manually verify the results. The results, when correct, are stored on disk somewhere, and future executions of these tests compare the generated results to the stored results.
That way, if you ever make a breaking change, you'll catch it, but more importantly you only have to go through a (correct) manual test cycle once.
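A minimal sketch of the comparison step in such a regression test. The LabelImage type and the countMismatches name are made up for illustration, and loading the stored reference from disk is left out because it depends on whatever image library you use:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal image type: one 8-bit segment label per pixel, row-major.
struct LabelImage {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;  // size == width * height
};

// Counts the pixels whose label differs between a fresh result and the
// manually verified reference that was stored earlier.
std::size_t countMismatches(const LabelImage& result, const LabelImage& reference) {
    assert(result.width == reference.width && result.height == reference.height);
    assert(result.pixels.size() == reference.pixels.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < result.pixels.size(); ++i) {
        if (result.pixels[i] != reference.pixels[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A regression test would then assert that countMismatches(result, reference) is 0 for each of the stored images. Reporting the count rather than a plain yes/no makes it easier to judge how bad a failure is.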

Whenever I do any computer-vision related development TDD is almost standard practice. You have images and something you want to measure. Step one is to hand-label a (large) subset of the images. This gives you test data. The process (for full correctness) is then to divide your test set in two: a "development set" and a "verification set". You do repeated development cycles until your algorithm is accurate enough when applied to the development set. Then you verify the result on the verification set (so that you're not overtraining on some weird aspect of your development set).
This is test driven development at its purest.
Note that you're testing two different things when developing heavily algorithm dependent software like this.
1. The regular bugs you'll get in your software. These can be tested using "normal" TDD techniques.
2. The performance of your algorithm, for which you need a system like the one outlined above.
A program can be bug free according to (1) but not quite according to (2). For example, a very simple image segmentation algorithm says: "the left half of the image is one segment, the right half is another segment". This program can be made bug free according to (1) quite easily. It is another matter entirely whether it satisfies your performance needs. Don't confuse the two aspects, and don't let one interfere with the other.
More specifically, I'd advise you to develop the algorithm first, buggy warts and all, and then use TDD with the algorithm (not the code!) and perhaps other requirements of the software as specification for a separate TDDevelopment process. Doing unit tests for small temporary helper functions deep within some reasonably complex algorithm under heavy development is a waste of time and effort.
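To make the "accurate enough on the development set, then verify on the verification set" part concrete, here is a rough sketch of one possible accuracy measure. Intersection over union is my choice for illustration, not something this answer prescribes, and the Mask alias and function names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mask = std::vector<bool>;  // true = foreground, one entry per pixel

// Intersection over union of a predicted mask and a hand-labelled mask.
// Two empty masks are treated as a perfect match.
double intersectionOverUnion(const Mask& predicted, const Mask& labelled) {
    assert(predicted.size() == labelled.size());
    std::size_t inter = 0;
    std::size_t uni = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] && labelled[i]) ++inter;
        if (predicted[i] || labelled[i]) ++uni;
    }
    return uni == 0 ? 1.0 : static_cast<double>(inter) / static_cast<double>(uni);
}

// Mean score over a whole set (development set or verification set).
double meanIoU(const std::vector<Mask>& predictions, const std::vector<Mask>& labels) {
    assert(predictions.size() == labels.size() && !labels.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        sum += intersectionOverUnion(predictions[i], labels[i]);
    }
    return sum / static_cast<double>(labels.size());
}
```

The same meanIoU call is run on both sets; only the development-set number is allowed to guide changes to the algorithm.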

TDD in image processing only makes sense for deterministic problems like:
image arithmetic
histogram generation
and so on..
However TDD is not suitable for feature extraction algorithms like:
edge detection
segmentation
corner detection
... since no algorithm can solve these kinds of problems perfectly for all images.
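For the deterministic cases listed above the test is straightforward. A small sketch for histogram generation, assuming an 8-bit grayscale image stored as a flat buffer (the histogram function is illustrative, not taken from any particular library):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// 256-bin histogram of an 8-bit grayscale image stored as a flat pixel buffer.
std::array<std::size_t, 256> histogram(const std::vector<std::uint8_t>& pixels) {
    std::array<std::size_t, 256> bins{};  // all bins start at zero
    for (std::uint8_t p : pixels) {
        ++bins[p];
    }
    return bins;
}
```

For a tiny hand-made image such as {0, 0, 255, 7} the expected bins can be written down before the code exists, which is exactly what TDD asks for.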

I think the best you can do is test the simple, mathematically well-defined building blocks your algorithm consists of, like linear filters, morphological operations, FFT, wavelet transforms etc. These are often tricky enough to implement efficiently and correctly for all border cases so verifying them does make sense.
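As an illustration of such a building block, here is a sketch of a 3-tap box filter where the border handling is the part worth testing. Replicating the edge sample is an assumption on my side; your filter may use a different border policy, and boxFilter3 is a made-up name:

```cpp
#include <cstddef>
#include <vector>

// 3-tap mean filter on a 1D signal. Samples outside the signal are taken
// to be equal to the nearest border sample (replicated borders).
std::vector<double> boxFilter3(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const double left = in[i == 0 ? 0 : i - 1];
        const double right = in[i + 1 == in.size() ? i : i + 1];
        out[i] = (left + in[i] + right) / 3.0;
    }
    return out;
}
```

Useful border cases are an empty signal, a single sample, and an impulse sitting on the first or last sample.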
For an actual algorithm like image segmentation, TDD doesn't make much sense IMHO. I don't even think unit-tests make sense here. Sure, you can write tests, but those will always be extremely fragile. A typical image processing algorithm needs a few parameters that have to be adjusted for the desired results (a process that can't be automated, and can't be done before the algorithm is working). The results of a segmentation algorithm aren't well defined either, but your unit test can only test for some well-defined property. An algorithm can have that property without doing what you want, or the other way round, so your test result isn't very informative. Also, to test the results of a segmentation algorithm you need to write a lot of pretty hard code, while verifying the results visually is pretty easy and you have to do it anyway.
I think in a way it's similar to unit-testing user interfaces: Testing the actual well-defined functionality (e.g. when the user clicks this button, some item is added to this list and this label shows that text...) is relatively easy and can save a lot of work and debugging. But no test in the world will tell you if your UI is usable, understandable or pretty, because these things just aren't well defined.

We had some discussion on the very same "problem", with many of the remarks mentioned in the comments below the answers here.
We came to the conclusion that TDD in computer vision / image processing (concerning the global goal of segmentation, detection or something like that) could be:
1. Get an image/sequence that should be processed and create a test for that image: desired output and a metric to tell how far your result may differ from that "ground truth".
2. Get another image/sequence for a different setting (different lighting, different objects or something like that), where your algorithm fails, and write a test for that.
3. Improve your algorithm in a way that it solves all previous tests.
4. Go back to 2.
No idea whether this is applicable; creating the tests will be much more complex than in traditional TDD since it might be hard to define the allowed differences between your ground truth and your algorithm output.
Probably it's better to just use some QualityDrivenDevelopment where your changes just shouldn't make things "worse" (you again have to find a metric for that) than before.
Obviously you can still use traditional unit testing for the deterministic parts of those algorithms, but that's not the real problem of "TDD-in-signal-processing".

The image processing tests that you describe in your question take place at a much higher level than most of the tests that you will write using TDD.
In a true Test Driven Development process you will first write a failing test before adding any new functionality to your software, then write the code that causes the test to pass, rinse and repeat.
This process yields a large library of Unit Tests, sometimes with more LOC of tests than functional code!
Because your analytic algorithms have structured behavior, they would be an excellent match for a TDD approach.
But I think the question you are really asking is "how do I go about executing a suite of Integration Tests against fuzzy image processing software?" You might think I am splitting hairs, but this distinction between Unit Tests and Integration Tests really gets to the heart of what Test Driven Development means. The benefits of the TDD process come from the rich supporting fabric of Unit Tests more than anything else.
In your case I would compare the Integration Test suite to automated performance metrics against a web application. We want to accumulate a historical record of execution times, but we probably don't want to explicitly fail the build for a single poorly performing execution (which might have been affected by network congestion, disk I/O, whatever). You might set some loose tolerances around performance of your test suite and have the Continuous Integration server kick out daily reports that give you a high level overview of the performance of your algorithm.

I'd say TDD is much easier in such an application than in a web one. You have a completely deterministic algorithm you have to test. You don't have to worry about fuzzy stuff like user input and HTML rendering.
Your algorithm consists of a number of steps. Each of these steps can be tested. If you give them fixed, known input, they should yield fixed, known output. So write a test for that. You can't test that the algorithm "is correct" in general, but you can give it data for which you've already precomputed the correct result, so you can verify that it yields the correct output in that case.

I am not really into your problem, so I don't know its hot spots. However, the final result of your algorithm is hopefully deterministic, so you can perform functional testing on it. Of course, you will have to determine a "known good" result. I know of TDD performed on graphic libraries (VTK, to be precise). The comparison is done on the final result image, pixel by pixel. Without going in so much detail, if you have a known good result, you can perform an md5 of the test result and compare it against the md5 of the known-good.
For unit testing, I am pretty sure you can test individual routines. This will force you to have a very fine-grained development style.
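A sketch of the checksum comparison. The C++ standard library has no md5, so this uses 64-bit FNV-1a purely as a stand-in to show the idea; in practice you would plug in whatever hash your project already has. The function name is made up:

```cpp
#include <cstdint>
#include <vector>

// 64-bit FNV-1a over a raw pixel buffer. Used here instead of md5 only to
// keep the example free of external dependencies.
std::uint64_t fnv1a64(const std::vector<std::uint8_t>& bytes) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (std::uint8_t b : bytes) {
        hash ^= b;
        hash *= 1099511628211ULL;  // FNV prime
    }
    return hash;
}
```

The test stores the hash of the known-good image once and later asserts that the hash of the fresh result is identical. Keep in mind that a hash only tells you that something changed; to see where, you still need the pixel-by-pixel comparison.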

Might want to take a look at this paper

If your goal is to optimize an algorithm rather than verifying correctness you need a metric. A good metric would measure the performance criteria underlying your algorithm. For a segmentation algorithm this could be the sum of standard deviations of pixel data within each segment. Using the metric you can use threshold levels of acceptance or rank versions of the algorithm.
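A rough sketch of that metric, assuming grayscale pixel values and one integer segment label per pixel. Using the population standard deviation is my assumption, and sumOfSegmentStdDevs is a hypothetical name:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Sum, over all segments, of the population standard deviation of the pixel
// values that belong to the segment. Lower means more homogeneous segments.
double sumOfSegmentStdDevs(const std::vector<double>& pixels,
                           const std::vector<int>& labels) {
    assert(pixels.size() == labels.size());
    struct Acc {
        double sum = 0.0;
        double sumSq = 0.0;
        std::size_t n = 0;
    };
    std::map<int, Acc> perSegment;
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        Acc& a = perSegment[labels[i]];
        a.sum += pixels[i];
        a.sumSq += pixels[i] * pixels[i];
        ++a.n;
    }
    double total = 0.0;
    for (const auto& entry : perSegment) {
        const Acc& a = entry.second;
        const double mean = a.sum / static_cast<double>(a.n);
        const double variance = a.sumSq / static_cast<double>(a.n) - mean * mean;
        total += std::sqrt(variance > 0.0 ? variance : 0.0);  // clamp rounding noise
    }
    return total;
}
```

Two versions of the algorithm can then be ranked by this number on the same images, or a build can be rejected when the number exceeds an agreed threshold.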

You can use a statistical approach where you have many examples and correct outcomes, and the test runs all of them and evaluates the algorithm on them. It then produces a single number that is the combined success rate of all of them.
This way you are less sensitive to specific failures and your test is more robust.
You can then use a threshold on the success rate to see if the test failed or not.
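Something along these lines, as a sketch. The Example struct and the int-to-int algorithm signature are placeholders; in a real suite the input would be an image and the expected value a label or measurement:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Placeholder for one labelled example.
struct Example {
    int input;
    int expected;
};

// Fraction of examples for which the algorithm returns the expected outcome.
double successRate(const std::vector<Example>& examples,
                   const std::function<int(int)>& algorithm) {
    if (examples.empty()) {
        return 0.0;
    }
    std::size_t passed = 0;
    for (const Example& e : examples) {
        if (algorithm(e.input) == e.expected) {
            ++passed;
        }
    }
    return static_cast<double>(passed) / static_cast<double>(examples.size());
}
```

The test itself is then a single assertion such as successRate(examples, algorithm) >= 0.7, where 0.7 is whatever threshold you settled on.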

Related

Test driven development for signal processing libraries

I work with audio manipulation, generally using Matlab for prototyping, and C++ for implementation. Recently, I have been reading up on TDD. I have looked over a few basic examples and am quite enthusiastic about the paradigm.
At the moment, I use what I would consider a global 'test-assisted' approach. For this, I write signal processing blocks in C++, and then I make a simple Matlab mex file that can interface with my classes. I subsequently add functionality, checking that the results match up with an equivalent Matlab script as I go. This works ok, but the tests become obsolete quickly as the system evolves. Furthermore, I am testing the whole system, not just units.
It would be nice to use an established TDD framework where I can have a test suite, but I don't see how I can validate the functionality of the processing blocks without tests that are equally as complex as the code under test. How would I generate the reference signals in a C++ test to validate a processing block without the test being a form of self-fulfilling prophecy?
If anyone has experience in this area, or can suggest some methodologies that I could read into, then that would be great.
I think it's great to apply the TDD approach to signal processing (it would have saved me months of time if I knew about it years ago when I was doing signal processing myself). I think the key is to break down your system into the lowest level components that can be independently tested, eg:
FFTs: test signals at known frequencies: DC, Fs/Nfft, Fs/2 and different phases etc. Check the peaks and phase are as you expect, check the normalisation constant is as you expect
peak picking: test that you correctly find maxima/minima
Filters: generate input at known frequencies and check the output amplitude and phase is as expected.
You are unlikely to get exactly the same results out between C++ and Matlab, so you'll have to supply error bounds on some of the tests. TDD is a great way of not only verifying the correctness of the code you have but is really useful when trying out different implementations. For example if you want to replace one FFT implementation with another, there are often slight differences with the way the data is packed, or the normalisation constant that is used. TDD will give you a high degree of confidence the new library is correctly integrated.
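A sketch of the "known frequency" test with an error bound. To keep it self-contained a naive, unnormalised DFT stands in for the FFT under test; in a real suite you would call your FFT instead. Both function names are made up:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) DFT without normalisation. Stand-in for the FFT under test.
std::vector<std::complex<double>> dft(const std::vector<double>& x) {
    const std::size_t n = x.size();
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<std::complex<double>> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -twoPi * static_cast<double>(k * t) / static_cast<double>(n);
            X[k] += x[t] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
    }
    return X;
}

// Unit-amplitude cosine whose frequency falls exactly on the given bin.
std::vector<double> cosineAtBin(std::size_t n, std::size_t bin) {
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<double> x(n);
    for (std::size_t t = 0; t < n; ++t) {
        x[t] = std::cos(twoPi * static_cast<double>(bin * t) / static_cast<double>(n));
    }
    return x;
}
```

With this normalisation a cosine on bin 3 of a 16-point transform should give magnitude 8 in bins 3 and 13 and (nearly) zero elsewhere, and the test checks that within a small tolerance. A library that scales by 1/N would need the expected peak adjusted, which is the kind of difference such a test surfaces.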
I do something similar for heuristics detection, and we have loads and loads of capture files and a framework to be able to load and inject them for testing. Do you have the possibility to capture the reference signals in a file and do the same?
As for my 2 cents regarding TDD, it's a great way to develop, but as with most paradigms, you don't always have to follow it to the letter. There are times when you should know how to bend the rules a bit, so as not to write too much throw-away code/tests. I read about one approach that said absolutely no code should be written until a test is developed, which at times can be way too strict.
On the other hand, I always like to say: "If it's not tested, it's broken" :)
It's OK for the test to be as complex or more complex than the code under development. If you change (update, refactor, bug fix) the code and not the test, the unit test will warn you that something changed and needs to be reviewed (was a bug fix for mode A supposed to change mode B?, etc.)
Furthermore, you can maintain the APIs for the individual compute components, and not just for the entire end-to-end system.
I've only just started thinking about TDD in the context of signal processing, so I can only add a bit to the previous answers. What I've done is exploit a bit of superposition to test primitives. For example, testing an IIR filter, I independently verified b0, b1, and b2 elements with unit and scaled gains, and then verified a1 and a2 elements that followed easily modeled decays. My test signal was a combination of ramp functions for the numerator and impulse functions for the denominator. I know it's a trivial example, but the process should work for plenty of linear operations. Tests should also exercise unstable regions and show that outputs explode appropriately.
In general, I expect that impulse responses are going to do a lot of the work for me, since many situations will see them reduce to trigonometric functions, which can be independently calculated. Similarly, if your operation has a series expansion, your test function could perform the expansion to a relevant order and compare against your processing block. It'll be slow, but it should work.
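A sketch of the impulse-response idea for a biquad. The direct form I structure and the sign convention (feedback terms subtracted) are assumptions for this example, so check them against your own filter; Biquad and filter are hypothetical names:

```cpp
#include <cstddef>
#include <vector>

struct Biquad {
    double b0, b1, b2;  // feed-forward coefficients
    double a1, a2;      // feedback coefficients (a0 assumed to be 1)
};

// Direct form I:
// y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
std::vector<double> filter(const Biquad& c, const std::vector<double>& x) {
    std::vector<double> y(x.size());
    double x1 = 0.0, x2 = 0.0, y1 = 0.0, y2 = 0.0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        const double y0 = c.b0 * x[n] + c.b1 * x1 + c.b2 * x2 - c.a1 * y1 - c.a2 * y2;
        y[n] = y0;
        x2 = x1;
        x1 = x[n];
        y2 = y1;
        y1 = y0;
    }
    return y;
}
```

With the feedback terms zeroed, an impulse should simply read back b0, b1, b2. With only a1 = -0.5 set (and b0 = 1), the impulse response should be the decay 1, 0.5, 0.25, ..., which can be computed independently of the filter code.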

Is my code really not unit-testable?

A lot of code in a current project is directly related to displaying things using a 3rd-party 3D rendering engine. As such, it's easy to say "this is a special case, you can't unit test it". But I wonder if this is a valid excuse... it's easy to think "I am special" but rarely actually the case.
Are there types of code which are genuinely not suited for unit-testing? By suitable, I mean "without it taking longer to figure out how to write the test than is worth the effort"... dealing with a ton of 3D math/rendering it could take a lot of work to prove the output of a function is correct compared with just looking at the rendered graphics.
Code that directly relates to displaying information, generating images and even general UI stuff, is sometimes hard to unit-test.
However that mostly applies only to the very top level of that code. Usually 1-2 method calls below the "surface" is code that's easily unit tested.
For example, it may be nontrivial to test that some information is correctly animated into the dialog box when a validation fails. However, it's very easy to check if the validation would fail for any given input.
Make sure to structure your code in a way that the "non-testable" surface area is well-separated from the rest and write extensive tests for the non-surface code.
The point of unit-testing your rendering code is not to demonstrate that the third-party-code does the right thing (that is for integration and regression testing). The point is to demonstrate that your code gives the right instructions to the third-party code. In other words, you only have to control the input of your code layer and verify the output (which would become the input of the renderer).
Of course, you can create a mock version of the renderer which does cheap ASCII graphics or something, and then verify the pseudo-graphics. This can make the test clearer, but it is not strictly necessary for a unit test of your code.
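A minimal sketch of "verify the instructions, not the pixels". Everything here is hypothetical: the Renderer interface stands for a thin wrapper you would put in front of the third-party engine, and drawSquare stands for your own code:

```cpp
#include <cstddef>
#include <vector>

struct Line {
    double x0, y0, x1, y1;
};

// Hypothetical thin interface in front of the third-party engine.
class Renderer {
public:
    virtual ~Renderer() = default;
    virtual void drawLine(const Line& line) = 0;
};

// Test double: records the instructions it receives instead of drawing.
class RecordingRenderer : public Renderer {
public:
    void drawLine(const Line& line) override { lines.push_back(line); }
    std::vector<Line> lines;
};

// Code under test: turns a square into four line instructions.
void drawSquare(Renderer& r, double x, double y, double side) {
    r.drawLine({x, y, x + side, y});
    r.drawLine({x + side, y, x + side, y + side});
    r.drawLine({x + side, y + side, x, y + side});
    r.drawLine({x, y + side, x, y});
}
```

The unit test passes a RecordingRenderer to drawSquare and checks the recorded lines. Whether the real engine then rasterises those lines correctly is a matter for integration and regression tests.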
If you cannot break your code into units, it is very hard to unit test.
My guess would be that if you have 3D atomic functions (say translate, rotate, and project a point) they should be easily testable - create a set of test points and test whether the transformation takes a point to where it should.
If you can only reach the 3D code through a limited API, then it would be hard to test.
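For the atomic functions mentioned above, a sketch could look like this. Point3, translate, rotateZ and nearlyEqual are illustrative names, not the API of any particular engine:

```cpp
#include <cmath>

struct Point3 {
    double x, y, z;
};

Point3 translate(const Point3& p, double dx, double dy, double dz) {
    return {p.x + dx, p.y + dy, p.z + dz};
}

// Counter-clockwise rotation about the z axis in a right-handed system.
Point3 rotateZ(const Point3& p, double radians) {
    const double c = std::cos(radians);
    const double s = std::sin(radians);
    return {c * p.x - s * p.y, s * p.x + c * p.y, p.z};
}

// Component-wise comparison with an absolute tolerance.
bool nearlyEqual(const Point3& a, const Point3& b, double eps) {
    return std::fabs(a.x - b.x) <= eps && std::fabs(a.y - b.y) <= eps &&
           std::fabs(a.z - b.z) <= eps;
}
```

A test point such as (1, 0, 5) rotated by a quarter turn should land on (0, 1, 5) up to rounding, so the check goes through nearlyEqual rather than exact equality.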
Please see Misko Hevery's Testability posts and his testability guide.
I think this is a good question. I wrestle with this all the time, and it seems like there are certain types of code that fit into the unit testing paradigm and other types that do not.
What I consider clearly unit-testable is code that obviously has room for being wrong. Examples:
Code to compute hairy math or linear algebra functions. I always write an auxiliary function to check the answers, and run it once in a while.
Hairy data structure code, with cross-references, guids, back-pointers, and methods for incrementally keeping it consistent. These are really easy to break, so unit tests are good for seeing if they are broken.
On the other hand, in code with low redundancy, if the code compiles it may not be clear what being wrong even means. For example, I do pretty complicated UIs using dynamic dialogs, and it's not clear what to test. All the kinds of things like event handling, layout, and showing / hiding / updating of controls that might make this code error-prone are simply dealt with in a well-verified layer underneath.
The kind of testing I find myself needing more than unit-testing is coverage testing. Have I tried all the possible features and combinations of features? Since this is a very large space and it is prohibitive to write automated tests to cover it, I often find myself doing monte-carlo testing instead, where feature selections are chosen at random and submitted to the system. Then the result is examined in an automatic and / or manual way.
If you can grab the rendered image, you can unit test it.
Simply render some images with the current codebase, see if they "look right" (examining them down to the pixel if you have to), and store them for comparison. Your unit tests could then compare to those stored images and see if the result is the same.
Whether or not this is worth the trouble, that's for you to decide.
Break down the rendering into steps and test by comparing the frame buffer for each step to a known good image.
No matter what you have, it can be broken down to numbers which can be compared. The real trick is when you have some random number generator in the algorithm, or some other nondeterministic part.
With things like floating point, you might need to subtract the generated data from the expected data and check that the difference is less than some error threshold.
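A sketch of that subtract-and-threshold check for two float buffers. The function name is made up, and the threshold you compare against depends on your data:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute per-element difference between generated and expected data.
double maxAbsDifference(const std::vector<float>& generated,
                        const std::vector<float>& expected) {
    assert(generated.size() == expected.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < generated.size(); ++i) {
        const double diff = std::fabs(static_cast<double>(generated[i]) -
                                      static_cast<double>(expected[i]));
        worst = std::max(worst, diff);
    }
    return worst;
}
```

The test then asserts something like maxAbsDifference(generated, expected) < 1e-6, with the threshold chosen for the precision you actually need.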
Well you can't unit test certain kinds of exception code but other than that ...
I've got true unit tests for some code that looks impossible to even attach a test harness to and code that looks like it should be unit testable but isn't.
One of the ways you know your code is not unit testable is when it depends on the physical characteristics of the device it runs on. Another kind of not unit-testable code is direct UI code (and I find a lot of breaks in direct UI code).
I've also got a huge chunk of non unit-testable code that has appropriate integration tests.

Unit Testing Machine Learning Code

I am writing a fairly complicated machine learning program for my thesis in computer vision. It's working fairly well, but I need to keep trying out new things out and adding new functionality. This is problematic because I sometimes introduce bugs when I am extending the code or trying to simplify an algorithm.
Clearly the correct thing to do is to add unit tests, but it is not clear how to do this. Many components of my program produce a somewhat subjective answer, and I cannot automate sanity checks.
For example, I had some code that approximated a curve with a lower-resolution curve, so that I could do computationally intensive work on the lower-resolution curve. I accidentally introduced a bug into this code, and only found it through a painstaking search when the results of my entire program got slightly worse.
But, when I tried to write a unit-test for it, it was unclear what I should do. If I make a simple curve that has a clearly correct lower-resolution version, then I'm not really testing out everything that could go wrong. If I make a simple curve and then perturb the points slightly, my code starts producing different answers, even though this particular piece of code really seems to work fine now.
You may not appreciate the irony, but basically what you have there is legacy code: a chunk of software without any unit tests. Naturally you don't know where to begin. So you may find it helpful to read up on handling legacy code.
The definitive thought on this is Michael Feathers' book, Working Effectively with Legacy Code. There used to be a helpful summary of that on the ObjectMentor site, but alas the website has gone the way of the company. However WELC has left a legacy in reviews and other articles. Check them out (or just buy the book), although the key lessons are the ones which S.Lott and tvanfosson cover in their replies.
2019 update: I have fixed the link to the WELC summary with a version from the Wayback Machine web archive (thanks #milia).
Also - and despite knowing that answers which comprise mainly links to other sites are low quality answers :) - here is a link to a new (2019 new) Google tutorial on Testing and Debugging ML code. I hope this will be of illumination to future Seekers who stumble across this answer.
"then I'm not really testing out everything that could go wrong."
Correct.
The job of unit tests is not to test everything that could go wrong.
The job of unit tests is to test that what you have does the right thing, given specific inputs and specific expected results. The important part here is the specific visible, external requirements are satisfied by specific test cases. Not that every possible thing that could go wrong is somehow prevented.
Nothing can test everything that could go wrong. You can write a proof, but you'll be hard-pressed to write tests for everything.
Choose your test cases wisely.
Further, the job of unit tests is to test that each small part of the overall application does the right thing -- in isolation.
Your "code that approximated a curve with a lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation. The integrated whole could also be tested to be sure that it works.
Your "computationally intensive work on the lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation.
That point of unit testing is to create small, correct units that are later assembled.
Without seeing your code, it's hard to tell, but I suspect that you are attempting to write tests at too high a level. You might want to think about breaking your methods down into smaller components that are deterministic and testing these. Then test the methods that use these methods by providing mock implementations that return predictable values from the underlying methods (which are probably located on a different object). Then you can write tests that cover the domain of the various methods, ensuring that you have coverage of the full range of possible outcomes. For the small methods you do so by providing values that represent the domain of inputs. For the methods that depend on these, by providing mock implementations that return the range of outcomes from the dependencies.
Your unit tests need to employ some kind of fuzz factor, either by accepting approximations, or using some kind of probabilistic checks.
For example, if you have some function that returns a floating point result, it is almost impossible to write a test that works correctly across all platforms. Your checks would need to perform the approximation.
TEST_ALMOST_EQ(result, 4.0);
Above TEST_ALMOST_EQ might verify that result is between 3.9 and 4.1 (for example).
Alternatively, if your machine learning algorithms are probabilistic, your tests will need to accommodate for it by taking the average of multiple runs and expecting it to be within some range.
double sum = 0.0;
for (int i = 0; i < 100; ++i) {
    sum += result_probabilistic_test();  // one run of the probabilistic code under test
}
double avg = sum / 100;
TEST_RANGE(avg, 10.0, 15.0);
Of course, the tests are non-deterministic, so you will need to tune them such that you can get non-flaky tests with a high probability (e.g., increase the number of trials, or increase the range of error).
You can also use mocks for this (e.g., a mock random number generator for your probabilistic algorithms), and they usually help for deterministically testing specific code paths, but they are a lot of effort to maintain. Ideally, you would use a combination of fuzzy testing and mocks.
HTH.
Generally, for statistical measures you would build in an epsilon for your answer, i.e. the mean square difference of your points would be < 0.01 or some such. Another option is to run several times and if it fails "too often" then you have an issue.
Get an appropriate test dataset (maybe a subset of what you're usually using)
Calculate some metric on this dataset (e.g. the accuracy)
Note down the value obtained (cross-validated)
This should give an indication of what to set the threshold for
Of course it can be that when making changes to your code the performance on the dataset will increase a little, but if it ever decreases by a large amount this would be an indication that something is going wrong.

Testing When Correctness is Poorly Defined?

I generally try to use unit tests for any code that has easily defined correct behavior given some reasonably small, well-defined set of inputs. This works quite well for catching bugs, and I do it all the time in my personal library of generic functions.
However, a lot of the code I write is data mining code that basically looks for significant patterns in large datasets. Correct behavior in this case is often not well defined and depends on a lot of different inputs in ways that are not easy for a human to predict (i.e. the math can't reasonably be done by hand, which is why I'm using a computer to solve the problem in the first place). These inputs can be very complex, to the point where coming up with a reasonable test case is near impossible. Identifying the edge cases that are worth testing is extremely difficult. Sometimes the algorithm isn't even deterministic.
Usually, I do the best I can by using asserts for sanity checks and creating a small toy test case with a known pattern and informally seeing if the answer at least "looks reasonable", without it necessarily being objectively correct. Is there any better way to test these kinds of cases?
I think you just need to write unit tests based on small sets of data that will make sure that your code is doing exactly what you want it to do. If this gives you a reasonable data-mining algorithm is a separate issue, and I don't think it is possible to solve it by unit tests. There are two "levels" of correctness of your code:
Your code is correctly implementing the given data mining algorithm (this thing you should unit-test)
The data mining algorithm you implement is "correct" - solves the business problem. This is a quite open question, it probably depends both on some parameters of your algorithm as well as on the actual data (different algorithms work for different types of data).
When facing cases like this I tend to build one or more stub data sets that reflect the proper underlying complexities of the real-life data. I often do this together with the customer, to make sure I capture the essence of the complexities.
Then I can just codify these into one or more datasets that can be used as basis for making very specific unit tests (sometimes they're more like integration tests with stub data, but I don't think that's an important distinction). So while your algorithm may have "fuzzy" results for a "generic" dataset, these algorithms almost always have a single correct answer for a specific dataset.
Well, there are a few answers.
First of all, as you mentioned, take a small case study, and do the math by hand. Since you wrote the algorithm, you know what it's supposed to do, so you can do it in a limited case.
The other one is to break down every component of your program into testable parts.
If A calls B calls C calls D, and you know that A,B,C,D, all give the right answer, then you test A->B, B->C, and C->D, then you can be reasonably sure that A->D is giving the correct response.
Also, if there are other programs out there that do what you are looking to do, try to acquire their datasets. Or find an open-source project that you could use test data against, and see if your application is giving similar results.
Another way to test datamining code is by taking a test set, and then introducing a pattern of the type you're looking for, and then test again, to see if it will separate out the new pattern from the old ones.
And, the tried and true, walk through your own code by hand and see if the code is doing what you meant it to do.
Really, the challenge here is this: because your application is meant to do a fuzzy, non-deterministic kind of task in a smart way, the very goal you hope to achieve is that the application becomes better than human beings at finding these patterns. That's great, powerful, and cool ... but if you pull it off, then it becomes very hard for any human beings to say, "In this case, the answer should be X."
In fact, ideally the computer would say, "Not really. I see why you think that, but consider these 4.2 terabytes of information over here. Have you read them yet? Based on those, I would argue that the answer should be Z."
And if you really succeeded in your original goal, the end user might sometimes say, "Zowie, you're right. That is a better answer. You found a pattern that is going to make us money! (or save us money, or whatever)."
If such a thing could never happen, then why are you asking the computer to detect these kinds of patterns in the first place?
So, the best thing I can think of is to let real life help you build up a list of test scenarios. If there ever was a pattern discovered in the past that did turn out to be valuable, then make a "unit test" that sees if your system discovers it when given similar data. I say "unit test" in quotes because it may be more like an integration test, but you may still choose to use NUnit or VS.Net or RSpec or whatever unit test tools you're using.
For some of these tests, you might somehow try to "mock" the 4.2 terabytes of data (you won't really mock the data, but at some higher level you'd mock some of the conclusions reached from that data). For others, maybe you have a "test database" with some data in it, from which you expect a set of patterns to be detected.
Also, if you can do it, it would be great if the system could "describe its reasoning" behind the patterns it detects. This would let the business user deliberate over the question of whether the application was right or not.
This is tricky. This sounds similar to writing tests around our text search engine. If you keep struggling, you'll figure something out:
Start with a small, simplified but reasonably representative data sample, and test basic behavior doing this
Rather than asserting that the output is exactly some answer, sometimes it's better to figure out what is important about it. For example, for our search engine, I didn't care so much about the exact order the documents were listed, as long as the three key ones were on the first page of results.
As you make a small, incremental change, figure out what the essence of it is and write a test for that. Even though the overall calculations take many inputs, individual changes to the codebase should be isolatable. For example, we found certain documents weren't being surfaced because of the presence of hyphens in some of the key words. We created tests to check that this behaved the way we expected.
Look at tools like FitNesse, which allow you to throw a large number of datasets at a piece of code and assert things about the results. This may be easier to understand than more traditional unit tests.
I've gone back to the product owner, saying "I can't understand how this will work. How will we know if it's right?" Maybe s/he can articulate the essence of the vaguely defined problem. This has worked really well for me many times, and I've talked people out of features because they couldn't be explained.
Be creative!
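Two of the suggestions above (assert only what matters, and write a test for the essence of each small change) might look like the following sketch. The toy `search` function, its hyphen-dropping normalization, and the `DOCS` sample are assumptions made up for illustration, not the real search engine.

```python
import re


def tokens(text):
    # Hypothetical normalization: drop hyphens so "e-mail" matches "email".
    return re.findall(r"[a-z0-9]+", text.lower().replace("-", ""))


def search(query, docs, page_size=10):
    """Return the first page of document ids, ranked by matching term count."""
    terms = set(tokens(query))
    scored = []
    for doc_id, text in docs.items():
        score = len(terms & set(tokens(text)))
        if score:
            scored.append((score, doc_id))
    # Highest score first; ties broken by id so results are stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:page_size]]


# Small, simplified, but reasonably representative sample.
DOCS = {
    "a": "Setting up e-mail filters",
    "b": "Email server configuration",
    "c": "Troubleshooting email delivery",
    "d": "Office seating chart",
}


def test_key_documents_on_first_page():
    first_page = search("email", DOCS)
    # Exact order is deliberately not asserted; only membership matters.
    assert {"a", "b", "c"} <= set(first_page)


def test_hyphenated_keyword_is_surfaced():
    # The essence of the hyphen fix, isolated from ranking concerns.
    assert "a" in search("email", DOCS)
```

Because neither test asserts the exact ordering, the ranking can be tuned later without breaking them.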
Ultimately, you have to decide what your program should be doing, and then test for that.

How is unit testing better than just testing the entire output of your application as a whole?

I don't understand how a unit test could possibly benefit.
Isn't it sufficient for a tester to test the entire output as a whole rather than doing unit tests?
Thanks.
What you are describing is integration testing. What integration testing will not tell you is which piece of your massive application is not working correctly when your output is no longer correct.
The advantage to unit testing is that you can write a test for each business assumption or algorithm step that you need your program to perform. When someone adds or changes code to your application, you immediately know exactly which step, which piece, and maybe even which line of code is broken when a bug is introduced. The time savings on maintenance for that reason alone makes it worthwhile, but there is an even bigger advantage in that regression bugs are caught as soon as they are introduced (assuming your tests are running automatically when you build your software). If you fix a bug, and then write a test specifically to catch that bug in the future, nobody can accidentally reintroduce it without that test failing.
The combination of integration testing and unit testing can let you sleep much easier at night, especially when you've checked in a big piece of code that day.
The earlier you catch bugs, the cheaper they are to fix. A bug found during unit testing by the coder is pretty cheap (just fix the darn thing).
A bug found during system or integration testing costs more, since you have to fix it and restart the test cycle.
A bug found by your customer will cost a lot: recoding, retesting, repackaging and so forth. It may also leave a painful boot print on your derriere when you inform management that you didn't catch it during unit testing because you didn't do any, thinking that the system testers would find all the problems :-)
How much money would it cost GM to recall 10,000 cars because the catalytic converter didn't work properly?
Now think of how much it would cost them if they discovered that immediately after those converters were delivered to them, but before they were put into those 10,000 cars.
I think you'll find the latter option to be quite a bit cheaper.
That's one reason why test driven development and continuous integration are (sometimes) a good thing - testing is done all the time.
In addition, unit tests don't check that the program works as a whole, just that each little bit performs as expected. That's often quite a lot more than higher level tests would check.
From my experience:
Integration and functional testing tend to be more indicative of the overall quality of the system than a unit test suite is.
High level testing (functional, acceptance) is a QA tool.
Unit testing is a development tool, especially in a TDD context, where unit tests become more of a design instrument than a quality assurance one.
As a result of better design, quality of the entire system improves (indirectly).
A passing unit test suite is meant to ensure that a single component conforms to the developer's intentions (correctness). Acceptance testing is the level that covers validity of the system (i.e. the system does what the user wants it to do).
Summary:
Unit test is meant as a development tool first, QA tool second.
Acceptance test is meant as a QA tool.
There is still a need for a certain level of manual testing to be performed but unit testing is used to decrease the number of defects that make it to that stage. Unit testing tests the smallest parts of the system and if they all work the chances of the application as a whole working correctly are increased significantly.
It also assists when adding new features since regression testing can be performed quickly and automatically.
For a complex enough application, testing the entire output as a whole may not cover enough different possibilities. For example, any given application has a huge number of different code paths that can be followed depending on input. In typical testing, there may be many parts of your code that are simply never encountered, because they are only used in certain circumstances, so you can't be sure that code which isn't run in your test situation actually works. Also, errors in one section of code may be masked a majority of the time by something else in another section of code, so you may never discover some errors.
It is better to test each function or class separately. That way, the test is easier to write, because you are only testing a certain small section of the code. It's also easier to cover every possible code path when testing, and if you test each small part separately then you can detect errors even when those errors would often be masked by other parts of your code when run in your application.
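To make the point about code paths concrete, here is a small sketch. The function `shipping_cost` and its pricing rules are invented for the example; what matters is that each branch, including the rarely hit ones, gets its own direct test.

```python
def shipping_cost(weight_kg, express=False):
    """Hypothetical pricing rule with several distinct code paths."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    if weight_kg <= 1:
        base = 5.0                            # flat rate for small parcels
    elif weight_kg <= 10:
        base = 5.0 + 2.0 * (weight_kg - 1)    # per-kg surcharge
    else:
        base = 30.0                           # capped rate for heavy parcels
    return base * 2 if express else base


def test_small_parcel():
    assert shipping_cost(1) == 5.0


def test_medium_parcel():
    assert shipping_cost(3) == 9.0


def test_heavy_parcel_is_capped():
    assert shipping_cost(25) == 30.0


def test_express_doubles_cost():
    assert shipping_cost(3, express=True) == 18.0


def test_invalid_weight_is_rejected():
    try:
        shipping_cost(0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for non-positive weight")
```

A whole-application test run might never ship a 25 kg parcel or pass a zero weight, so those branches would go unexercised without tests at this level.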
Do yourself a favor and try out unit testing first. I was quite the skeptic myself until I realized just how darned helpful/powerful unit tests can be. If you think about it, they aren't really there to add to your workload. They are there to provide you with peace of mind and allow you to continue extending your application while ensuring that your code is solid. You get immediate feedback as to when you may have broken something, and this is something of extraordinary value.
To your question regarding why to test small sections of code, consider this: Suppose your giant app uses a cool XOR encryption scheme that you wrote and eventually product management changes the requirements of how you generate these encrypted strings. So you say: "Heck, I wrote the encryption routine so I'll go ahead and make the change. It'll take me 15 minutes and we'll all go home and have a party." Well, perhaps you introduced a bug during this process. But wait!!! Your handy dandy TestXOREncryption() test method immediately tells you that the actual output did not match the expected output. Bingo, this is why you broke down your unit tests ahead of time into small "units" to test, because in your big giant application you would not have figured this out nearly as fast.
Also, once you get into the frame of mind of regularly writing unit tests you'll realize that although you pay an upfront cost in the beginning in terms of time, you'll get that back tenfold later in the development cycle when you can quickly identify areas in your code that have introduced problems.
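A test along the lines of the TestXOREncryption() mentioned above could look roughly like this. The `xor_encrypt` routine is a guess at what such a scheme might be (a repeating-key XOR); the real routine in that application is not shown here.

```python
def xor_encrypt(data: bytes, key: bytes) -> bytes:
    """Hypothetical repeating-key XOR; applying it twice restores the input."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def test_xor_encryption_known_vector():
    # 'A' (0x41) ^ 0x01 = '@' (0x40); 'B' (0x42) ^ 0x01 = 'C' (0x43)
    assert xor_encrypt(b"AB", b"\x01") == b"@C"


def test_xor_encryption_round_trip():
    message, key = b"attack at dawn", b"k3y"
    assert xor_encrypt(xor_encrypt(message, key), key) == message
```

If a 15-minute change alters the output for a fixed input, the known-vector test fails right away and points at this one routine rather than at the application as a whole.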
There is no magic bullet with unit tests because your ability to identify problems is only as good as the tests you write. It boils down to delivering a better product and relieving yourself of stress and headaches. =)
Agree with most of the answers. Let's drill down on the topic of speed. Here are some real numbers:
Unit test results in 1 or 2 minutes from a fresh compile. As true unit tests (no interaction with external systems like dbs) they can cover a lot of logic really fast.
Automated functional test results in 1 or 2 hours. These run on a simplified platform, but sometimes cover multiple systems and the database - which really kills the speed.
Automated integration test results once a day. These exercise the full meal deal, but are so heavy and slow, we can only execute them once a day and it takes a few hours.
Manual regression results come in after a few weeks. We get stuff over to testers a few times a day, but your change isn't realistically regressed for a week or two at best.
I want to find out what I broke in 1 or 2 minutes, not a few weeks, not even a few hours. That's where the tenfold ROI on unit tests that people talk about comes from.
This is a tough question to approach because it questions something of such enormous breadth. Here's my short answer, however:
Test Driven Development (or TDD) seeks to prove that every logical unit of an application (or block of code) functions exactly as it should. By making tests as automated as possible for productivity's sake, how could this really be harmful?
By testing every logical piece of code, you can trust the usage of the code up some hierarchy. Say I build an application that relies on a thread-safe stack implementation. Shouldn't the stack be guaranteed to work before I build anything on top of it?
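For the thread-safe stack example, a sketch of what "verified before I build on it" could mean is shown below. The `ThreadSafeStack` class is a hypothetical lock-based implementation. A test like this cannot prove the absence of races; it only checks that nothing is lost or duplicated under concurrent use.

```python
import threading


class ThreadSafeStack:
    """Hypothetical stack guarded by a single lock."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def push(self, item):
        with self._lock:
            self._items.append(item)

    def pop(self):
        with self._lock:
            if not self._items:
                raise IndexError("pop from empty stack")
            return self._items.pop()

    def __len__(self):
        with self._lock:
            return len(self._items)


def test_concurrent_pushes_lose_nothing():
    stack = ThreadSafeStack()
    workers, per_worker = 8, 1000

    def push_range(start):
        for value in range(start, start + per_worker):
            stack.push(value)

    threads = [threading.Thread(target=push_range, args=(w * per_worker,))
               for w in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    assert len(stack) == workers * per_worker
    popped = sorted(stack.pop() for _ in range(workers * per_worker))
    # Every pushed value comes back exactly once.
    assert popped == list(range(workers * per_worker))
```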
The key is that if something in the whole application breaks, meaning just looking at the total output/outcome, how do you know where it came from? Well, debugging, of course! Which puts you back where you started. TDD allows you to -hopefully- bypass this most painful stage in development.
Testers generally test end to end functionality. Obviously this is geared for going at user scenarios and has incredible value.
Unit Tests serve a different function. They are the developers' way of verifying that the components they write work correctly in the absence of other features or in combination with other features. This offers a range of value, including:
Provides un-ignorable documentation
Ability to isolate bugs to specific components
Verify invariants in the code
Provide quick, immediate feedback to changes in the code base.
One place to start is regression testing. Once you find a bug, write a small test that demonstrates the bug, fix it, then make sure the test now passes. In future you can run that test before each release to ensure that the bug has not been reintroduced.
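As a small illustration of that workflow, suppose (hypothetically) a bug was reported where version "1.10" sorted before "1.9" because versions were compared as strings. The fix and the regression test that pins it down might look like this; `parse_version` is an invented name.

```python
def parse_version(text):
    """Fixed implementation: compare versions numerically, part by part.

    The hypothetical original bug compared the raw strings, so "1.10"
    was considered older than "1.9".
    """
    return tuple(int(part) for part in text.split("."))


def test_regression_double_digit_minor_version():
    # Written first to demonstrate the bug, kept to stop it from returning.
    assert parse_version("1.10") > parse_version("1.9")
    # The string comparison that caused the original bug gets this wrong.
    assert not ("1.10" > "1.9")
```

The test is written while the bug is still present, fails, and then passes once the fix is in. After that it runs with every build.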
Why do that at a unit level instead of a whole-program level? Speed. In good code it's much faster to isolate a small unit and write a tiny test than to drive a complex program through to the bug point. Then, when testing, a unit test will generally run significantly faster than an integration test.
Very simply: Unit tests are easier to write, since you're only testing a single method's functionality. And bugs are easier to fix, since you know exactly what method is broken.
But like the other answerers have pointed out, unit tests aren't the end-all-be-all of testing. They're just the smallest piece of the equation.
Probably the single biggest difficulty with software is the sheer number of interacting things, and the most useful technique is to reduce the number of things that have to be considered.
For example, using higher-level languages rather than lower-level improves productivity, because one line is a separate thing, and being able to write a program in fewer lines reduces the number of things.
Procedural programming came about as an attempt to reduce complexity by making it possible to treat a function as a thing. In order to do that, though, we have to be able to think about what the function does in a coherent manner, and with confidence that we're right. (Object-oriented programming does a similar thing, on a larger scale.)
There are several ways to do this. Design-by-contract is a way of exactly specifying what the function does. Using function parameters rather than global variables to call the function and get results reduces the complexity of the function.
Unit testing is one way to verify that the function does what it is supposed to. It's usually possible to test all the code in a function, and sometimes all the execution paths. It is a way to tell if the function works as it should or not. If the function works, we can think about it as a single thing, rather than as multiple things we have to keep track of.
It serves other purposes. Unit tests are usually quick to run, and so can catch bugs quickly, when they're easy to fix. If developers make sure a function passes the tests before being checked in, then the tests are a form of documenting what the function does that is guaranteed correct. The act of creating the tests forces the test writer to think about what the function should be doing. After that, whoever wanted the change can look at the tests to see if he or she was properly understood.
By way of contrast, larger tests are not exhaustive, and so can easily miss lots of bugs. They're bad at localizing bugs. They are usually performed at fairly long intervals, so they may detect a bug some time after it's made. They define parts of the total user experience, but provide no basis to reason about any part of the system. They should not be neglected, but they are not a substitute for unit tests.
As others have stated, the length of the feedback loop and isolation of the problem to a specific component are key benefits of Unit Tests.
Another way that they are complementary to functional tests is how coverage is tracked in some organizations:
Unit tests on code coverage
Functional tests on requirements coverage
Functional tests might miss features that were implemented but are not in the spec.
Being based on the code, Unit tests might miss that a certain feature wasn't implemented, which is where requirements based coverage analysis of Functional testing comes in.
A final point : there are some things that are easier/faster to test at the unit level, especially around error scenarios.
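Error scenarios are a good example: forcing a real database or network to fail on cue is awkward, while a unit test can substitute a fake that fails on demand. In this sketch, `FlakyStore` and `read_with_retry` are hypothetical names invented for illustration.

```python
class FlakyStore:
    """Hypothetical stand-in for an external system that fails a set
    number of times before succeeding."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def read(self, key):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return f"value-for-{key}"


def read_with_retry(store, key, attempts=3):
    """Try store.read up to `attempts` times, re-raising the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return store.read(key)
        except ConnectionError as error:
            last_error = error
    raise last_error


def test_recovers_after_transient_failures():
    store = FlakyStore(failures=2)
    assert read_with_retry(store, "k") == "value-for-k"
    assert store.calls == 3


def test_gives_up_after_all_attempts_fail():
    store = FlakyStore(failures=5)
    try:
        read_with_retry(store, "k")
    except ConnectionError:
        assert store.calls == 3
        return
    raise AssertionError("expected ConnectionError")
```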
Unit testing will help you identify the source of your bug more clearly and let you know that you have a problem earlier. Both are good to have, but they are different, and unit testing does have benefits.
The software you test is a system. When you are testing it as a whole you are black box testing since you primarily deal with inputs and outputs. Black box testing is great when you have no means of getting inside of the system.
But since you usually do, you create a lot of unit tests that actually test your system as a white box. You can slice the system open in many ways and organize your tests depending on the system's internal structure. White box testing provides you with many more ways of testing and analyzing systems. It's clearly complementary to black box testing and should not be considered an alternative or competing methodology.