I want to unit test a signal generator - let's say it generates a simple sine wave, or does frequency modulation of a signal onto a sine wave. It's easy enough to define sensible test parameters, and it's well known what the output should "look like" - but this is quite hard to test.
I could, for example, do a frequency analysis on the output and check it, check the maximum amplitude, and so on, but (a) this will make the test code significantly more complicated than the code it's testing, and (b) it doesn't fully test the shape of the output.
Is there an established way to do this?
One way to do this would be to capture a "known good" output and compare bit-for-bit against that. As long as your algorithm is deterministic you should get the same output every time. You might have to recalibrate it occasionally if anything changes, but at least you'll know if it does change at all.
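For a deterministic generator that comparison can be a very small test. A minimal sketch in Python, assuming NumPy, a hypothetical generate_sine() under test, and a reference file captured once from a run you have reviewed by hand:

import numpy as np

def generate_sine(freq_hz, sample_rate, n_samples, amplitude=1.0):
    """Stand-in for the generator under test."""
    t = np.arange(n_samples) / sample_rate
    return amplitude * np.sin(2 * np.pi * freq_hz * t)

def test_sine_matches_golden_master():
    output = generate_sine(freq_hz=440.0, sample_rate=48_000, n_samples=4800)
    expected = np.load("golden/sine_440hz_48k.npy")  # known-good capture (hypothetical path)
    # Bit-for-bit would be np.array_equal(); a tiny tolerance survives
    # compiler/platform differences while still flagging real changes.
    np.testing.assert_allclose(output, expected, rtol=0, atol=1e-12)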
This situation is a strong argument for a modeling tool like Matlab, to generate and review a well-understood test set automatically, as well as to provide an environment for automatic comparison and scoring. Especially for instances where combinatorial explosions of test variations take place, automation makes it possible and straightforward to generate a huge dataset, locate problems, and pare back if needed to a representative qualification test set.
Often undervalued is the ability to generate a large, extensive set of tests exercising both the requirements and the limits of your design's implementation. Thinking about and designing those cases up front is also a huge advantage in delivering a clean, problem-free system.
One possible semi-automated way of testing is to code up your signal generators from the spec using 3 different algorithms, or perhaps by 3 different programmers in 3 different programming languages. Then randomly generate parameters within the complete range of legal control input values, and capture and compare the outputs of all 3 generators to see if they agree within some error bound. You could also include some typical and some suspected worst-case parameters. If the outputs always agree, then there's a much higher probability that everything works per spec than if they don't.
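A sketch of that comparison loop in Python, assuming three hypothetical implementations impl_a, impl_b and impl_c of the same spec that all take the same (frequency, amplitude) parameters and return sample arrays:

import random
import numpy as np

def compare_implementations(impls, n_trials=100, tol=1e-6):
    rng = random.Random(12345)            # fixed seed keeps any failure reproducible
    for _ in range(n_trials):
        freq = rng.uniform(1.0, 20_000.0)  # assumed legal control-input range
        amp = rng.uniform(0.0, 1.0)
        outputs = [np.asarray(f(freq, amp)) for f in impls]
        for other in outputs[1:]:
            if not np.allclose(outputs[0], other, atol=tol):
                raise AssertionError(f"implementations disagree at freq={freq}, amp={amp}")

# Usage (hypothetical implementations):
# compare_implementations([impl_a, impl_b, impl_c])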
Related
I'm working in a project where I need to generate Poisson, Normal, etc. variables from scratch. I know there are implementations in Python. I'm used to writing tests for almost everything I code.
I'm wondering what would be a good practice (if any) to test those functions?
I assume that your implementation is built on top of a uniform-distribution pseudorandom number generator which you trust to be good enough (not only the distribution of the generated values, but also the randomness of their order - see the Diehard tests).
You should build two histograms: The first, based on values generated by your implementation. The second, based on a trusted implementation, or better - based on a maximum-likelihood estimate of the value count in each histogram column of the given distribution.
Next, you can verify that the counts match, for all histogram columns, using a tight confidence interval.
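A minimal sketch of that comparison in Python, assuming SciPy is available and my_poisson(lam) is the hypothetical implementation under test; a chi-square goodness-of-fit test stands in here for the per-column confidence check, with the expected counts taken from the known PMF:

import numpy as np
from scipy import stats

def test_poisson_histogram(lam=4.0, n=100_000):
    samples = np.array([my_poisson(lam) for _ in range(n)])
    ks = np.arange(0, 13)                                   # histogram columns 0..12
    observed = np.array([(samples == k).sum() for k in ks])
    expected = stats.poisson.pmf(ks, lam) * n
    # Lump everything >= 13 into one tail bin so the totals match exactly
    # and no bin has a tiny expected count.
    observed = np.append(observed, n - observed.sum())
    expected = np.append(expected, n - expected.sum())
    _, p_value = stats.chisquare(observed, expected)
    assert p_value > 0.001, "counts deviate from the Poisson(4) expectation"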
What I've done in similar circumstances is a) write a simple histogram routine that plots a histogram of samples, and run it on a few thousand samples to eyeball it; and b) test some key statistics - standard deviation, mean, ... to see that they behave as they should.
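For (b), a rough sketch in Python, assuming NumPy and a hypothetical my_normal(mu, sigma) under test; the tolerances are deliberately loose so the test is not flaky:

import numpy as np

def test_normal_mean_and_std(mu=10.0, sigma=2.0, n=200_000):
    samples = np.array([my_normal(mu, sigma) for _ in range(n)])
    # The sample mean wanders by roughly sigma / sqrt(n), so allow several times that.
    assert abs(samples.mean() - mu) < 6 * sigma / np.sqrt(n)
    assert abs(samples.std(ddof=1) - sigma) < 0.05 * sigma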
You could at the very least assert that the returned value is not null and is in the range you expect. That ensures the methods at least run, don't error out, and pass a basic sanity check.
You could also gather many values, and assert that you get somewhere close to the expected distribution of values but that would take more work.
I'm convinced that software testing indeed is very important, especially in science. However, over the last 6 years, I have never come across any scientific software project that was under regular testing (and most of them were not even version controlled).
Now I'm wondering how you deal with software tests for scientific codes (numerical computations).
From my point of view, standard unit tests often miss the point, since there is no exact result, so using assert(a == b) might prove a bit difficult due to "normal" numerical errors.
So I'm looking forward to reading your thoughts about this.
I am also in academia and I have written quantum mechanical simulation programs to be executed on our cluster. I made the same observation regarding testing or even version control. It was even worse: in my case I am using a C++ library for my simulations, and the code I got from others was pure spaghetti code, no inheritance, not even functions.
I rewrote it and I also implemented some unit testing. You are correct that you have to deal with numerical precision, which can differ depending on the architecture you are running on. Nevertheless, unit testing is possible, as long as you take these numerical rounding errors into account. Your result should not depend on the rounding of the numerical values; otherwise you would have a different problem, with the robustness of your algorithm.
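For example, instead of exact equality you can use tolerance-based assertions that absorb those rounding differences. A small sketch in Python (pytest.approx and numpy.testing are standard; solve_eigenvalues and the reference values are purely illustrative):

import numpy as np
import pytest

def test_lowest_eigenvalues():
    energies = solve_eigenvalues(n_states=3)   # hypothetical code under test
    # Relative/absolute tolerances absorb architecture-dependent rounding.
    assert energies[0] == pytest.approx(-13.6057, rel=1e-6)
    np.testing.assert_allclose(energies, [-13.6057, -3.4014, -1.5117], rtol=1e-5)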
So, to conclude, I use unit testing for my scientific programs, and it really makes one more confident about the results, especially with regards to publishing the data in the end.
Just been looking at a similar issue (google: "testing scientific software") and came up with a few papers that may be of interest. These cover both the mundane coding errors and the bigger issues of knowing if the result is even right (depth of the Earth's mantle?)
http://http.icsi.berkeley.edu/ftp/pub/speech/papers/wikipapers/cox_harris_testing_numerical_software.pdf
http://www.cs.ua.edu/~SECSE09/Presentations/09_Hook.pdf (broken link; new link is http://www.se4science.org/workshops/secse09/Presentations/09_Hook.pdf)
http://www.associationforsoftwaretesting.org/?dl_name=DianeKellyRebeccaSanders_TheChallengeOfTestingScientificSoftware_paper.pdf
I thought the idea of mutation testing described in 09_Hook.pdf (see also matmute.sourceforge.net) is particularly interesting as it mimics the simple mistakes we all make. The hardest part is to learn to use statistical analysis for confidence levels, rather than single pass code reviews (man or machine).
The problem is not new. I'm sure I have an original copy of "How accurate is scientific software?" by Hatton et al., Oct 1994, which even then showed how different implementations of the same theories (as algorithms) diverged rather rapidly. (It's also ref 8 in the Kelly & Sanders paper.)
Edit (Oct 2019): more recently, see "Testing Scientific Software: A Systematic Literature Review".
I also use cpptest for its TEST_ASSERT_DELTA. I write high-performance numerical programs in computational electromagnetics, and it has served me well in my C++ code.
I typically go about testing scientific code the same way as I do with any other kind of code, with only a few retouches, namely:
I always test my numerical codes for cases that make no physical sense and make sure the computation actually stops before producing a result. I learned this the hard way: I had a function that was computing some frequency responses, then supplied a matrix built with them to another function as arguments, which eventually gave its answer as a single vector. The matrix could have been any size depending on how many terminals the signal was applied to, but my function was not checking whether the matrix size was consistent with the number of terminals (2 terminals should have meant a 2 x 2 x n matrix); however, the code itself was wrapped so as not to depend on that, and it didn't care what size the matrices were since it just had to do some basic matrix operations on them. Eventually, the results were perfectly plausible, well within the expected range and, in fact, partially correct -- only half of the solution vector was garbled. It took me a while to figure it out. If your data looks correct, it's assembled in a valid data structure and the numerical values are good (e.g. no NaNs or negative number of particles) but it doesn't make physical sense, the function has to fail gracefully.
I always test the I/O routines even if they are just reading a bunch of comma-separated numbers from a test file. When you're writing code that does twisted math, it's always tempting to jump into debugging the part of the code that is so math-heavy that you need a caffeine jolt just to understand the symbols. Days later, you realize you are also adding the ASCII value of \n to your list of points.
When testing for a mathematical relation, I always test it "by the book", and I also learned this by example. I've seen code that was supposed to compare two vectors but only checked for equality of elements and did not check for equality of length.
Please take a look at the answers to the SO question How to use TDD correctly to implement a numerical method?
I have made quite a few genetic algorithms; they work (they find a reasonable solution quickly). But I have now discovered TDD. Is there a way to write a genetic algorithm (which relies heavily on random numbers) in a TDD way?
To pose the question more generally: how do you test a non-deterministic method/function? Here is what I have thought of:
Use a specific seed. This won't help if I make a mistake in the code in the first place, but it will help find bugs when refactoring.
Use a known list of numbers. Similar to the above but I could follow the code through by hand (which would be very tedious).
Use a constant number. At least I know what to expect. It would be good to ensure that a die always reads 6 when RandomFloat(0,1) always returns 1 (see the sketch after this list).
Try to move as much of the non-deterministic code out of the GA as possible. This seems silly, as that is the core of its purpose.
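A tiny Python sketch of that "constant number" idea, with a hypothetical roll_die that takes its random source as a parameter so a test can pin it to a constant:

def roll_die(random_float):
    """Map a float in [0, 1] onto a die face 1..6."""
    return min(6, int(random_float() * 6) + 1)

def test_die_reads_six_when_random_float_returns_one():
    assert roll_die(lambda: 1.0) == 6

def test_die_reads_one_when_random_float_returns_zero():
    assert roll_die(lambda: 0.0) == 1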
Links to very good books on testing would be appreciated too.
Seems to me that the only way to test its consistent logic is to apply consistent input, ... or treat each iteration as a single automaton whose state is tested before and after that iteration, turning the overall nondeterministic system into testable components based on deterministic iteration values.
For variations/breeding/attribute inheritance in iterations, test those values on the boundaries of each iteration and test the global output of all iterations based on known input/output from successful iteration-subtests ...
Because the algorithm is iterative, you can use induction in your testing: show it works for 1 iteration, and for n+1 iterations given n, to prove it will produce correct results (regardless of data determinism) for a given input range/domain and the constraints on possible values in the input.
Edit: I found these "strategies for testing nondeterministic systems", which might provide some insight. It might be helpful for statistical analysis of live results once the TDD/development process proves the logic is sound.
I would test random functions by testing them a number of times and analyzing whether the distribution of return values meets the statistical expectations (this involves some statistical knowledge).
If you're talking TDD, I would say definitely start out by picking a constant number and growing your test suite from there. I've done TDD on a few highly mathematical problems and it helps to have a few constant cases you know and have worked out by hand to run with from the beginning.
W/R/T your 4th point, moving nondeterministic code out of the GA, I think this is probably an approach worth considering. If you can decompose the algorithm and separate the nondeterministic concerns, it should make testing the deterministic parts straightforward. As long as you're careful about how you name things I don't think that you're sacrificing much here. Unless I am misunderstanding you, the GA will still delegate to this code, but it lives somewhere else.
As far as links to very good books on (developer) testing my favorites are:
Test Driven by Lasse Koskela
Working Effectively with Legacy Code by Michael Feathers
XUnit Test Patterns by Gerard Meszaros
Next Generation Java™ Testing: TestNG and Advanced Concepts by Cédric Beust & Hani Suleiman
One way I unit test the non-deterministic functions of GA algorithms is to put the selection of random numbers in a different function from the logic that uses those random numbers.
For example, if you have a function that takes a gene (a vector of something) and picks two random points of the gene to do something with them (mutation or whatever), you can put the generation of the random numbers in one function, and then pass them along with the gene to another function that contains the logic, given those numbers.
This way you can do TDD with the logic function: pass it certain genes and certain numbers, knowing exactly what the logic should do to the gene given those numbers, and write asserts on the modified gene.
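A small Python sketch of that separation; pick_mutation_points and mutate_at are illustrative, not from any particular GA library:

import random

def pick_mutation_points(gene_length, rng=random):
    """Non-deterministic part: choose two positions in the gene."""
    return rng.randrange(gene_length), rng.randrange(gene_length)

def mutate_at(gene, i, j):
    """Deterministic part: the example 'logic' swaps the two chosen positions."""
    mutated = list(gene)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

def test_mutate_at_swaps_exactly_the_given_positions():
    assert mutate_at([1, 2, 3, 4], 0, 3) == [4, 2, 3, 1]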
Another way to test code that depends on random number generation is to externalize that generation to another class, which could be accessed via a context or loaded from a config value, and to use a different one for test executions. There would be two implementations of that class: one for production that generates actual random numbers, and another for testing that has ways to accept the numbers it will later return. Then in the test you can provide the specific numbers that the class will supply to the tested code.
You could write a redundant neural network to analyze the results from your algorithm and have the output ranked based on expected outcomes. :)
Break your method down as much as you can. Then you can also have a unit test around just the random part to check the range of values. You can even have the test run it a few times to see if the result changes.
All of your functions should be completely deterministic. This means that none of the functions you are testing should generate the random number inside the function itself. You will want to pass that in as a parameter. That way, when your program is making decisions based on your random numbers, you can pass in representative numbers to test the expected output for that number. The only thing that shouldn't be deterministic is your actual random number generator, which you don't really need to worry too much about because you shouldn't be writing it yourself. You should be able to just assume it works, as long as it's an established library.
That's for your unit tests. For your integration tests, if you are doing that, you might look into mocking your random number generation, replacing it with an algorithm that will return known numbers from 0..n for every random number that you need to generate.
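A hedged Python sketch of that mocking approach, using the standard-library unittest.mock and a toy select_parent function standing in for the production code:

from unittest import mock
import random

def select_parent(population):
    """Pick a parent at random (toy example of the code under test)."""
    return population[random.randrange(len(population))]

def test_select_parent_returns_the_index_the_rng_chose():
    population = ["a", "b", "c", "d"]
    # The patched RNG returns a known value, so the outcome is predictable.
    with mock.patch("random.randrange", side_effect=[2]):
        assert select_parent(population) == "c"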
I wrote a C# TDD Genetic Algorithm didactic application:
http://code.google.com/p/evo-lisa-clone/
Let's take the simplest random result method in the application: PointGenetics.Create, which creates a random point, given the boundaries. For this method I used 5 tests, and none of them relies on a specific seed:
http://code.google.com/p/evo-lisa-clone/source/browse/trunk/EvoLisaClone/EvoLisaCloneTest/PointGeneticsTest.cs
The randomness test is simple: for a large boundary (many possibilities), two consecutive generated points should not be equal. The remaining tests check other constraints.
Well, the most testable part is the fitness function - that's where all your logic will be. This can in some cases be quite complex (you might be running all sorts of simulations based on input parameters), so you want to be sure all of that works, with a whole lot of unit tests, and this work can follow whatever methodology.
With regards to testing the GA parameters (mutation rate, cross-over strategy, whatever) if you're implementing that stuff yourself you can certainly test it (you can again have unit tests around mutation logic etc.) but you won't be able to test the 'fine-tuning' of the GA.
In other words, you won't be able to test if GA actually performs other than by the goodness of the solutions found.
A test that the algorithm gives you the same result for the same input could help you, but sometimes you will make changes that alter the result-picking behavior of the algorithm.
I would put the most effort into having a test that ensures the algorithm gives you a correct result. If the algorithm gives a correct result for a number of static seeds and random values, then the algorithm works, or at least has not been broken by the changes made.
Another opportunity in TDD is the possibility of evaluating the algorithm. If you can automatically check how good a result is, you could add tests showing that a change hasn't lowered the quality of your results or increased your calculation time unreasonably.
If you want to test your algorithm with many base seeds, you may want two test suites: one that runs a quick test after every save, to ensure that you haven't broken anything, and one that runs for a longer time for later evaluation.
I would highly suggest looking into using mock objects for your unit test cases (http://en.wikipedia.org/wiki/Mock_object). You can use them to mock out the objects that make random guesses, so that you get expected results instead.
Imagine that you have an internally controlled list of vendors. Now imagine that you want to match unstructured strings against that list. Most will be easy to match, but some may be reasonably impossible. The algorithm will assign a confidence to each match, but a human needs to confirm all matches produced.
How could this algorithm be unit tested? The only idea I have had so far is to take a sample of pairs matched by humans and make sure the algorithm is able to successfully match those, omitting strings that I couldn't reasonably expect our algorithm to handle. Is there a better way?
I'd try some 'canonical' pairs, both "should match" and "shouldn't match" pairs, and test only whether the confidence is above (or below) a given threshold.
Maybe you can also do some ordering checks, such as "no pair should have greater confidence than the exact-match pair", or "a pair that matches on all consonants should score >= one that matches only on vowels".
You can also test if the confidence of strings your algorithm won't handle well is sufficiently low. In this way you can see if there is a threshold over which you can trust your algorithm as well.
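A rough Python sketch of those threshold and ordering checks; match_confidence, the pairs and the thresholds are all hypothetical:

KNOWN_GOOD = [("Acme Corp", "ACME Corporation"), ("Globex", "Globex Inc.")]
KNOWN_BAD = [("Acme Corp", "Initech"), ("Globex", "Umbrella Ltd")]

def test_canonical_pairs_clear_the_thresholds():
    for raw, vendor in KNOWN_GOOD:
        assert match_confidence(raw, vendor) >= 0.8   # "should match" pairs
    for raw, vendor in KNOWN_BAD:
        assert match_confidence(raw, vendor) <= 0.2   # "shouldn't match" pairs

def test_nothing_outscores_an_exact_match():
    exact = match_confidence("Acme Corp", "Acme Corp")
    for raw, vendor in KNOWN_GOOD + KNOWN_BAD:
        assert match_confidence(raw, vendor) <= exact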
An interesting exercise would be to store the human answers that correct your algorithm and try to see if you could improve your algorithm to not get them wrong.
If you can, add the new matches to the unit tests.
I don't think there's a better way than what you describe; effectively, you're just using a set of predefined data to test that the algorithm does what you expect. For any very complicated algorithm which has very nonlinear inputs and outputs, that's about the best you can do; choose a good test set, and assure that you run properly against that set of known values. If other values come up which need to be tested in the future, you can add them to the set of tested values.
That sounds fair. If it's possible (given time constraints) to get as large a sample of human matches as you can, you could get a picture of how well your algorithm is doing. You could design specific unit tests which pass if they're within X% of correctness.
Best of luck.
I think there are two issues here: the way your code behaves according to the algorithm, and the way the algorithm is successful (i.e. does not accept answers which a human later rejects, and does not reject answers a human would accept).
Issue 1 is regular testing. For issue 2, I would go with previous result sets (i.e. compare the algorithm's results to human ones).
What you describe is the best way, because what counts as the best match is subjective; only a human can come up with the appropriate test cases.
It sounds as though you are describing an algorithm which is deterministic, but one which is sufficiently difficult that your best initial guess at the correct result is going to be whatever your current implementation delivers to you (aka deterministic implementation to satisfy fuzzy requirements).
For those sorts of circumstances, I will use a "Guru Checks Changes" pattern. Generate a collection of inputs, record the outputs, and in subsequent runs of the unit tests, verify that the outcome is consistent with the previous results. Not so great for ensuring that the target algorithm is implemented correctly, but it is effective for ensuring that the most recent refactoring hasn't changed the behavior in the test space.
A variation of this, which may be more palatable for your circumstance, is to start from the same initial data collection, but rather than trying to preserve precisely the same result every time, you instead predefine some buckets and flag any time an implementation change moves a test result from one confidence bucket to another.
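A rough Python sketch of that bucket variation; match_confidence is hypothetical, and the bucket boundaries and expected assignments would be captured from a previously reviewed run:

def bucket(confidence):
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"

EXPECTED_BUCKETS = {
    ("Acme Corp", "ACME Corporation"): "high",
    ("Glbx", "Globex Inc."): "medium",
    ("Acme Corp", "Initech"): "low",
}

def test_no_result_moves_to_a_different_bucket():
    for (raw, vendor), expected in EXPECTED_BUCKETS.items():
        assert bucket(match_confidence(raw, vendor)) == expected, (raw, vendor)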
Samples that have clearly correct answers (exact matches, null matches, high value corner cases) should be kept in a separate test.
Let's say we have a simple function defined in a pseudo language.
List<Numbers> SortNumbers(List<Numbers> unsorted, bool ascending);
We pass in an unsorted list of numbers and a boolean specifying ascending or descending sort order. In return, we get a sorted list of numbers.
In my experience, some people are better at capturing boundary conditions than others. The question is, "How do you know when you are 'done' capturing test cases"?
We can start listing cases now and some clever person will undoubtedly think of 'one more' case that isn't covered by any of the previous.
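For concreteness, a first pass at listing such cases might look like this sketch; it is deliberately incomplete, written in Python/pytest with the built-in sorted() standing in for SortNumbers:

import pytest

@pytest.mark.parametrize("unsorted, ascending, expected", [
    ([], True, []),                        # empty list
    ([7], True, [7]),                      # single element
    ([3, 1, 2], True, [1, 2, 3]),          # typical ascending case
    ([3, 1, 2], False, [3, 2, 1]),         # descending order
    ([2, 2, 1], True, [1, 2, 2]),          # duplicates
    ([1, 2, 3], True, [1, 2, 3]),          # already sorted
    ([-1, 0, -5], True, [-5, -1, 0]),      # negatives and zero
])
def test_sort_numbers(unsorted, ascending, expected):
    assert sorted(unsorted, reverse=not ascending) == expected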
Don't waste too much time trying to think of every boundary condition. Your tests won't be able to catch every bug the first time around. The idea is to have tests that are pretty good, and then each time a bug does surface, write a new test specifically for that bug so that you never hear from it again.
Another note I want to make about code coverage tools: in a language like C# or Java where you have many get/set and similar methods, you should not be shooting for 100% coverage. That means you are wasting too much time writing tests for trivial code. You only want 100% coverage on your complex business logic. If your full codebase is closer to 70-80% coverage, you are doing a good job. If your code coverage tool allows multiple coverage metrics, the best one is 'block coverage', which measures coverage of 'basic blocks'. Other types are class and method coverage (which don't give you as much information) and line coverage (which is too fine-grained).
How do you know when you are 'done' capturing test cases?
You don't. You can't get to 100% except for the most trivial cases. Also, 100% coverage (of lines, paths, conditions...) doesn't tell you you've hit all boundary conditions.
Most importantly, the test cases are not write-and-forget. Each time you find a bug, write an additional test. Check it fails with the original program, check it passes with the corrected program and add it to your test set.
An excerpt from The Art of Software Testing by Glenford J. Myers:
1. If an input condition specifies a range of values, write test cases for the ends of the range, and invalid-input test cases for situations just beyond the ends.
2. If an input condition specifies a number of values, write test cases for the minimum and maximum number of values and one beneath and beyond these values.
3. Use guideline 1 for each output condition.
4. Use guideline 2 for each output condition.
5. If the input or output of a program is an ordered set, focus attention on the first and last elements of the set.
6. In addition, use your ingenuity to search for other boundary conditions.
(I've only pasted the bare minimum for copyright reasons.)
Points 3. and 4. above are very important. People tend to forget boundary conditions for the outputs. 5. is OK. 6. really doesn't help :-)
Short exam
This is more difficult than it looks. Myers offers this test:
The program reads three integer values from an input dialog. The three values represent the lengths of the sides of a triangle. The program displays a message that states whether the triangle is scalene, isosceles, or equilateral.
Remember that a scalene triangle is one where no two sides are equal, whereas an isosceles triangle has two equal sides, and an equilateral triangle has three sides of equal length. Moreover, the angles opposite the equal sides in an isosceles triangle also are equal (it also follows that the sides opposite equal angles in a triangle are equal), and all angles in an equilateral triangle are equal.
Write your test cases. How many do you have? Myers asks 14 questions about your test set and reports that highly qualified professional programmers average 7.8 out of a possible 14.
From a practical standpoint, I create a list of tests that I believe must pass prior to acceptance. I test these and automate where possible. Based on how much time I've estimated for the task or how much time I've been given, I extend my test coverage to include items that should pass prior to acceptance. Of course, the line between must and should is subjective. After that, I update automated tests as bugs are discovered.
#Keith
I think you nailed it; code coverage is important to look at if you want to see how "done" you are, but I think 100% is a bit unrealistic as a goal. Striving for 75-90% will give you pretty good coverage without going overboard... don't test for the pure sake of hitting 100%, because at that point you are just wasting your time.
A good code coverage tool really helps.
100% coverage doesn't mean that it definitely is adequately tested, but it's a good indicator.
For .NET, NCover's quite good, but it is no longer open source.
#Mike Stone -
Yeah, perhaps that should have been "high coverage" - we aim for 80% minimum; past about 95% it's usually diminishing returns, especially if you have belt 'n' braces code.