I'm working on a project where I need to generate Poisson, Normal, etc. variables from scratch. I know there are implementations in Python. I'm used to writing tests for almost everything I code.
I'm wondering what would be a good practice (if any) to test those functions?
I assume that your implementation is built on top of a uniform-distribution pseudorandom number generator which you trust to be good enough (not only the distribution of the generated values, but also the randomness of their order - see the Diehard tests).
You should build two histograms: The first, based on values generated by your implementation. The second, based on a trusted implementation, or better - based on a maximum-likelihood estimate of the value count in each histogram column of the given distribution.
Next, you can verify that the counts match, for all histogram columns, using a tight confidence interval.
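As a rough sketch of the histogram comparison described above, the Python below checks each histogram column of a from-scratch Poisson generator against the count expected from the exact probability mass function. The generator (Knuth's multiplication method), the function names, the seed, and the 5-standard-deviation tolerance are all assumptions chosen for illustration, not part of the original answer.

```python
import math
import random


def poisson_knuth(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def poisson_pmf(k, lam):
    """Exact Poisson probability P(X = k)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def histogram_matches(samples, lam, max_k, z=5.0):
    """Check every histogram column against its expected count.

    Column k holds the value k for k < max_k; the last column collects
    everything >= max_k. A column passes when its count lies within
    z binomial standard deviations of the expected count.
    """
    n = len(samples)
    probs = [poisson_pmf(k, lam) for k in range(max_k)]
    probs.append(1.0 - sum(probs))  # tail column
    counts = [0] * (max_k + 1)
    for value in samples:
        counts[min(value, max_k)] += 1
    for count, p in zip(counts, probs):
        expected = n * p
        sd = math.sqrt(n * p * (1.0 - p))
        if abs(count - expected) > z * sd:
            return False
    return True


rng = random.Random(12345)  # fixed seed keeps the test repeatable
samples = [poisson_knuth(4.0, rng) for _ in range(20000)]
print(histogram_matches(samples, lam=4.0, max_k=12))
```

A wide tolerance such as z=5 trades sensitivity for stability: it rarely rejects a correct generator, but it still catches a generator whose rate parameter is noticeably off when the sample is this large.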
What I've done in similar circumstances is a) write a simple histogram routine that plots a histogram of samples, and run it on a few thousand samples to eyeball it; and b) test some key statistics - standard deviation, mean, ... to see that they behave as they should.
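A minimal sketch of step b), assuming a Normal generator built with the Box-Muller transform. The function names, the seed, and the tolerance of 5 standard errors are illustrative choices, not something the answer specifies.

```python
import math
import random
import statistics


def normal_box_muller(mu, sigma, rng):
    """Draw one Normal(mu, sigma) value with the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is defined
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z


def moments_look_right(samples, mu, sigma, z=5.0):
    """Compare the sample mean and standard deviation with their targets."""
    n = len(samples)
    # Standard error of the mean is sigma / sqrt(n).
    mean_ok = abs(statistics.fmean(samples) - mu) <= z * sigma / math.sqrt(n)
    # For normal data the sample std has a standard error of
    # roughly sigma / sqrt(2n).
    sd_ok = abs(statistics.stdev(samples) - sigma) <= z * sigma / math.sqrt(2 * n)
    return mean_ok and sd_ok


rng = random.Random(2024)  # fixed seed keeps the check repeatable
samples = [normal_box_muller(10.0, 2.0, rng) for _ in range(10000)]
print(moments_look_right(samples, mu=10.0, sigma=2.0))
```

Matching the first two moments does not prove the shape is right, which is why the answer pairs this check with a histogram.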
You could at the very least assert that the returned value is not null and in the range you expect. That still ensures that the methods at least run and don't error out and that they pass a basic sanity check.
You could also gather many values, and assert that you get somewhere close to the expected distribution of values but that would take more work.
Related
I want to unit test a signal generator - let's say it generates a simple sine wave, or does frequency modulation of a signal onto a sine wave. It's easy enough to define sensible test parameters, and it's well known what the output should "look like" - but this is quite hard to test.
I could do (eg) a frequency analysis on the output and check that, check the maximum amplitude etc, but a) this will make the test code significantly more complicated than the code it's testing and b) doesn't fully test the shape of the output.
Is there an established way to do this?
One way to do this would be to capture a "known good" output and compare bit-for-bit against that. As long as your algorithm is deterministic you should get the same output every time. You might have to recalibrate it occasionally if anything changes, but at least you'll know if it does change at all.
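A small sketch of the "known good" comparison, assuming a hypothetical `sine_wave` generator. In a real suite the reference would be captured once, reviewed by a person, and stored in a file; it is recorded in memory here only to keep the example short.

```python
import math


def sine_wave(frequency_hz, sample_rate_hz, n_samples, amplitude=1.0):
    """Hypothetical generator under test: n_samples of a sine wave."""
    step = 2.0 * math.pi * frequency_hz / sample_rate_hz
    return [amplitude * math.sin(step * i) for i in range(n_samples)]


def first_mismatch(actual, reference, tolerance=0.0):
    """Return the index of the first differing sample, or None if all match.

    tolerance=0.0 gives an exact comparison; a small positive value
    allows for platforms whose math library rounds differently.
    """
    if len(actual) != len(reference):
        return min(len(actual), len(reference))
    for i, (a, r) in enumerate(zip(actual, reference)):
        if abs(a - r) > tolerance:
            return i
    return None


# Stand-in for a stored, reviewed reference capture.
reference = sine_wave(440.0, 8000.0, 64)
print(first_mismatch(sine_wave(440.0, 8000.0, 64), reference))
```

Reporting the index of the first mismatch, rather than a bare pass/fail, makes it easier to see where the output started to drift when the reference needs recalibrating.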
This situation is a strong argument for a modeling tool like Matlab, to generate and review a well-understood test set automatically, as well as to provide an environment for automatic comparison and scoring. Especially where combinatorial explosions of test variations take place, automation makes it possible and straightforward to generate a huge dataset, locate problems, and pare back if needed to a representative qualification test set.
Often undervalued is the means to generate a large, extensive set of tests exercising both the requirements and the limits of the implementation of your design. Thinking about and designing those cases up front is also a huge advantage in introducing a clean, problem-free system.
One possible semi-automated way of testing is to code up your signal generators from spec by 3 different algorithms, or perhaps by 3 different programmers in 3 different programming languages. Then randomly generate parameters within the complete range of legal control input values and capture and compare the outputs of all 3 generators to see if they agree within some error bound. You could also include some typical and some suspected worst-case parameters. If the outputs always agree, then there's a much higher probability that everything works per spec than if they don't.
Specifically, I've got a method that picks n items from a list in such a way that a% of them meet one criterion, and b% meet a second, and so on. A simplified example would be to pick 5 items where 50% have a given property with the value 'true', and 50% 'false'; 50% of the time the method would return 2 true/3 false, and the other 50%, 3 true/2 false.
Statistically speaking, this means that over 100 runs, I should get about 250 true/250 false, but because of the randomness, 240/260 is entirely possible.
What's the best way to unit test this? I'm assuming that even though technically 300/200 is possible, it should probably fail the test if this happens. Is there a generally accepted tolerance for cases like this, and if so, how do you determine what that is?
Edit: In the code I'm working on, I don't have the luxury of using a pseudo-random number generator, or a mechanism of forcing it to balance out over time, as the lists that are picked out are generated on different machines. I need to be able to demonstrate that over time, the average number of items matching each criterion will tend to the required percentage.
Random and statistics are not favored in unit tests. Unit tests should always return the same result. Always. Not mostly.
What you could do is try to separate the random generator from the logic you are testing. Then you can mock the random generator and return predefined values.
Additional thoughts:
You could consider changing the implementation to make it more testable. Try to use as few random values as possible. You could for instance draw only one random value to determine the deviation from the average distribution. This would be easy to test. If the random value is zero, you should get exactly the distribution you expect on average. If the value is for instance 1.0, you miss the average by some defined factor, for instance by 10%. You could also implement some Gaussian distribution etc. I know this is not the topic here, but if you are free to implement it as you want, consider testability.
According to the Statistical information you have, determine a range instead of a particular single value as a result.
Many probabilistic algorithms in e.g. scientific computing use pseudo-random number generators, instead of a true random number generator. Even though they're not truly random, a carefully chosen pseudo-random number generator will do the job just fine.
One advantage of a pseudo-random number generator is that the random number sequence they produce is fully reproducible. Since the algorithm is deterministic, the same seed would always generate the same sequence. This is often the deciding factor why they're chosen in the first place, because experiments need to be repeatable, results reproducible.
This concept is also applicable for testing. Components can be designed such that you can plug in any source of random numbers. For testing, you can then use generators that are consistently seeded. The result would then be repeatable, which is suitable for testing.
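A minimal sketch of a pluggable random source, assuming a hypothetical `pick_items` function. The caller supplies the generator, so a test can pass one with a fixed seed while production code passes an unseeded one.

```python
import random


def pick_items(items, n, rng):
    """Pick n distinct items using whatever random source is plugged in."""
    return rng.sample(items, n)


# Two generators with the same seed produce the same picks, so the
# result of a test run is repeatable.
first = pick_items(list(range(100)), 5, random.Random(42))
second = pick_items(list(range(100)), 5, random.Random(42))
print(first == second)
```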
Note that if in fact a true random number is needed, you can still test it this way, as long as the component features a pluggable source of random numbers. You can re-plug in the same sequence (which may be truly random if need be) to the same component for testing.
It seems to me there are at least three distinct things you want to test here:
1. The correctness of the procedure that generates an output using the random source
2. That the distribution of the random source is what you expect
3. That the distribution of the output is what you expect
1 should be deterministic and you can unit test it by supplying a chosen set of known "random" values and inputs and checking that it produces the known correct outputs. This would be easiest if you structure the code so that the random source is passed as an argument rather than embedded in the code.
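To illustrate the first kind of test, here is a sketch in which the random source is passed in as an argument and replaced by a scripted stand-in. `ScriptedRandom` and `true_count` are hypothetical names, and the rounding rule is only one guess at how the 5-item, 50% example from the question might be implemented.

```python
import math


class ScriptedRandom:
    """Test double that returns a predetermined sequence of values."""

    def __init__(self, values):
        self._values = iter(values)

    def random(self):
        return next(self._values)


def true_count(n, fraction, rng):
    """How many of n picked items should have the property.

    When n * fraction is not a whole number, round up with probability
    equal to the fractional part, so the long-run average is n * fraction.
    """
    exact = n * fraction
    base = math.floor(exact)
    return base + (1 if rng.random() < exact - base else 0)


# With known "random" values the output is fully determined.
print(true_count(5, 0.5, ScriptedRandom([0.1])))  # draw below 0.5: rounds up
print(true_count(5, 0.5, ScriptedRandom([0.9])))  # draw above 0.5: rounds down
```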
2 and 3 cannot be tested absolutely. You can test to some chosen confidence level, but you must be prepared for such tests to fail in some fraction of cases. Probably the thing you really want to look out for is test 3 failing much more often than test 2, since that would suggest that your algorithm is wrong.
The tests to apply will depend on the expected distribution. For 2 you most likely expect the random source to be uniformly distributed. There are various tests for this, depending on how involved you want to be, see for example Tests for pseudo-random number generators on this page.
The expected distribution for 3 will depend very much on exactly what you're producing. The simple 50-50 case in the question is exactly equivalent to testing for a fair coin, but obviously other cases will be more complicated. If you can work out what the distribution should be, a chi-square test against it may help.
That depends on the use you make of your test suite. If you run it every few seconds because you embrace test-driven development and aggressive refactoring, then it is very important that it doesn't fail spuriously, because this causes major disruption and lowers productivity, so you should choose a threshold that is practically impossible to reach for a well-behaved implementation. If you run your tests once a night and have some time to investigate failures you can be much stricter.
Under no circumstances should you deploy something that will lead to frequent uninvestigated failures - this defeats the entire purpose of having a test suite, and dramatically reduces its value to the team.
You should test the distribution of results in a "single" unit test, i.e. that the result is as close to the desired distribution as possible in any individual run. For your example, 2 true / 3 false is OK, 4 true / 1 false is not OK as a result.
Also you could write tests which execute the method e.g. 100 times and check that the average of the distributions is "close enough" to the desired rate. This is a borderline case - running bigger batches may take a significant amount of time, so you might want to run these tests separately from your "regular" unit tests. Also, as Stefan Steinegger points out, such a test is going to fail every now and then if you define "close enough" too strictly, or become meaningless if you define the threshold too loosely. So it is a tricky case...
If I had the same problem, I would probably construct a confidence interval to detect anomalies, provided you have some statistics such as the mean and standard deviation. In your case, if the expected value is 250, create a 95% confidence interval around it using a normal distribution. If the results fall outside that interval, the test fails. Note that a 95% interval will reject a correct implementation in about one run out of twenty.
Why not re-factor the random number generation code and let the unit test framework and the source code both use it? You are trying to test your algorithm and not the randomized sequence right?
First you have to know what distribution should result from your random number generation process. In your case you are generating a result which is either 0 or 1, each with probability 0.5. Summed over n results, this describes a binomial distribution with p=0.5.
Given the sample size of n, you can construct (as an earlier poster suggested) a confidence interval around the mean. You can also make various statements about the probability of getting, for instance, 240 or less of either outcome when n=500.
You could use a normal distribution assumption for values of N greater than 20 as long as p is not very large or very small. The Wikipedia post has more on this.
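The arithmetic for the n=500, p=0.5 case can be sketched as follows, under the binomial model this answer assumes. The function names are illustrative, and the z value of 1.96 corresponds to a two-sided 95% interval.

```python
import math


def normal_interval(n, p, z=1.96):
    """Normal-approximation interval for a Binomial(n, p) count."""
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    return mean - z * sd, mean + z * sd


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1)
    )


# Standard deviation is sqrt(500 * 0.5 * 0.5), about 11.18, so the
# interval is roughly 250 plus or minus 21.9.
low, high = normal_interval(500, 0.5)
print(low, high)

# Probability of seeing 240 or fewer of one outcome in 500 draws.
print(binomial_cdf(240, 500, 0.5))
```

Under this model a 300/200 split lies roughly 4.5 standard deviations from the mean, which is why it is reasonable for a test to treat it as a failure even though it is technically possible.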
I have made quite a few genetic algorithms; they work (they find a reasonable solution quickly). But I have now discovered TDD. Is there a way to write a genetic algorithm (which relies heavily on random numbers) in a TDD way?
To pose the question more generally, How do you test a non-deterministic method/function. Here is what I have thought of:
1. Use a specific seed. This won't help if I make a mistake in the code in the first place, but it will help find bugs when refactoring.
2. Use a known list of numbers. Similar to the above, but I could follow the code through by hand (which would be very tedious).
3. Use a constant number. At least I know what to expect. It would be good to ensure that a die always reads 6 when RandomFloat(0,1) always returns 1.
4. Try to move as much of the non-deterministic code out of the GA as possible, which seems silly as that is the core of its purpose.
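The constant-number idea can be sketched like this. `roll_die` is a hypothetical function, and its `random_float` parameter stands in for the `RandomFloat(0,1)` call mentioned in the question.

```python
def roll_die(random_float):
    """Map a float in [0, 1] onto a die face from 1 to 6."""
    r = random_float(0.0, 1.0)
    # min() keeps r == 1.0 on face 6 instead of producing a 7.
    return 1 + min(5, int(r * 6))


# A constant "random" source makes the expected face obvious.
print(roll_die(lambda low, high: 1.0))
print(roll_die(lambda low, high: 0.0))
```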
Links to very good books on testing would be appreciated too.
Seems to me that the only way to test its consistent logic is to apply consistent input, ... or treat each iteration as a single automaton whose state is tested before and after that iteration, turning the overall nondeterministic system into testable components based on deterministic iteration values.
For variations/breeding/attribute inheritance in iterations, test those values on the boundaries of each iteration and test the global output of all iterations based on known input/output from successful iteration-subtests ...
Because the algorithm is iterative, you can use induction in your testing: ensure it works for 1 iteration and, assuming it works for n iterations, for n+1 iterations, to show it will produce correct results (regardless of data determinism) for a given input range/domain and the constraints on possible values in the input.
Edit: I found these strategies for testing nondeterministic systems, which might provide some insight. They might be helpful for statistical analysis of live results once the TDD/development process proves the logic is sound.
I would test random functions by testing them a number of times and analyzing whether the distribution of return values meets the statistical expectations (this involves some statistical knowledge).
If you're talking TDD, I would say definitely start out by picking a constant number and growing your test suite from there. I've done TDD on a few highly mathematical problems and it helps to have a few constant cases you know and have worked out by hand to run with from the beginning.
W/R/T your 4th point, moving nondeterministic code out of the GA, I think this is probably an approach worth considering. If you can decompose the algorithm and separate the nondeterministic concerns, it should make testing the deterministic parts straightforward. As long as you're careful about how you name things I don't think that you're sacrificing much here. Unless I am misunderstanding you, the GA will still delegate to this code, but it lives somewhere else.
As far as links to very good books on (developer) testing my favorites are:
Test Driven by Lasse Koskela
Working Effectively with Legacy Code by Michael Feathers
xUnit Test Patterns by Gerard Meszaros
Next Generation Java™ Testing: TestNG and Advanced Concepts by Cédric Beust & Hani Suleiman
One way I unit test the non-deterministic functions of a GA is to put the selection of random numbers in a different function from the logic that uses those random numbers.
For example, if you have a function that takes a gene (vector of something) and takes two random points of the gene to do something with them (mutation or whatever), you can put the generation of the random numbers in one function, and then pass them along with the gene to another function that contains the logic that operates on those numbers.
This way you can do TDD with the logic function and pass it certain genes and certain numbers, knowing exactly what the logic should do to the gene given those numbers and being able to write asserts on the modified gene.
Another way to test code that generates random numbers is to externalize that generation to another class, which could be accessed via a context or loaded from a config value, and to use a different one for test executions. There would be two implementations of that class: one for production that generates actual random numbers, and another for testing that accepts in advance the numbers it will later return. Then in the test you can provide the exact numbers that the class will supply to the tested code.
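A short sketch of that split, using a swap mutation as the example operation. All three function names are illustrative, and swapping two positions is just one possible "something" to do with the two random points.

```python
import random


def choose_two_points(length, rng):
    """Random part: pick two distinct positions in a gene."""
    i, j = rng.sample(range(length), 2)
    return i, j


def swap_mutation(gene, i, j):
    """Logic part: return a copy of gene with positions i and j swapped."""
    mutated = list(gene)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def mutate(gene, rng):
    """Production entry point: glue the random part to the logic part."""
    i, j = choose_two_points(len(gene), rng)
    return swap_mutation(gene, i, j)


# The logic function can be tested with hand-picked positions.
print(swap_mutation([1, 2, 3, 4], 0, 3))
```

The asserts on `swap_mutation` need no seed at all. Only `choose_two_points` touches randomness, and for it a range check is usually enough.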
You could write a redundant neural network to analyze the results from your algorithm and have the output ranked based on expected outcomes. :)
Break your method down as much as you can. Then you can also have a unit test around just the random part to check the range of values. Even have the test run it a few times to see if the result changes.
All of your functions should be completely deterministic. This means that none of the functions you are testing should generate the random number inside the function itself. You will want to pass that in as a parameter. That way, when your program is making decisions based on your random numbers, you can pass in representative numbers to test the expected output for that number. The only thing that shouldn't be deterministic is your actual random number generator, which you don't really need to worry too much about because you shouldn't be writing this yourself. You should be able to just assume it works as long as it's an established library.
That's for your unit tests. For your integration tests, if you are doing that, you might look into mocking your random number generation, replacing it with an algorithm that will return known numbers from 0..n for every random number that you need to generate.
I wrote a C# TDD Genetic Algorithm didactic application:
http://code.google.com/p/evo-lisa-clone/
Let's take the simplest random result method in the application: PointGenetics.Create, which creates a random point, given the boundaries. For this method I used 5 tests, and none of them relies on a specific seed:
http://code.google.com/p/evo-lisa-clone/source/browse/trunk/EvoLisaClone/EvoLisaCloneTest/PointGeneticsTest.cs
The randomness test is simple: for a large boundary (many possibilities), two consecutive generated points should not be equal. The remaining tests check other constraints.
Well the most testable part is the fitness function - where all your logic will be. this can be in some cases quite complex (you might be running all sorts of simulations based on input parameters) so you wanna be sure all that stuff works with a whole lot of unit tests, and this work can follow whatever methodology.
With regards to testing the GA parameters (mutation rate, cross-over strategy, whatever) if you're implementing that stuff yourself you can certainly test it (you can again have unit tests around mutation logic etc.) but you won't be able to test the 'fine-tuning' of the GA.
In other words, you won't be able to test if GA actually performs other than by the goodness of the solutions found.
A test that the algorithm gives you the same result for the same input could help you but sometimes you will make changes that change the result picking behavior of the algorithm.
I would make the most effort to have a test that ensures that the algorithm gives you a correct result. If the algorithm gives you a correct result for a number of static seeds and random values the algorithm works or is not broken through the changes made.
Another opportunity in TDD is the possibility to evaluate the algorithm. If you can automatically check how good a result is, you could add tests that show that a change hasn't lowered the quality of your results or increased your calculation time unreasonably.
If you want to test your algorithm with many base seeds, you may want two test suites: one that runs a quick test after every save to ensure that you haven't broken anything, and one that runs for a longer time for later evaluation.
I would highly suggest looking into using mock objects for your unit test cases (http://en.wikipedia.org/wiki/Mock_object). You can use them to mock out objects that make random guesses in order to cause you to get expected results instead.
Imagine that you have an internally controlled list of vendors. Now imagine that you want to match unstructured strings against that list. Most will be easy to match, but some may be reasonably impossible. The algorithm will assign a confidence to each match, but a human needs to confirm all matches produced.
How could this algorithm be unit tested? The only idea I have had so far is to take a sample of pairs matched by humans and make sure the algorithm is able to successfully match those, omitting strings that I couldn't reasonably expect our algorithm to handle. Is there a better way?
I'd try some 'canonical' pairs, both "should match" and "shouldn't match" pairs, and test only if the confidence is above (or below) a given threshold.
Maybe you can also do some ordering checks, such as "no pair should have greater confidence than the one from the exact match pair", or "the pair that matches all consonants should be >= the only vowels one".
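A sketch of threshold and ordering checks on canonical pairs. The real matching algorithm is not shown in the thread, so `match_confidence` below is a stand-in built on the standard library's `difflib.SequenceMatcher`; the vendor names and the thresholds of 0.6 and 0.5 are likewise made up for illustration.

```python
from difflib import SequenceMatcher


def match_confidence(candidate, vendor):
    """Stand-in matcher: similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, candidate.lower(), vendor.lower()).ratio()


# Hypothetical canonical pairs.
should_match = [("Acme Corp", "Acme Corporation"), ("ACME CORP", "Acme Corp")]
should_not_match = [("Acme Corp", "Zenith Ltd")]

exact = match_confidence("Acme Corp", "Acme Corp")

for candidate, vendor in should_match:
    confidence = match_confidence(candidate, vendor)
    assert confidence >= 0.6   # threshold check
    assert confidence <= exact  # ordering check: nothing beats an exact match

for candidate, vendor in should_not_match:
    assert match_confidence(candidate, vendor) < 0.5
```

Because the asserts only look at thresholds and relative order, the tests keep passing when the scoring is retuned, as long as the ranking of the canonical pairs is preserved.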
You can also test if the confidence of strings your algorithm won't handle well is sufficiently low. In this way you can see if there is a threshold over which you can trust your algorithm as well.
An interesting exercise would be to store the human answers that correct your algorithm and try to see if you could improve your algorithm to not get them wrong.
If you can, add the new matches to the unit tests.
I don't think there's a better way than what you describe; effectively, you're just using a set of predefined data to test that the algorithm does what you expect. For any very complicated algorithm which has very nonlinear inputs and outputs, that's about the best you can do; choose a good test set, and assure that you run properly against that set of known values. If other values come up which need to be tested in the future, you can add them to the set of tested values.
That sounds fair. If it's possible (given time constraints) to get a large sample of human matches, you could get a picture of how well your algorithm is doing. You could design specific unit tests which pass if they're within X% of correctness.
Best of luck.
I think there are two issues here: whether your code behaves according to the algorithm, and whether the algorithm is successful (i.e. does not accept answers which a human later rejects, and does not reject answers a human would accept).
Issue 1 is regular testing. For issue 2, I would go with previous result sets (i.e. compare the algorithm's results to human ones).
What you describe is the best way. Because the best match is subjective, only a human can come up with the appropriate test cases.
It sounds as though you are describing an algorithm which is deterministic, but one which is sufficiently difficult that your best initial guess at the correct result is going to be whatever your current implementation delivers to you (aka deterministic implementation to satisfy fuzzy requirements).
For those sorts of circumstances, I will use a "Guru Checks Changes" pattern. Generate a collection of inputs, record the outputs, and in subsequent runs of the unit tests, verify that the outcome is consistent with the previous results. Not so great for ensuring that the target algorithm is implemented correctly, but it is effective for ensuring that the most recent refactoring hasn't changed the behavior in the test space.
A variation of this - which may be more palatable for your circumstance, is to start from the same initial data collection, but rather than trying to preserve precisely the same result every time you instead predefine some buckets, and flag any time an implementation change moves a test result from one confidence bucket to another.
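The bucket variation might look like the sketch below. The bucket boundaries (0.8 and 0.5), the bucket names, and the sample inputs are assumptions. In practice `recorded` would be loaded from the results saved by an earlier, reviewed run.

```python
def bucket(confidence):
    """Assign a confidence score to a coarse, predefined bucket."""
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"


def bucket_changes(recorded, current):
    """Return the inputs whose bucket differs from the recorded run."""
    changes = {}
    for key, old_confidence in recorded.items():
        old_bucket = bucket(old_confidence)
        new_bucket = bucket(current[key])
        if old_bucket != new_bucket:
            changes[key] = (old_bucket, new_bucket)
    return changes


# Hypothetical recorded results and results from the current build.
recorded = {"ACME corp.": 0.91, "Acme Co": 0.62}
current = {"ACME corp.": 0.85, "Acme Co": 0.45}
print(bucket_changes(recorded, current))
```

Small score movements inside a bucket are ignored, so only changes large enough to alter how a match would be treated get flagged for review.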
Samples that have clearly correct answers (exact matches, null matches, high value corner cases) should be kept in a separate test.
Let's say we have a simple function defined in a pseudo language.
List<Numbers> SortNumbers(List<Numbers> unsorted, bool ascending);
We pass in an unsorted list of numbers and a boolean specifying ascending or descending sort order. In return, we get a sorted list of numbers.
In my experience, some people are better at capturing boundary conditions than others. The question is, "How do you know when you are 'done' capturing test cases"?
We can start listing cases now and some clever person will undoubtedly think of 'one more' case that isn't covered by any of the previous.
Don't waste too much time trying to think of every boundary condition. Your tests won't be able to catch every bug first time around. The idea is to have tests that are pretty good, and then each time a bug does surface, write a new test specifically for that bug so that you never hear from it again.
Another note I want to make is about code coverage tools. In a language like C# or Java where you have many get/set and similar methods, you should not be shooting for 100% coverage. That means you are wasting too much time writing tests for trivial code. You only want 100% coverage on your complex business logic. If your full codebase is closer to 70-80% coverage, you are doing a good job. If your code coverage tool allows multiple coverage metrics, the best one is 'block coverage', which measures coverage of 'basic blocks'. Other types are class and method coverage (which don't give you as much information) and line coverage (which is too fine-grained).
How do you know when you are 'done' capturing test cases?
You don't. You can't get to 100% except for the most trivial cases. Also 100% coverage (of lines, paths, conditions...) doesn't tell you you've hit all boundary conditions.
Most importantly, the test cases are not write-and-forget. Each time you find a bug, write an additional test. Check it fails with the original program, check it passes with the corrected program and add it to your test set.
An excerpt from The Art of Software Testing by Glenford J. Myers:
1. If an input condition specifies a range of values, write test cases for the ends of the range, and invalid-input test cases for situations just beyond the ends.
2. If an input condition specifies a number of values, write test cases for the minimum and maximum number of values and one beneath and beyond these values.
3. Use guideline 1 for each output condition.
4. Use guideline 2 for each output condition.
5. If the input or output of a program is an ordered set, focus attention on the first and last elements of the set.
6. In addition, use your ingenuity to search for other boundary conditions.
(I've only pasted the bare minimum for copyright reasons.)
Points 3. and 4. above are very important. People tend to forget boundary conditions for the outputs. 5. is OK. 6. really doesn't help :-)
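As an illustration, these guidelines can be applied to the `SortNumbers` function from the question. The Python version below is a hypothetical stand-in for that pseudocode signature, and the case list is a starting point under that assumption, not a complete set.

```python
def sort_numbers(unsorted, ascending):
    """Stand-in for SortNumbers: return a new sorted list."""
    return sorted(unsorted, reverse=not ascending)


# (input, expected ascending output) pairs chosen around the boundaries.
boundary_cases = [
    ([], []),                            # minimum number of values
    ([7], [7]),                          # one above the minimum
    ([2, 1], [1, 2]),                    # smallest list that can be out of order
    ([3, 3, 3], [3, 3, 3]),              # all values equal
    ([1, 2, 3], [1, 2, 3]),              # already sorted
    ([3, 2, 1], [1, 2, 3]),              # reverse sorted: first and last swap
    ([0, -1, 10**12], [-1, 0, 10**12]),  # negative and very large values
]

for unsorted, expected in boundary_cases:
    assert sort_numbers(unsorted, True) == expected
    assert sort_numbers(unsorted, False) == expected[::-1]
```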
Short exam
This is more difficult than it looks. Myers offers this test:
The program reads three integer values from an input dialog. The three values represent the lengths of the sides of a triangle. The program displays a message that states whether the triangle is scalene, isosceles, or equilateral.
Remember that a scalene triangle is one where no two sides are equal, whereas an isosceles triangle has two equal sides, and an equilateral triangle has three sides of equal length. Moreover, the angles opposite the equal sides in an isosceles triangle also are equal (it also follows that the sides opposite equal angles in a triangle are equal), and all angles in an equilateral triangle are equal.
Write your test cases. How many do you have? Myers asks 14 questions about your test set and reports that highly qualified professional programmers average 7.8 out of a possible 14.
From a practical standpoint, I create a list of tests that I believe must pass prior to acceptance. I test these and automate where possible. Based on how much time I've estimated for the task or how much time I've been given, I extend my test coverage to include items that should pass prior to acceptance. Of course, the line between must and should is subjective. After that, I update automated tests as bugs are discovered.
@Keith
I think you nailed it, code coverage is important to look at if you want to see how "done" you are, but I think 100% is a bit unrealistic a goal. Striving for 75-90% will give you pretty good coverage without going overboard... don't test for the pure sake of hitting 100%, because at that point you are just wasting your time.
A good code coverage tool really helps.
100% coverage doesn't mean that it definitely is adequately tested, but it's a good indicator.
For .NET, NCover's quite good, but is no longer open source.
@Mike Stone -
Yeah, perhaps that should have been "high coverage" - we aim for 80% minimum, past about 95% it's usually diminishing returns, especially if you have belt 'n' braces code.