Recommender systems, association rule mining for binary characterisation data - data-mining

I am currently having a dataset with several test cases which describes passes of tests (1), failures of tests (0) and in case that a test case has never been done (None).
My aim is to determine the correlations/associations so that I can provide some insights on which parts of the data, the more frequent failures occur as well as for passes.
I came across the apriori/fpgrowth algorithms, although they are suitable to determine frequencies as if a pass/failure was a single item in the dataset. Hence, it only computes the corresponding metrics based on the existence of pass test cases, neglecting the rest as 'non-existent' (Failures and None).
My question is if it is possible to use these algorithms with all the cases I am having on the dataset (eg. Pass, Failure, None). So that I can determine if there's influence/correlation between pass test cases and fail test cases.
I only came across with the idea of implementing the algorithm two separate times (one for pass test cases, and the other for fail) and obtain results for two separate cases. But I wasn't sure if this will be sensible enough for my final results.
Some help on this complicated problem would be very much appreciated. :)

Related

Unit testing a very specific function

To make sure this question is concrete enough for the standards in the FAQ, I am just asking the following: What are some sources that discuss the most common ways to apply unit testing to a very specific function, generally a function that relies on vendor data or other very specific data such that synthetic data is unhelpful in the test? If you're interested in more background, read below.
Background:
I write unit tests often in my daily code development, but I also try to make my code as abstract and reusable as possible. In a new project that I've joined, there are many cases where the code consists of very specific functions that are meant to accept very specifically formatted input data and store output data to database tables. Much of the input data consists of vendor data or other in-house data, and is accessed through calls to both vendor and in-house APIs.
The only idea I have so far is to test the kinds of failures hit upon when input data is poorly formatted. I will definitely write this test, but it's pretty useless for our team as far as tests go. Much more useful tests ought to check that the logic of these data manipulations is correct, which involves checking the accuracy of the output data based on the input data.
Unfortunately, I don't have any benchmark data sets where I definitively know what the output should be. Others have suggested to create my own synthetic input data (like a matrix of all 1's or something contrived where I can predict what the output should be). Unfortunately, the operations performed by the function are very non-linear (binning things by weighted percentiles and getting aggregate statistics over each percentile grouping). Any test of this that's based off of totally contrived synthetic input data won't be very useful for us either, and the time cost of formatting it and then writing to some synthetic output database table and reading it to check in the unit test kind of makes such a test worthless.
I know that unit tests should test for just one behavior. I'm just not sure how to break apart a function that does something like aggregating complicated statistics across weighted percentile groupings and boil that down to "just one thing" to test.
What are some standards used in this setting?
I've run into similar issues with very large methods. My advice would be to refactor the code utilizing Dependency Injection and adhering to the Single Responsibility Principle. Then test each class according to its responsibility.

Is it possible to use TDD with image processing algorithms?

Recently, I have worked in a project were TDD (Test Driven Development) was used. The project was a web application developed in Java and, although unit-testing web applications may not be trivial, it was possible using mocking (we have used the Mockito framework).
Now I will start a project where I will use C++ to work with image processing (mostly image segmentation) and I'm not sure whether using TDD is a good idea. The problem is that is very hard to tell whether the result of a segmentation is right or not, and the same problem applies to many other image processing algorithms.
So, what I would like to know is if someone here have successfully used TDD with image segmentation algorithms (not necessarily segmentation algorithms).
at a minimum you can use the tests for regression testing. For example, suppose you have 5 test images for a particular segmentation algorithm. You run the 5 images through the code and manually verify the results. The results, when correct, are stored on disk somewhere, and future executions of these tests compare the generated results to the stored results.
that way, if you ever make a breaking change, you'll catch it, but more importantly you only have to go through a (correct) manual test cycle once.
Whenever I do any computer-vision related development TDD is almost standard practice. You have images and something you want to measure. Step one is to hand-label a (large) subset of the images. This gives you test data. The process (for full correctness) is then to divide your test-set in two, a "development set" and a "verification set". You do repeated development cycles until your algorithm is accurate enough when applied to the development set. Then you verify the result on the veriication set (so that you're not overtraining on some weird aspect of your development set.
This is test driven development at its purest.
Note that you're testing two different things when developing heavily algorithm dependent software like this.
The regular bugs you'll get in your software. These can be tested using "normal" TDD techniques
The performance of your algorithm, for which you need a system outlined above.
A program can be bug free according to (1) but not quite according to (2). For example, a very simple image segmentation algorithm says: "the left half of the image is one segment, the right half is another segment. This program can be made bug free according to (1) quite easily. It is another matter entirely wether it satisfies your performance needs. Don't confuse the two aspects, and don't let one interfere with the other.
More specifically, I'd advice you to develop the algorithm first, buggy warts and all, and then use TDD with the algorithm (not the code!) and perhaps other requirements of the software as specification for a separate TDDevelopment process. Doing unit tests for small temporary helper functions deep within some reasonably complex algorithm under heavy development is a waste of time and effort.
TDD in image processing only makes sense for deterministic problems like:
image arithmetic
histogram generation
and so on..
However TDD is not suitable for feature extraction algorithms like:
edge detection
segmentation
corner detection
... since no algorithm can solve this kind of problems for all images perfectly.
I think the best you can do is test the simple, mathematically well-defined building blocks your algorithm consists of, like linear filters, morphological operations, FFT, wavelet transforms etc. These are often tricky enough to implement efficiently and correctly for all border cases so verifying them does make sense.
For an actual algorithm like image segmentation, TDD doesn't make much sense IMHO. I don't even think unit-tests make sense here. Sure, you can write tests, but those will always be extremely fragile. A typical image processing algorithm needs a few parameters that have to be adjusted for the desired results (a process that can't be automated, and can't be done before the algorithm is working). The results of a segmentation algorithm aren't well defined either, but your unit test can only test for some well-defined property. An algorithm can have that property without doing what you want, or the other way round, so your test result isn't very informative. Also, to test the results of a segmentation algorithm you need to write a lot of pretty hard code, while verifying the results visually is pretty easy and you have to do it anyway.
I think in a way it's similar to unit-testing user interfaces: Testing the actual well-defined functionality (e.g. when the user clicks this button, some item is added to this list and this label shows that text...) is relatively easy and can save a lot of work and debugging. But no test in the world will tell you if your UI is usable, understandable or pretty, because these things just aren't well defined.
we had some discussion on the very same "problem" with many remarks mentioned in your comments below those answers here.
We came to the end, that TDD in in computer vision / image processing (concerning the global goal of segmention, detection or sth like that) could be:
get an image/sequence that should be processed and create a test for that image: desired output and a metric to tell how far your result may differ from that "ground truth".
get another image/sequence for a different setting (different lighting, different objects or something like that), where your algorithm fails and write a test for that.
improve your algorithm in a way that it solves all previous tests.
go back to 2.
no idea whether this is applicable, creating the tests will be much more complex than in traditional TDD since it might be hard to define the allowed differences between your ground truth and your algorithm output.
Probably it's better to just use some QualityDrivenDevelopment where your changes just shouldnt make things "worse" (you again have to find a metric for that) than before.
Obiviously you still can use traditional unit testing for deterministic parts of those algorithms, but that's not the real problem of "TDD-in-signal-processing"
The image processing tests that you describe in your question take place at a much higher level than most of the tests that you will write using TDD.
In a true Test Driven Development process you will first write a failing test before adding any new functionality to your software, then write the code that causes the test to pass, rinse and repeat.
This process yields a large library of Unit Tests, sometimes with more LOC of tests than functional code!
Because your analytic algorithms have structured behavior, they would be an excellent match for a TDD approach.
But I think the question you are really asking is "how do I go about executing a suite of Integration Tests against fuzzy image processing software?" You might think I am splitting hairs, but this distinction between Unit Tests and Integration Tests really gets to the heart of what Test Driven Development means. The benefits of the TDD process come from the rich supporting fabric of Unit Tests more than anything else.
In your case I would compare the Integration Test suite to automated performance metrics against a web application. We want to accumulate a historical record of execution times, but we probably don't want to explicitly fail the build for a single poorly performing execution (which might have been affected by network congestion, disk I/O, whatever). You might set some loose tolerances around performance of your test suite and have the Continuous Integration server kick out daily reports that give you a high level overview of the performance of your algorithm.
I'd say TDD is much easier in such an application than in a web one. You have a completely deterministic algorithm you have to test. You don't have to worry about fuzzy stuff like user input and HTML rendering.
Your algorithm consists of a number of steps. Each of these steps can be tested. If you give them fixed, known input, they should yield fixed, known output. So write a test for that. You can't test that the algorithm "is correct" in general, but you can give it data for which you've already precomputed the correct result, so you can verify that it yields the correct output in that case.
I am not really into your problem, so I don't know its hot spots. However, the final result of your algorithm is hopefully deterministic, so you can perform functional testing on it. Of course, you will have to determine a "known good" result. I know of TDD performed on graphic libraries (VTK, to be precise). The comparison is done on the final result image, pixel by pixel. Without going in so much detail, if you have a known good result, you can perform an md5 of the test result and compare it against the md5 of the known-good.
For unit testing, I am pretty sure you can test individual routines. This will force you to have a very fine-grained development style.
Might want to take a look at this paper
If your goal is to optimize an algorithm rather than verifying correctness you need a metric. A good metric would measure the performance criteria underlying in your algorithm. For a segmentation algorithm this could be the sum of standard deviations of pixel data within each segment. Using the metric you can use threshold levels of acceptance or rank versions of the algorithm.
You can use a statistical approach where you have many examples and correct outcomes, and the test runs all of them and evaluates the algorithm on them. It then produces a single number that is the combined success rate of all of them.
This way you are less sensitive to specific failures and your test is more robust.
You can then use a threshold on the success rate to see if the test failed or not.

What can be alternative metrics to code coverage?

Code coverage is propably the most controversial code metric. Some say, you have to reach 80% code coverage, other say, it's superficial and does not say anything about your testing quality. (See Jon Limjap's good answer on "What is a reasonable code coverage % for unit tests (and why)?".)
People tend to measure everything. They need comparisons, benchmarks etc.
Project teams need a pointer, how good their testing is.
So what are alternatives to code coverage? What would be a good metric that says more than "I touched this line of code"?
Are there real alternatives?
If you are looking for some useful metrics that tell you about the quality (or lack there of) of your code, you should look into the following metrics:
Cyclomatic Complexity
This is a measure of how complex a method is.
Usually 10 and lower is good, 11-25 is poor, higher is terrible.
Nesting Depth
This is a measure of how many nested scopes are in a method.
Usually 4 and lower is good, 5-8 is poor, higher is terrible.
Relational Cohesion
This is a measure of how well related the types in a package or assembly are.
Relational cohesion is somewhat of a relative metric, but useful none the less.
Acceptable levels depends on the formula. Given the following:
R: number of relationships in package/assembly
N: number of types in package/assembly
H: Cohesion of relationship between types
Formula: H = (R+1)/N
Given the above formula, acceptable range is 1.5 - 4.0
Lack of Cohesion of Methods (LCOM)
This is a measure of how cohesive a class is.
Cohesion of a class is a measure of how many fields each method references.
Good indication of whether your class meets the Principal of Single Responsibility.
Formula: LCOM = 1 - (sum(MF)/M*F)
M: number of methods in class
F: number of instance fields in class
MF: number of methods in class accessing a particular instance field
sum(MF): the sum of MF over all instance fields
A class that is totally cohesive will have an LCOM of 0.
A class that is completely non-cohesive will have an LCOM of 1.
The closer to 0 you approach, the more cohesive, and maintainable, your class.
These are just some of the key metrics that NDepend, a .NET metrics and dependency mapping utility, can provide for you. I recently did a lot of work with code metrics, and these 4 metrics are the core key metrics that we have found to be most useful. NDepend offers several other useful metrics, however, including Efferent & Afferent coupling and Abstractness & Instability, which combined provide a good measure of how maintainable your code will be (and whether or not your in what NDepend calls the Zone of Pain or the Zone of Uselessness.)
Even if you are not working with the .NET platform, I recommend taking a look at the NDepend metrics page. There is a lot of useful information there that you might be able to use to calculate these metrics on whatever platform you develop on.
Crap4j is one fairly good metrics that I'm aware of...
Its a Java implementation of the Change Risk Analysis and Predictions software metric which combines cyclomatic complexity and code coverage from automated tests.
Bug metrics are also important:
Number of bugs coming in
Number of bugs resolved
To detect for instance if bugs are not resolved as fast as new come in.
What about watching the trend of code coverage during your project?
As it is the case with many other metrics a single number does not say very much.
For example it is hard to tell wether there is a problem if "we have a Checkstyle rules compliance of 78.765432%". If yesterday's compliance was 100%, we are definitely in trouble. If it was 50% yesterday, we are probably doing a good job.
I alway get nervous when code coverage has gotten lower and lower over time. There are cases when this is okay, so you cannot turn off your head when looking at charts and numbers.
BTW, sonar (http://sonar.codehaus.org/) is a great tool for watching trends.
Using code coverage on it's own is mostly pointless, it gives you only insight if you are looking for unnecessary code.
Using it together with unit-tests and aiming for 100% coverage will tell you that all the 'tested' parts (assumed it was all successfully too) work as specified in the unit-test.
Writing unit-tests from a technical design/functional design, having 100% coverage and 100% successful tests will tell you that the program is working like described in the documentation.
Now the only thing you need is good documentation, especially the functional design, a programmer should not write that unless (s)he is an expert of that specific field.
Scenario coverage.
I don't think you really want to have 100% code coverage. Testing say, simple getters and setters looks like a waste of time.
The code always runs in some context, so you may list as many scenarios as you can (depending on the problem complexity sometimes even all of them) and test them.
Example:
// parses a line from .ini configuration file
// e.g. in the form of name=value1,value2
List parseConfig(string setting)
{
(name, values) = split_string_to_name_and_values(setting, '=')
values_list = split_values(values, ',')
return values_list
}
Now, you have many scenarios to test. Some of them:
Passing correct value
List item
Passing null
Passing empty string
Passing ill-formated parameter
Passing string with with leading or ending comma e.g. name=value1, or name=,value2
Running just first test may give you (depending on the code) 100% code coverage. But you haven't considered all the posibilities, so that metric by itself doesn't tell you much.
Code Coverage is just an indicator and helps pointing out lines which are not executed at all in your tests, which is quite interesting. If you reach 80% code coverage or so, it starts making sense to look at the remaining 20% of lines to identify if you are missing some use case. If you see "aha, this line gets executed if I pass an empty vector" then you can actually write a test which passes an empty vector.
As an alternative I can think of, if you have a specs document with Use Cases and Functional Requirements, you should map the unit tests to them and see how many UC are covered by FR (of course it should be 100%) and how many FR are covered by UT (again, it should be 100%).
If you don't have specs, who cares? Anything that happens will be ok :)
How about (lines of code)/(number of test cases)? Not extremely meaningful (since it depends on LOC), but at least it's easy to calculate.
Another one could be (number of test cases)/(number of methods).
I wrote a blog post about why High Test Coverage Ratio is a Good Thing Anyway.
I agree that: when a portion of code is executed by tests, it doesn’t mean that the validity of the results produced by this portion of code is verified by tests.
But still, if you are heavily using contracts to check states validity during tests execution, high test coverage will mean a lot of verification anyway.
The value in code coverage is it gives you some idea of what has been exercised by tests.
The phrase "code coverage" is often used to mean statement coverage, e.g., "how much of my code (in lines) has been executed", but in fact there are over a hundred varieties of "coverage". These other versions of coverage try to provide a more sophisticated view what it means to exercise code.
For example, condition coverage measures how many of the separate elements of conditional expressions have been exercised. This is different than statement coverage. MC/DC
"modified condition/decision coverage" determines whether the elements of all conditional expressions have all been demonstrated to control the outcome of the conditional, and is required by the FAA for aircraft software. Path coverage meaures how many of the possible execution paths through your code have been exercised. This is a better measure than statement coverage, in that paths essentially represent different cases in the code. Which of these measures is best to use depends on how concerned you are about the effectiveness of your tests.
Wikipedia discusses many variations of test coverage reasonably well.
http://en.wikipedia.org/wiki/Code_coverage
As a rule of thumb, defect injection rates proportionally trail code yield and they both typically follow a Rayleigh distribution curve.
At some point your defect detection rate will peak and then start to diminish.
This apex represents 40% of discovered defects.
Moving forward with simple regression analysis you can estimate how many defects remain in your product at any point following the peak.
This is one component of Lawrence Putnam's model.
This hasn't been mentioned, but the amount of change in a given file of code or method (by looking at version control history) is interesting particularly when you're building up a test suite for poorly tested code. Focus your testing on the parts of the code you change a lot. Leave the ones you don't for later.
Watch out for a reversal of cause and effect. You might avoid changing untested code and you might tend to change tested code more.
SQLite is an extremely well-tested library, and you can extract all kinds of metrics from it.
As of version 3.6.14 (all statistics in the report are against that release of SQLite), the SQLite library consists of approximately 63.2 KSLOC of C code. (KSLOC means thousands of "Source Lines Of Code" or, in other words, lines of code excluding blank lines and comments.) By comparison, the project has 715 times as much test code and test scripts - 45261.5 KSLOC.
In the end, what always strikes me as the most significant is none of those possible metrics seem to be as important as the simple statement, "it meets all the requirements." (So don't lose sight of that goal in the process of achieving it.)
If you want something to judge a team's progress, then you have to lay down individual requirements. This gives you something to point to and say "this one's done, this one isn't". It's not linear (solving each requirement will require varying work), and the only way you can linearize it is if the problem has already been solved elsewhere (and thus you can quantize work per requirement).
I like revenue, sales numbers, profit. They are pretty good metrics of a code base.
Probably not only measuring the code covered (touched) by the unit tests but how good the assertions are.
One metric easy to implement is to measure the size of the Assert.AreEqual
You can create your own Assert implementation calling Assert.AreEqual and measuring the size of the object passed as second parameter.

Testing When Correctness is Poorly Defined?

I generally try to use unit tests for any code that has easily defined correct behavior given some reasonably small, well-defined set of inputs. This works quite well for catching bugs, and I do it all the time in my personal library of generic functions.
However, a lot of the code I write is data mining code that basically looks for significant patterns in large datasets. Correct behavior in this case is often not well defined and depends on a lot of different inputs in ways that are not easy for a human to predict (i.e. the math can't reasonably be done by hand, which is why I'm using a computer to solve the problem in the first place). These inputs can be very complex, to the point where coming up with a reasonable test case is near impossible. Identifying the edge cases that are worth testing is extremely difficult. Sometimes the algorithm isn't even deterministic.
Usually, I do the best I can by using asserts for sanity checks and creating a small toy test case with a known pattern and informally seeing if the answer at least "looks reasonable", without it necessarily being objectively correct. Is there any better way to test these kinds of cases?
I think you just need to write unit tests based on small sets of data that will make sure that your code is doing exactly what you want it to do. If this gives you a reasonable data-mining algorithm is a separate issue, and I don't think it is possible to solve it by unit tests. There are two "levels" of correctness of your code:
Your code is correctly implementing the given data mining algorithm (this thing you should unit-test)
The data mining algorithm you implement is "correct" - solves the business problem. This is a quite open question, it probably depends both on some parameters of your algorithm as well as on the actual data (different algorithms work for different types of data).
When facing cases like this I tend to build one or more stub data sets that reflect the proper underlying complexities of the real-life data. I often do this together with the customer, to make sure I capture the essence of the complexities.
Then I can just codify these into one or more datasets that can be used as basis for making very specific unit tests (sometimes they're more like integration tests with stub data, but I don't think that's an important distinction). So while your algorithm may have "fuzzy" results for a "generic" dataset, these algorithms almost always have a single correct answer for a specific dataset.
Well, there are a few answers.
First of all, as you mentioned, take a small case study, and do the math by hand. Since you wrote the algorithm, you know what it's supposed to do, so you can do it in a limited case.
The other one is to break down every component of your program into testable parts.
If A calls B calls C calls D, and you know that A,B,C,D, all give the right answer, then you test A->B, B->C, and C->D, then you can be reasonably sure that A->D is giving the correct response.
Also, if there are other programs out there that do what you are looking to do, try and aquire their datasets. Or an opensource project that you could use test data against, and see if your application is giving similar results.
Another way to test datamining code is by taking a test set, and then introducing a pattern of the type you're looking for, and then test again, to see if it will separate out the new pattern from the old ones.
And, the tried and true, walk through your own code by hand and see if the code is doing what you meant it to do.
Really, the challenge here is this: because your application is meant to do a fuzzy, non-deterministic kind of task in a smart way, the very goal you hope to achieve is that the application becomes better than human beings at finding these patterns. That's great, powerful, and cool ... but if you pull it off, then it becomes very hard for any human beings to say, "In this case, the answer should be X."
In fact, ideally the computer would say, "Not really. I see why you think that, but consider these 4.2 terabytes of information over here. Have you read them yet? Based on those, I would argue that the answer should be Z."
And if you really succeeded in your original goal, the end user might sometimes say, "Zowie, you're right. That is a better answer. You found a pattern that is going to make us money! (or save us money, or whatever)."
If such a thing could never happen, then why are you asking the computer to detect these kinds of patterns in the first place?
So, the best thing I can think of is to let real life help you build up a list of test scenarios. If there ever was a pattern discovered in the past that did turn out to be valuable, then make a "unit test" that sees if your system discovers it when given similar data. I say "unit test" in quotes because it may be more like an integration test, but you may still choose to use NUnit or VS.Net or RSpec or whatever unit test tools you're using.
For some of these tests, you might somehow try to "mock" the 4.2 terabytes of data (you won't really mock the data, but at some higher level you'd mock some of the conclusions reached from that data). For others, maybe you have a "test database" with some data in it, from which you expect a set of patterns to be detected.
Also, if you can do it, it would be great if the system could "describe its reasoning" behind the patterns it detects. This would let the business user deliberate over the question of whether the application was right or not.
This is tricky. This sounds similar to writing tests around our text search engine. If you keep struggling, you'll figure something out:
Start with a small, simplified but reasonably representative data sample, and test basic behavior doing this
Rather than asserting that the output is exactly some answer, sometimes it's better to figure out what is important about it. For example, for our search engine, I didn't care so much about the exact order the documents were listed, as long as the three key ones were on the first page of results.
As you make a small, incremental change, figure out what the essence of it is and write a test for that. Even though the overall calculations take many inputs, individual changes to the codebase should be isolatable. For example, we found certain documents weren't being surfaced because of the presence of hyphens in some of the key words. We created tests that testing that this was behaving how we expected.
Look at tools like Fitness, which allow you to throw a large number of datasets at a piece of code and assert things about the results. This may be easier to understand than more traditional unit tests.
I've gone back to the product owner, saying "I can't understand how this will work. How will we know if it's right?" Maybe s/he can articulate the essence of the vaguely defined problem. This has worked really well for me many times, and I've talked people out of features because they couldn't be explained.
Be creative!
Ultimately, you have to decide what your program should be doing, and then test for that.

Testing with random inputs best practices

NOTE: I mention the next couple of paragraphs as background. If you just want a TL;DR, feel free to skip down to the numbered questions as they are only indirectly related to this info.
I'm currently writing a python script that does some stuff with POSIX dates (among other things). Unit testing these seems a little bit difficult though, since there's such a wide range of dates and times that can be encountered.
Of course, it's impractical for me to try to test every single date/time combination possible, so I think I'm going to try a unit test that randomizes the inputs and then reports what the inputs were if the test failed. Statisically speaking, I figure that I can achieve a bit more completeness of testing than I could if I tried to think of all potential problem areas (due to missing things) or testing all cases (due to sheer infeasability), assuming that I run it enough times.
So here are a few questions (mainly indirectly related to the above ):
What types of code are good candidates for randomized testing? What types of code aren't?
How do I go about determining the number of times to run the code with randomized inputs? I ask this because I want to have a large enough sample to determine any bugs, but don't want to wait a week to get my results.
Are these kinds of tests well suited for unit tests, or is there another kind of test that it works well with?
Are there any other best practices for doing this kind of thing?
Related topics:
Random data in unit tests?
I agree with Federico - randomised testing is counterproductive. If a test won't reliably pass or fail, it's very hard to fix it and know it's fixed. (This is also a problem when you introduce an unreliable dependency, of course.)
Instead, however, you might like to make sure you've got good data coverage in other ways. For instance:
Make sure you have tests for the start, middle and end of every month of every year between 1900 and 2100 (if those are suitable for your code, of course).
Use a variety of cultures, or "all of them" if that's known.
Try "day 0" and "one day after the end of each month" etc.
In short, still try a lot of values, but do so programmatically and repeatably. You don't need every value you try to be a literal in a test - it's fine to loop round all known values for one axis of your testing, etc.
You'll never get complete coverage, but it will at least be repeatable.
EDIT: I'm sure there are places where random tests are useful, although probably not for unit tests. However, in this case I'd like to suggest something: use one RNG to create a random but known seed, and then seed a new RNG with that value - and log it. That way if something interesting happens you will be able to reproduce it by starting an RNG with the logged seed.
With respect to the 3rd question, in my opinion random tests are not well suited for unit testing. If applied to the same piece of code, a unit test should succeed always, or fail always (i.e., wrong behavior due to bugs should be reproducible). You could however use random techniques to generate a large data set, then use that data set within your unit tests; there's nothing wrong with it.
Wow, great question! Some thoughts:
Random testing is always a good confidence building activity, though as you mentioned, it's best suited to certain types of code.
It's an excellent way to stress-test any code whose performance may be related to the number of times it's been executed, or to the sequence of inputs.
For fairly simple code, or code that expects a limited type of input, I'd prefer systematic test that explicitly cover all of the likely cases, samples of each unlikely or pathological case, and all the boundary conditions.
Q1) I found that distributed systems with lots of concurrency are good candidates for randomized testing. It is hard to create all possible scenarios for such applications, but random testing can expose problems that you never thought about.
Q2) I guess you could try to use statistics to build an confidence interval around having discovered all "bugs". But the practical answer is: run your randomized tests as many times as you can afford.
Q3) I have found that randomized testing is useful but after you have written the normal battery of unit, integration and regression tests. You should integrate your randomized tests as part of the normal test suite, though probably a small run. If nothing else, you avoid bit rot in the tests themselves, and get some modicum coverage as the team runs the tests with different random inputs.
Q4) When writing randomized tests, make sure you save the random seed with the results of the tests. There is nothing more frustrating than finding that your random tests caught a bug, and not being able to run the test again with the same input. Make sure your test can either be executed with the saved seed too.
A few things:
With random testing, you can't really tell how good a piece of code is, but you can tell how bad it is.
Random testing is better suited for things that have random inputs -- a prime example is anything that's exposed to users. So, for example, something that randomly clicks & types all over your app (or OS) is a good test of general robustness.
Similarly, developers count as users. So something that randomly assembles a GUI from your framework is another good candidate.
Again, you're not going to find all the bugs this way -- what you're looking for is "if I do a million whacky things, do ANY of them result in system corruption?" If not, you can feel some level of confidence that your app/OS/SDK/whatever might hold up to a few days' exposure to users.
...But, more importantly, if your random-beater-upper test app can crash your app/OS/SDK in about 5 minutes, that's about how long you'll have until the first fire-drill if you try to ship that sucker.
Also note: REPRODUCIBILITY IS IMPORTANT IN TESTING! Hence, have your test-tool log the random-seed that it used, and have a parameter to start with the same seed. In addition, have it either start from a known "base state" (i.e., reinstall everything from an image on a server & start there) or some recreatable base-state (i.e., reinstall from that image, then alter it according to some random-seed that the test tool takes as a parameter.)
Of course, the developers will appreciate if the tool has nice things like "save state every 20,000 events" and "stop right before event #" and "step forward 1/10/100 events." This will greatly aid them in reproducing the problem, finding and fixing it.
As someone else pointed out, servers are another thing exposed to users. Get yourself a list of 1,000,000 URLs (grep from server logs), then feed them to your random number generator.
And remember: "system went 24 hours of random pounding without errors" does not mean it's ready to ship, it just means it's stable enough to start some serious testing. Before it can do that, QA should feel free to say "look, your POS can't even last 24 hours under life-like random user simulation -- you fix that, I'm going to spend some time writing better tools."
Oh yeah, one last thing: in addition to the "pound it as fast & hard as you can" tests, have the ability to do "exactly what a real user [who was perhaps deranged, or a baby bounding the keyboard/mouse] would do." That is, if you're doing random user-events; do them at the speed that a very-fast typist or very-fast mouse-user could do (with occasional delays, to simulate a SLOW person), in addition to "as fast as my program can spit-out events." These are two **very different* types of tests, and will get very different reactions when bugs are found.
To make tests reproducible, simply use a fixed seed start value. That ensures the same data is used whenever the test runs. Tests will reliably pass or fail.
Good / bad candidates? Randomized tests are good at finding edge cases (exceptions). A problem is to define the correct result of a randomized input.
Determining the number of times to run the code: Simply try it out, if it takes too long reduce the iteration count. You may want to use a code coverage tool to find out what part of your application is actually tested.
Are these kinds of tests well suited for unit tests? Yes.
This might be slightly off-topic, but if you're using .net, there is Pex, which does something similar to randomized testing, but with more intuition by attempting to generate a "random" test case that exercises all of the paths through your code.
Here is my answer to a similar question: Is it a bad practice to randomly-generate test data?. Other answers may be useful as well.
Random testing is a bad practice a
long as you don't have a solution for
the oracle problem, i.e.,
determining which is the expected
outcome of your software given its
input.
If you solved the oracle problem, you
can get one step further than simple
random input generation. You can
choose input distributions such that
specific parts of your software get
exercised more than with simple
random.
You then switch from random testing to
statistical testing.
if (a > 0)
// Do Foo
else (if b < 0)
// Do Bar
else
// Do Foobar
If you select a and b randomly in
int range, you exercise Foo 50% of
the time, Bar 25% of the time and
Foobar 25% of the time. It is likely
that you will find more bugs in Foo
than in Bar or Foobar.
If you select a such that it is
negative 66.66% of the time, Bar and
Foobar get exercised more than with
your first distribution. Indeed the
three branches get exercised each
33.33% of the time.
Of course, if your observed outcome is
different than your expected outcome,
you have to log everything that can be
useful to reproduce the bug.
Random testing has the huge advantage that individual tests can be generated for extremely low cost. This is true even if you only have a partial oracle (for example, does the software crash?)
In a complex system, random testing will find bugs that are difficult to find by any other means. Think about what this means for security testing: even if you don't do random testing, the black hats will, and they will find bugs you missed.
A fascinating subfield of random testing is randomized differential testing, where two or more systems that are supposed to show the same behavior are stimulated with a common input. If their behavior differs, a bug (in one or both) has been found. This has been applied with great effect to testing of compilers, and invariably finds bugs in any compiler that has not been previously confronted with the technique. Even if you have only one compiler you can try it on different optimization settings to look for varying results, and of course crashes always mean bugs.