Testing with random inputs best practices - unit-testing

NOTE: I mention the next couple of paragraphs as background. If you just want a TL;DR, feel free to skip down to the numbered questions as they are only indirectly related to this info.
I'm currently writing a python script that does some stuff with POSIX dates (among other things). Unit testing these seems a little bit difficult though, since there's such a wide range of dates and times that can be encountered.
Of course, it's impractical for me to try to test every single date/time combination possible, so I think I'm going to try a unit test that randomizes the inputs and then reports what the inputs were if the test fails. Statistically speaking, I figure I can achieve a bit more completeness of testing than I could by trying to think of all potential problem areas (and missing some) or by testing all cases (sheer infeasibility), assuming that I run it enough times.
So here are a few questions (mainly indirectly related to the above):
What types of code are good candidates for randomized testing? What types of code aren't?
How do I go about determining the number of times to run the code with randomized inputs? I ask this because I want to have a large enough sample to determine any bugs, but don't want to wait a week to get my results.
Are these kinds of tests well suited for unit tests, or is there another kind of test that it works well with?
Are there any other best practices for doing this kind of thing?
Related topics:
Random data in unit tests?

I agree with Federico - randomised testing is counterproductive. If a test won't reliably pass or fail, it's very hard to fix it and know it's fixed. (This is also a problem when you introduce an unreliable dependency, of course.)
Instead, however, you might like to make sure you've got good data coverage in other ways. For instance:
Make sure you have tests for the start, middle and end of every month of every year between 1900 and 2100 (if those are suitable for your code, of course).
Use a variety of cultures, or "all of them" if that's known.
Try "day 0" and "one day after the end of each month" etc.
In short, still try a lot of values, but do so programmatically and repeatably. You don't need every value you try to be a literal in a test - it's fine to loop round all known values for one axis of your testing, etc.
You'll never get complete coverage, but it will at least be repeatable.
EDIT: I'm sure there are places where random tests are useful, although probably not for unit tests. However, in this case I'd like to suggest something: use one RNG to create a random but known seed, and then seed a new RNG with that value - and log it. That way if something interesting happens you will be able to reproduce it by starting an RNG with the logged seed.
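In Python, that idea is only a few lines; a minimal sketch (the names here are illustrative):

import logging
import random

def make_logged_rng():
    # Use one RNG to pick a seed, log it, then hand back a fresh RNG seeded with it.
    seed = random.randrange(2**32)
    logging.info("randomized-test seed: %d", seed)
    return random.Random(seed)

rng = make_logged_rng()
timestamp = rng.randint(0, 2**31 - 1)  # e.g. a random POSIX timestamp to exercise

If something interesting happens, re-seeding random.Random with the logged value reproduces the exact same sequence of inputs.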

With respect to the 3rd question, in my opinion random tests are not well suited for unit testing. If applied to the same piece of code, a unit test should succeed always, or fail always (i.e., wrong behavior due to bugs should be reproducible). You could however use random techniques to generate a large data set, then use that data set within your unit tests; there's nothing wrong with it.

Wow, great question! Some thoughts:
Random testing is always a good confidence building activity, though as you mentioned, it's best suited to certain types of code.
It's an excellent way to stress-test any code whose performance may be related to the number of times it's been executed, or to the sequence of inputs.
For fairly simple code, or code that expects a limited type of input, I'd prefer systematic tests that explicitly cover all of the likely cases, samples of each unlikely or pathological case, and all the boundary conditions.

Q1) I found that distributed systems with lots of concurrency are good candidates for randomized testing. It is hard to create all possible scenarios for such applications, but random testing can expose problems that you never thought about.
Q2) I guess you could try to use statistics to build a confidence interval around having discovered all "bugs". But the practical answer is: run your randomized tests as many times as you can afford.
Q3) I have found that randomized testing is useful, but only after you have written the normal battery of unit, integration and regression tests. You should integrate your randomized tests as part of the normal test suite, though probably as a small run. If nothing else, you avoid bit rot in the tests themselves, and get some modicum of coverage as the team runs the tests with different random inputs.
Q4) When writing randomized tests, make sure you save the random seed with the results of the tests. There is nothing more frustrating than finding that your random tests caught a bug, and not being able to run the test again with the same input. Make sure your test can also be executed with the saved seed.
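One hedged way to wire that up in Python, assuming an environment variable named TEST_SEED (the name is made up) is used to replay a failed run:

import os
import random

def get_test_rng():
    # Reuse a previously saved seed if TEST_SEED is set; otherwise pick and report a new one.
    seed = int(os.environ.get("TEST_SEED", random.randrange(2**32)))
    print("randomized test seed:", seed)  # save this alongside the test results
    return random.Random(seed)

A failing run can then be replayed with something like TEST_SEED=123456789 prefixed to however the tests are launched, without touching the test code.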

A few things:
With random testing, you can't really tell how good a piece of code is, but you can tell how bad it is.
Random testing is better suited for things that have random inputs -- a prime example is anything that's exposed to users. So, for example, something that randomly clicks & types all over your app (or OS) is a good test of general robustness.
Similarly, developers count as users. So something that randomly assembles a GUI from your framework is another good candidate.
Again, you're not going to find all the bugs this way -- what you're looking for is "if I do a million whacky things, do ANY of them result in system corruption?" If not, you can feel some level of confidence that your app/OS/SDK/whatever might hold up to a few days' exposure to users.
...But, more importantly, if your random-beater-upper test app can crash your app/OS/SDK in about 5 minutes, that's about how long you'll have until the first fire-drill if you try to ship that sucker.
Also note: REPRODUCIBILITY IS IMPORTANT IN TESTING! Hence, have your test-tool log the random-seed that it used, and have a parameter to start with the same seed. In addition, have it either start from a known "base state" (i.e., reinstall everything from an image on a server & start there) or some recreatable base-state (i.e., reinstall from that image, then alter it according to some random-seed that the test tool takes as a parameter.)
Of course, the developers will appreciate if the tool has nice things like "save state every 20,000 events" and "stop right before event #" and "step forward 1/10/100 events." This will greatly aid them in reproducing the problem, finding and fixing it.
As someone else pointed out, servers are another thing exposed to users. Get yourself a list of 1,000,000 URLs (grep from server logs), then feed them to your random number generator.
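A rough Python sketch of that kind of replay, using only the standard library (the log-derived file name and request count are made up):

import random
import urllib.request

with open("urls_from_access_log.txt") as f:   # hypothetical grep output, one URL per line
    urls = [line.strip() for line in f if line.strip()]

rng = random.Random(1234)                     # fixed seed keeps the "random" pounding repeatable
for url in rng.sample(urls, k=min(1000, len(urls))):
    try:
        urllib.request.urlopen(url, timeout=5)
    except Exception as exc:                  # log failures and keep pounding
        print(url, exc)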
And remember: "system went 24 hours of random pounding without errors" does not mean it's ready to ship, it just means it's stable enough to start some serious testing. Before it can do that, QA should feel free to say "look, your POS can't even last 24 hours under life-like random user simulation -- you fix that, I'm going to spend some time writing better tools."
Oh yeah, one last thing: in addition to the "pound it as fast & hard as you can" tests, have the ability to do "exactly what a real user [who was perhaps deranged, or a baby pounding the keyboard/mouse] would do." That is, if you're doing random user events, do them at the speed that a very fast typist or very fast mouse user could manage (with occasional delays, to simulate a SLOW person), in addition to "as fast as my program can spit out events." These are two very different types of tests, and will get very different reactions when bugs are found.

To make tests reproducible, simply use a fixed seed start value. That ensures the same data is used whenever the test runs. Tests will reliably pass or fail.
Good / bad candidates? Randomized tests are good at finding edge cases (exceptions). A problem is to define the correct result of a randomized input.
Determining the number of times to run the code: Simply try it out, if it takes too long reduce the iteration count. You may want to use a code coverage tool to find out what part of your application is actually tested.
Are these kinds of tests well suited for unit tests? Yes.

This might be slightly off-topic, but if you're using .net, there is Pex, which does something similar to randomized testing, but with more intuition by attempting to generate a "random" test case that exercises all of the paths through your code.

Here is my answer to a similar question: Is it a bad practice to randomly-generate test data?. Other answers may be useful as well.
Random testing is a bad practice as long as you don't have a solution for the oracle problem, i.e., determining which is the expected outcome of your software given its input.
If you solved the oracle problem, you can get one step further than simple random input generation. You can choose input distributions such that specific parts of your software get exercised more than with simple random.
You then switch from random testing to statistical testing.
if (a > 0)
    // Do Foo
else if (b < 0)
    // Do Bar
else
    // Do Foobar
If you select a and b randomly over the int range, you exercise Foo 50% of the time, Bar 25% of the time and Foobar 25% of the time. It is likely that you will find more bugs in Foo than in Bar or Foobar.
If you select a such that it is negative 66.66% of the time, Bar and Foobar get exercised more than with your first distribution. Indeed, the three branches each get exercised 33.33% of the time.
Of course, if your observed outcome is different from your expected outcome, you have to log everything that can be useful to reproduce the bug.
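For illustration, a minimal Python sketch of biasing the inputs for the snippet above (the two-thirds split matches the distribution described; the bounds are arbitrary):

import random

rng = random.Random(0)

def biased_a():
    # Negative two-thirds of the time, so Bar and Foobar each see roughly a third of the inputs.
    if rng.random() < 2 / 3:
        return rng.randint(-2**31, -1)
    return rng.randint(1, 2**31 - 1)

a, b = biased_a(), rng.randint(-2**31, 2**31 - 1)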

Random testing has the huge advantage that individual tests can be generated for extremely low cost. This is true even if you only have a partial oracle (for example, does the software crash?)
In a complex system, random testing will find bugs that are difficult to find by any other means. Think about what this means for security testing: even if you don't do random testing, the black hats will, and they will find bugs you missed.
A fascinating subfield of random testing is randomized differential testing, where two or more systems that are supposed to show the same behavior are stimulated with a common input. If their behavior differs, a bug (in one or both) has been found. This has been applied with great effect to testing of compilers, and invariably finds bugs in any compiler that has not been previously confronted with the technique. Even if you have only one compiler you can try it on different optimization settings to look for varying results, and of course crashes always mean bugs.
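As a toy Python sketch of the differential idea, a deliberately slow, obviously-correct implementation can serve as the oracle for the one you actually care about (both functions here are illustrative):

import random

def reference_sort(xs):
    # Obvious O(n^2) insertion sort, used only as the oracle.
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

def system_under_test(xs):
    return sorted(xs)  # stand-in for the "real" implementation being tested

rng = random.Random(42)
for _ in range(1000):
    xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 50))]
    assert system_under_test(xs) == reference_sort(xs), xs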

Related

Unit Testing Machine Learning Code

I am writing a fairly complicated machine learning program for my thesis in computer vision. It's working fairly well, but I need to keep trying out new things and adding new functionality. This is problematic because I sometimes introduce bugs when I am extending the code or trying to simplify an algorithm.
Clearly the correct thing to do is to add unit tests, but it is not clear how to do this. Many components of my program produce a somewhat subjective answer, and I cannot automate sanity checks.
For example, I had some code that approximated a curve with a lower-resolution curve, so that I could do computationally intensive work on the lower-resolution curve. I accidentally introduced a bug into this code, and only found it through a painstaking search when the results of my entire program got slightly worse.
But, when I tried to write a unit-test for it, it was unclear what I should do. If I make a simple curve that has a clearly correct lower-resolution version, then I'm not really testing out everything that could go wrong. If I make a simple curve and then perturb the points slightly, my code starts producing different answers, even though this particular piece of code really seems to work fine now.
You may not appreciate the irony, but basically what you have there is legacy code: a chunk of software without any unit tests. Naturally you don't know where to begin. So you may find it helpful to read up on handling legacy code.
The definitive thought on this is Michael Feathers' book, Working Effectively with Legacy Code. There used to be a helpful summary of that on the ObjectMentor site, but alas the website has gone the way of the company. However, WELC has left a legacy in reviews and other articles. Check them out (or just buy the book), although the key lessons are the ones which S.Lott and tvanfosson cover in their replies.
2019 update: I have fixed the link to the WELC summary with a version from the Wayback Machine web archive (thanks #milia).
Also - and despite knowing that answers which comprise mainly links to other sites are low quality answers :) - here is a link to a new (2019 new) Google tutorial on Testing and Debugging ML code. I hope this will be of illumination to future Seekers who stumble across this answer.
"then I'm not really testing out everything that could go wrong."
Correct.
The job of unit tests is not to test everything that could go wrong.
The job of unit tests is to test that what you have does the right thing, given specific inputs and specific expected results. The important part here is that specific, visible, external requirements are satisfied by specific test cases. Not that every possible thing that could go wrong is somehow prevented.
Nothing can test everything that could go wrong. You can write a proof, but you'll be hard-pressed to write tests for everything.
Choose your test cases wisely.
Further, the job of unit tests is to test that each small part of the overall application does the right thing -- in isolation.
Your "code that approximated a curve with a lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation. The integrated whole could also be tested to be sure that it works.
Your "computationally intensive work on the lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation.
The point of unit testing is to create small, correct units that are later assembled.
Without seeing your code, it's hard to tell, but I suspect that you are attempting to write tests at too high a level. You might want to think about breaking your methods down into smaller components that are deterministic and testing these. Then test the methods that use these methods by providing mock implementations that return predictable values from the underlying methods (which are probably located on a different object). Then you can write tests that cover the domain of the various methods, ensuring that you have coverage of the full range of possible outcomes. For the small methods you do so by providing values that represent the domain of inputs. For the methods that depend on these, by providing mock implementations that return the range of outcomes from the dependencies.
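A hedged sketch of that mock-based decomposition with Python's unittest.mock (the class and method names are invented for illustration):

from unittest import mock

class Analyzer:
    # Hypothetical class: summarize() builds on the smaller downsample() step.
    def downsample(self, curve):
        ...  # the numerically interesting part, tested separately in its own unit tests

    def summarize(self, curve):
        low_res = self.downsample(curve)
        return sum(low_res) / len(low_res)

def test_summarize_uses_downsampled_curve():
    analyzer = Analyzer()
    # Replace the dependency with a predictable value so only summarize() is under test.
    with mock.patch.object(analyzer, "downsample", return_value=[0.0, 1.0, 2.0]):
        assert analyzer.summarize([9.9, 8.8, 7.7]) == 1.0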
Your unit tests need to employ some kind of fuzz factor, either by accepting approximations, or using some kind of probabilistic checks.
For example, if you have some function that returns a floating point result, it is almost impossible to write a test that works correctly across all platforms. Your checks would need to perform the approximation.
TEST_ALMOST_EQ(result, 4.0);
Above TEST_ALMOST_EQ might verify that result is between 3.9 and 4.1 (for example).
Alternatively, if your machine learning algorithms are probabilistic, your tests will need to accommodate this by taking the average of multiple runs and expecting it to be within some range.
x = 0;
for (100 times) {
    x += result_probabilistic_test();
}
avg = x / 100;
TEST_RANGE(avg, 10.0, 15.0);
Of course, the tests are non-deterministic, so you will need to tune them such that you get non-flaky tests with high probability (e.g., increase the number of trials, or increase the range of error).
You can also use mocks for this (e.g., a mock random number generator for your probabilistic algorithms), and they usually help for deterministically testing specific code paths, but they are a lot of effort to maintain. Ideally, you would use a combination of fuzzy testing and mocks.
HTH.
Generally, for statistical measures you would build in an epsilon for your answer, i.e. the mean squared difference of your points would be < 0.01 or some such. Another option is to run several times, and if it fails "too often" then you have an issue.
Get an appropriate test dataset (maybe a subset of what you're usually using)
Calculate some metric on this dataset (e.g. the accuracy)
Note down the value obtained (cross-validated)
This should give an indication of what to set the threshold for
Of course it can happen that when you make changes to your code the performance on the dataset increases a little, but if it ever decreases by a large amount, that would be an indication that something is going wrong.
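A minimal, self-contained sketch of that recipe in Python (the tiny dataset and the 0.75 threshold are purely illustrative stand-ins for your fixed test subset and your noted-down value):

def accuracy(predict, dataset):
    # Fraction of (input, label) pairs where the prediction matches the label.
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

def test_accuracy_has_not_regressed():
    # Fixed, checked-in subset of the usual data; in practice this would be much larger.
    dataset = [(0, 0), (1, 1), (2, 0), (3, 1)]
    predict = lambda x: x % 2  # stand-in for the real model
    # Threshold noted down from earlier cross-validated runs, minus a small tolerance.
    assert accuracy(predict, dataset) >= 0.75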

Using randomness and/or iterations in unit tests?

In unit tests, I have become used to testing methods by applying some regular values, some values offending the method contract, and all the border cases I can come up with.
But is it very bad practice to
test on random values, i.e. a value within a range you think should never give any trouble, so that each time the test runs a different value is passed in? As a kind of extensive testing of regular values?
test on whole ranges, using iteration?
I have a feeling neither of these approaches is any good. With range testing I can imagine it's just not practical, given the time it takes, but what about randomness?
UPDATE :
I'm not using this technique myself, I was just wondering about it. Randomness can be a good tool, I know now, if you can make it reproducible when you need to.
The most interesting reply was the 'fuzzing' tip from Lieven :
http://en.wikipedia.org/wiki/Fuzz_testing
tx
Unit tests need to be fast. If they aren't, people won't run them regularly. At times I did code for checking the whole range but @Ignore'd or commented it out in the end because it made the tests too slow. If I were to use random values, I would go for a PRNG with fixed seeds so that every run actually checks the same numbers.
Random input - the tests would not be repeatable (i.e., produce consistent results every time they are run) and hence are not considered good unit tests. Tests should not change their mind.
Range tests / RowTests - are good as long as they don't slow down the test suite run. Each test should run as fast as possible (a done-in-30-sec test suite gets run more often than a 10-minute one) - preferably 100 ms or less. That said, each input (test data) should be 'representative' input. If all input values are the same, testing each one isn't adding any value and is just routine number crunching. You just need one representative from that set of values. You also need representatives for boundary conditions and 'special' values.
For more on guidelines or rules of thumb, see 'What makes a Good Unit Test?'
That said, the techniques you mentioned could be great for finding representative inputs. So use them to find scenarioX where the code fails or succeeds incorrectly - then write up a repeatable, quick, tests-one-thing-only unit test for that scenarioX and add it to your test suite. If you find that these tools continue to help you find more good test cases, persist with them.
Response to OP's clarification:
If you use the same seed value (test input) for your random number generator on each test run, your test is not random - the values can be predetermined. However, a unit test ideally shouldn't need any input/output - that is why xUnit test cases have the void TC() signature.
If you use different seed values on each run, now your tests are random and not repeatable. Of course you can hunt down the special seed value in your log files to know what failed (and reproduce the error) but I like my tests to instantly let me know what failed - e.g. a Red TestConversionForEnums() lets me know that the Enum Conversion code is broken without any inspection.
Repeatable - implies that each time the test is run on the SUT, it produces the same result (pass/fail), not 'Can I reproduce the test failure again?' (Repeatable != Reproducible). To reiterate, this kind of exploratory testing may be good for identifying more test cases, but I wouldn't add it to the test suite that I run each time I make a code change during the day. I'd recommend doing exploratory testing manually; find some good (some may say sadistic) testers that'll go hammer and tongs at your code - they will find you more test cases than a random input generator.
I have been using randomness in my test cases. It found some errors in the SUT, and it also surfaced some errors in my test cases themselves.
Note that test cases get more complex when using randomness.
You'll need a method to run your testcase with the random value(s) it failed on
You'll need to log the random values used for every test.
...
All in all, I'm throttling back on using randomness but not dismissing it entirely. As with every technique, it has its value.
For a better explanation of what you are after, look up the term fuzzing
What you describe is usually called specification-based testing and has been implemented by frameworks such as QuickCheck (Haskell), scalacheck (Scala) and Quviq QuickCheck (Erlang).
Data-based testing tools (such as DataProvider in TestNG) can achieve similar results.
The underlying principle is to generate input data for the subject under test based upon some sort of specification and is far from "bad practice".
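In Python, the Hypothesis library plays a similar role; a minimal sketch, where is_prime() is just a placeholder for whatever is under test:

from hypothesis import given, strategies as st

def is_prime(n):
    # Placeholder implementation so the property below is runnable.
    return n >= 2 and all(n % d for d in range(2, int(n**0.5) + 1))

@given(st.integers(min_value=2, max_value=10_000))
def test_primes_have_no_proper_divisors(n):
    # Property: if is_prime says n is prime, no d in [2, n) divides it.
    if is_prime(n):
        assert all(n % d != 0 for d in range(2, n))

Hypothesis generates inputs from the declared strategy and shrinks any failing example to a minimal one, which addresses the reproducibility concern raised elsewhere in this thread.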
What are you testing? The random number generator? Or your code?
If your code, what if there is a bug in the code that produces random numbers?
What if you need to reproduce a problem, do you keep restarting the test hoping that it will eventually use the same sequence as you had when you discovered the problem?
If you decide to use a random number generator to produce data, at least seed it with a known constant value, so it's easy to reproduce.
In other words, your "random numbers" are just a "sequence of numbers I really don't care all that much about".
So long as it will tell you in some way what random value it failed on, I don't suppose it's that bad. However, you're then almost relying on luck to find a problem in your application.
Testing the whole range will ensure you have every avenue covered but it seems like overkill when you have the edges covered and, I assume, a few middle-ground accepted values.
The goal of unit-testing is to get confidence in your code. Therefore, if you feel that using random values could help you find some more bugs, you obviously need more tests to increase your confidence level.
In that situation, you could rely on iteration-based testing to identify those problems.
I'd recommend creating new specific tests for the cases discovered with the loop testing, and removing the iteration-based tests then; so that they don't slow down your tests.
I have used randomness for debugging a field problem with a state machine leaking a resource. We code inspected, ran the unit-tests and couldn't reproduce the leak.
We fed random events from the entire possible event space into the state machine unit test environment. We looked at the invariants after each event and stopped when they were violated.
The random events eventually exposed a sequence of events that produced a leak.
The state machine leaked a resource when a 2nd error occurred while recovering from a first error.
We were then able to reproduce the leak in the field.
So randomness found a problem that was difficult to find otherwise. A little brute force but the computer didn't mind working the weekend.
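In outline, that kind of harness can be very small; a toy Python sketch (the state machine and its invariant are stand-ins for the real ones):

import random

class ToyStateMachine:
    # Stand-in for the real state machine: tracks a pooled resource.
    def __init__(self):
        self.in_use = 0

    def handle(self, event):
        if event == "acquire" and self.in_use < 10:
            self.in_use += 1
        elif event == "release" and self.in_use > 0:
            self.in_use -= 1

    def invariant_holds(self):
        return 0 <= self.in_use <= 10  # whatever "no leak" means for the real system

rng = random.Random(2024)  # logged seed so a failing event sequence can be replayed
machine, history = ToyStateMachine(), []
for _ in range(100_000):
    event = rng.choice(["acquire", "release", "error"])
    history.append(event)
    machine.handle(event)
    if not machine.invariant_holds():
        raise AssertionError(f"invariant violated after {len(history)} events: {history[-5:]}")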
I wouldn't advocate completely random values as it will give you a false sense of security. If you can't go through the entire range (which is often the case) it's far more efficient to select a subset by hand. This way you will also have to think of possible "odd" values, values that causes the code to run differently (and are not near edges).
You could use a random generator to generate the test values, check that they represent a good sample and then use them. This is a good idea especially if choosing by hand would be too time-consuming.
I did use random test values when I wrote a semaphore driver to use for a hw block from two different chips. In this case I couldn't figure out how to choose meaningful values for the timings so I randomized how often the chips would (independently) try to access the block. In retrospect it would still have been better to choose them by hand, because getting the test environment to work in such a way that the two chips didn't align themselves was not as simple as I thought. This was actually a very good example of when random values do not create a random sample.
The problem was caused by the fact that whenever one chip had reserved the block, the other waited and, true to a semaphore, got access right after the first released it. When I plotted how long the chips had to wait for access, the values were in fact far from random. Worst was when I had the same value range for both random values; it got slightly better after I changed them to have different ranges, but it still wasn't very random. I started getting something of a random test only after I randomized both the waiting times between accesses and how long the block was reserved, and chose the four sets carefully.
In the end I probably ended up using more time writing the code to use "random" values than I would have used to pick meaningful values by hand in the first place.
See David Saff's work on Theory-Based Testing.
Generally I'd avoid randomness in unit tests, but the theory stuff is intriguing.
The 'key' point here is unit test. A slew of random values in the expected range, as well as at the edges for the good case and outside the range/boundary for the bad case, is valuable in a regression test, provided the seed is constant.
A unit test may use random values in the expected range, if it is possible to always save the inputs/outputs (if any) before and after.

Testing When Correctness is Poorly Defined?

I generally try to use unit tests for any code that has easily defined correct behavior given some reasonably small, well-defined set of inputs. This works quite well for catching bugs, and I do it all the time in my personal library of generic functions.
However, a lot of the code I write is data mining code that basically looks for significant patterns in large datasets. Correct behavior in this case is often not well defined and depends on a lot of different inputs in ways that are not easy for a human to predict (i.e. the math can't reasonably be done by hand, which is why I'm using a computer to solve the problem in the first place). These inputs can be very complex, to the point where coming up with a reasonable test case is near impossible. Identifying the edge cases that are worth testing is extremely difficult. Sometimes the algorithm isn't even deterministic.
Usually, I do the best I can by using asserts for sanity checks and creating a small toy test case with a known pattern and informally seeing if the answer at least "looks reasonable", without it necessarily being objectively correct. Is there any better way to test these kinds of cases?
I think you just need to write unit tests based on small sets of data that will make sure that your code is doing exactly what you want it to do. Whether this gives you a reasonable data-mining algorithm is a separate issue, and I don't think it is possible to solve it by unit tests. There are two "levels" of correctness of your code:
Your code is correctly implementing the given data mining algorithm (this thing you should unit-test)
The data mining algorithm you implement is "correct" - solves the business problem. This is a quite open question, it probably depends both on some parameters of your algorithm as well as on the actual data (different algorithms work for different types of data).
When facing cases like this I tend to build one or more stub data sets that reflect the proper underlying complexities of the real-life data. I often do this together with the customer, to make sure I capture the essence of the complexities.
Then I can just codify these into one or more datasets that can be used as basis for making very specific unit tests (sometimes they're more like integration tests with stub data, but I don't think that's an important distinction). So while your algorithm may have "fuzzy" results for a "generic" dataset, these algorithms almost always have a single correct answer for a specific dataset.
Well, there are a few answers.
First of all, as you mentioned, take a small case study, and do the math by hand. Since you wrote the algorithm, you know what it's supposed to do, so you can do it in a limited case.
The other one is to break down every component of your program into testable parts.
If A calls B, which calls C, which calls D, and you know that A, B, C, and D all give the right answer, then by testing A->B, B->C, and C->D you can be reasonably sure that A->D is giving the correct response.
Also, if there are other programs out there that do what you are looking to do, try and acquire their datasets. Or find an open-source project whose test data you could run against, and see if your application gives similar results.
Another way to test datamining code is by taking a test set, and then introducing a pattern of the type you're looking for, and then test again, to see if it will separate out the new pattern from the old ones.
And, the tried and true, walk through your own code by hand and see if the code is doing what you meant it to do.
Really, the challenge here is this: because your application is meant to do a fuzzy, non-deterministic kind of task in a smart way, the very goal you hope to achieve is that the application becomes better than human beings at finding these patterns. That's great, powerful, and cool ... but if you pull it off, then it becomes very hard for any human beings to say, "In this case, the answer should be X."
In fact, ideally the computer would say, "Not really. I see why you think that, but consider these 4.2 terabytes of information over here. Have you read them yet? Based on those, I would argue that the answer should be Z."
And if you really succeeded in your original goal, the end user might sometimes say, "Zowie, you're right. That is a better answer. You found a pattern that is going to make us money! (or save us money, or whatever)."
If such a thing could never happen, then why are you asking the computer to detect these kinds of patterns in the first place?
So, the best thing I can think of is to let real life help you build up a list of test scenarios. If there ever was a pattern discovered in the past that did turn out to be valuable, then make a "unit test" that sees if your system discovers it when given similar data. I say "unit test" in quotes because it may be more like an integration test, but you may still choose to use NUnit or VS.Net or RSpec or whatever unit test tools you're using.
For some of these tests, you might somehow try to "mock" the 4.2 terabytes of data (you won't really mock the data, but at some higher level you'd mock some of the conclusions reached from that data). For others, maybe you have a "test database" with some data in it, from which you expect a set of patterns to be detected.
Also, if you can do it, it would be great if the system could "describe its reasoning" behind the patterns it detects. This would let the business user deliberate over the question of whether the application was right or not.
This is tricky. This sounds similar to writing tests around our text search engine. If you keep struggling, you'll figure something out:
Start with a small, simplified but reasonably representative data sample, and test basic behavior against it
Rather than asserting that the output is exactly some answer, sometimes it's better to figure out what is important about it. For example, for our search engine, I didn't care so much about the exact order the documents were listed, as long as the three key ones were on the first page of results.
As you make a small, incremental change, figure out what the essence of it is and write a test for that. Even though the overall calculations take many inputs, individual changes to the codebase should be isolatable. For example, we found certain documents weren't being surfaced because of the presence of hyphens in some of the key words. We created tests verifying that this behaved how we expected.
Look at tools like FitNesse, which allow you to throw a large number of datasets at a piece of code and assert things about the results. This may be easier to understand than more traditional unit tests.
I've gone back to the product owner, saying "I can't understand how this will work. How will we know if it's right?" Maybe s/he can articulate the essence of the vaguely defined problem. This has worked really well for me many times, and I've talked people out of features because they couldn't be explained.
Be creative!
Ultimately, you have to decide what your program should be doing, and then test for that.

How to decide test cases for unit tests?

I'm just getting into unit testing, and have written some short tests to check whether a function called isPrime() works correctly.
I've got a test that checks that the function works, and have some test data in the form of some numbers and the expected return value.
How many should I test? How do I decide on which to test? What's the best-practices here?
One approach would be to generate 1000 primes, then loop through them all, another would be to just select 4 or 5 and test them. What's the correct thing to do?
I've also been informed that every time a bug is found, you should write a test to verify that it is fixed. It seems reasonable to me, anyway.
You'd want to check edge cases. How big a prime number is your method supposed to be able to handle? This will depend on what representation (type) you used. If you're only interested in small (really relative term when used in number theory) primes, you're probably using int or long. Test a handful of the biggest primes you can in the representation you've chosen. Make sure you check some non-prime numbers too. (These are much easier to verify independently.)
Naturally, you'll also want to test a few small numbers (primes and non-primes) and a few in the middle of the range. A handful of each should be plenty. Also make sure you throw an exception (or return an error code, whichever is your preference) for numbers that are out of range of your valid inputs.
in general, test as many cases as you need to feel comfortable/confident
also in general, test the base/zero case, the maximum case, and at least one median/middle case
also test expected-exception cases, if applicable
if you're unsure of your prime algorithm, then by all means test it with the first 1000 primes or so, to gain confidence
"Beware of bugs. I have proven the above algorithm correct, but have not tested it yet."
Some people don't understand the above quote (paraphrase?), but it makes perfect sense when you think about it. Tests will never prove an algorithm correct, they only help to indicate whether you've coded it right. Write tests for mistakes you expect might appear and for boundary conditions to achieve good code coverage. Don't just be picking values out of the blue to see if they work, because that might lead to lots of tests which all test exactly the same thing.
For your example, just hand-select a few primes and non-primes to test specific conditions in the implementation.
Ask yourself: what EXACTLY do I want to test, and test the most important things. Test to make sure it basically does what you are expecting it to do in the expected cases.
Testing all those nulls and edge cases is, I think, not realistic - it's too time-consuming, and someone needs to maintain that later!
And...your test code should be simple enough so that you do not need to test your test code!
If you want to check that your function correctly applies the algorithm and works in general, a few primes will probably be enough.
If you want to prove that the method for finding primes is CORRECT, 100000 primes will not be enough. But you don't want to test the latter (probably).
Only you know what you want to test!
PS I think using loops in unit tests is not always wrong but I would think twice before doing that. Test code should be VERY simple. What if something goes wrong and there is a bug in your test? However, you should try to avoid test code duplication as regular code duplication. Someone has to maintain test code.
A few questions, the answers may inform your decision:
How important is the correct functioning of this code?
Is the implementation of this code likely to be changed in the future? (if so, test more to support the future change)
Is the public contract of this code likely to change in the future? (if so, test less - to reduce the amount of throw-away test code)
How is the code coverage, are all branches visited?
Even if the code doesn't branch, are boundary considerations tested?
Do the tests run quickly?
Edit: Hmm, so to advise in your specific scenario. Since you started writing unit tests yesterday, you might not have the experience to decide among all these factors. Let me help you:
This code is probably not too important (no one dies, no one goes to war, no one is sued), so a smattering of tests will be fine.
The implementation probably won't change (prime number techniques are well known), so we don't need tests to support this. If the implementation does change, it's probably due to an observed failing value. That can be added as a new test at the time of change.
The public contract of this won't change.
Get 100% code coverage on this. There's no reason to write code that a test doesn't visit in this circumstance. You should be able to do this with a small number of tests.
If you care what the code does when zero is called, test that.
The small number of tests should run quickly. This will allow them to be run frequently (both by developers and by automation).
I would test 1, 2, 3, 21, 23, a "large" prime (5 digits), a "large" non-prime and 0 if you care what this does with 0.
To be really sure, you're going to have to test them all. :-)
Seriously though, for this kind of function you're probably using an established and proven algorithm. The main thing you need to do is verify that your code correctly implements the algorithm.
The other thing is to make sure you understand the limits of your number representation, whatever that is. At the very least, this will put an upper limit on the size the number you can test. E.g., if you use a 32-bit unsigned int, you're never going to be able to test values larger than 4G. Possibly your limit will be lower than that, depending on the details of your implementation.
Just as an example of something that could go wrong with an implementation:
A simple algorithm for testing primes is to try dividing the candidate by all known primes up to the square root of the candidate. The square root function will not necessarily give an exact result, so to be safe you should go a bit past that. How far past would depend on specifically how the square root function is implemented and how much it could be off.
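A hedged Python sketch of that algorithm; math.isqrt gives an exact integer square root, but the explicit d * d > n check is kept to mirror the "go a bit past it" advice:

import math

def is_prime(n):
    if n < 2:
        return False
    # Trial division up to (and one past) the integer square root of n.
    for d in range(2, math.isqrt(n) + 2):
        if d * d > n:      # stop once the divisor has clearly passed sqrt(n)
            break
        if n % d == 0:
            return False
    return True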
Another note on testing: In addition to testing known primes to see if your function correctly identifies them as prime, also test known composite numbers to make sure you're not getting "false positives." To make sure you get that square root function thing right, pick some composite numbers that have a prime factor as close as possible to their square root.
Also, consider how you're going to "generate" your list of primes for testing. Can you trust that list to be correct? How were those numbers tested, and by whom?
You might consider coding two functions and testing them against each other. One could be a simple but slow algorithm that you can be more sure of having coded correctly, and the other a faster but more complex one that you really want to use in your app, but is more likely to have a coding mistake.
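A short sketch of that cross-check, assuming a deliberately naive trial-division reference and a hypothetical "clever" 6k±1 implementation as the one you actually want to ship:

def slow_is_prime(n):
    # Simple and easy to get right: trial-divide by everything below n.
    return n >= 2 and all(n % d for d in range(2, n))

def fast_is_prime(n):
    # The faster, more complex implementation that is more likely to hide a mistake.
    if n < 2:
        return False
    if n < 4:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    d = 5
    while d * d <= n:
        if n % d == 0 or n % (d + 2) == 0:
            return False
        d += 6
    return True

def test_fast_matches_slow():
    for n in range(10_000):
        assert fast_is_prime(n) == slow_is_prime(n), n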
"If it's worth building, it's worth testing"
"If it's not worth testing, why are you wasting your time working on it?"
I'm guessing you didn't subscribe to test first, where you write the test before you write the code?
Not to worry, I'm sure you are not alone.
As has been said above, test the edges, great place to start. Also must test bad cases, if you only test what you know works, you can be sure that a bad case will happen in production at the worst time.
Oh and "I've just found the last bug" - HA HA.

Random data in Unit Tests?

I have a coworker who writes unit tests for objects which fill their fields with random data. His reason is that it gives a wider range of testing, since it will test a lot of different values, whereas a normal test only uses a single static value.
I've given him a number of different reasons against this, the main ones being:
random values means the test isn't truly repeatable (which also means that if the test can randomly fail, it can do so on the build server and break the build)
if it's a random value and the test fails, we need to a) fix the object and b) force ourselves to test for that value every time, so we know it works, but since it's random we don't know what the value was
Another coworker added:
If I am testing an exception, random values will not ensure that the test ends up in the expected state
random data is used for flushing out a system and load testing, not for unit tests
Can anyone else add additional reasons I can give him to get him to stop doing this?
(Or alternately, is this an acceptable method of writing unit tests, and I and my other coworker are wrong?)
There's a compromise. Your coworker is actually onto something, but I think he's doing it wrong. I'm not sure that totally random testing is very useful, but it's certainly not invalid.
A program (or unit) specification is a hypothesis that there exists some program that meets it. The program itself is then evidence of that hypothesis. What unit testing ought to be is an attempt to provide counter-evidence to refute that the program works according to the spec.
Now, you can write the unit tests by hand, but it really is a mechanical task. It can be automated. All you have to do is write the spec, and a machine can generate lots and lots of unit tests that try to break your code.
I don't know what language you're using, but see here:
Java
http://functionaljava.org/
Scala (or Java)
http://github.com/rickynils/scalacheck
Haskell
http://www.cs.chalmers.se/~rjmh/QuickCheck/
.NET:
http://blogs.msdn.com/dsyme/archive/2008/08/09/fscheck-0-2.aspx
These tools will take your well-formed spec as input and automatically generate as many unit tests as you want, with automatically generated data. They use "shrinking" strategies (which you can tweak) to find the simplest possible test case to break your code and to make sure it covers the edge cases well.
Happy testing!
This kind of testing is called a Monkey test. When done right, it can smoke out bugs from the really dark corners.
To address your concerns about reproducibility: the right way to approach this, is to record the failed test entries, generate a unit test, which probes for the entire family of the specific bug; and include in the unit test the one specific input (from the random data) which caused the initial failure.
There is a half-way house here which has some use, which is to seed your PRNG with a constant. That allows you to generate 'random' data which is repeatable.
Personally I do think there are places where (constant) random data is useful in testing - after you think you've done all your carefully-thought-out corners, using stimuli from a PRNG can sometimes find other things.
I am in favor of random tests, and I write them. However, whether they are appropriate in a particular build environment and which test suites they should be included in is a more nuanced question.
Run locally (e.g., overnight on your dev box), randomized tests have found bugs both obvious and obscure. The obscure ones are arcane enough that I think random testing was really the only realistic way to flush them out. As a test, I took one tough-to-find bug discovered via randomized testing and had a half dozen crack developers review the function (about a dozen lines of code) where it occurred. None were able to detect it.
Many of your arguments against randomized data are flavors of "the test isn't reproducible". However, a well-written randomized test will capture the seed used to start the randomized run and output it on failure. In addition to allowing you to repeat the test by hand, this allows you to trivially create new tests which test the specific issue by hardcoding the seed for that test. Of course, it's probably nicer to hand-code an explicit test covering that case, but laziness has its virtues, and this even allows you to essentially auto-generate new test cases from a failing seed.
The one point you make that I can't debate, however, is that it breaks the build systems. Most build and continuous integration tests expect the tests to do the same thing, every time. So a test that randomly fails will create chaos, randomly failing and pointing the fingers at changes that were harmless.
A solution then, is to still run your randomized tests as part of the build and CI tests, but run it with a fixed seed, for a fixed number of iterations. Hence the test always does the same thing, but still explores a bunch of the input space (if you run it for multiple iterations).
Locally, e.g., when changing the concerned class, you are free to run it for more iterations or with other seeds. If randomized testing ever becomes more popular, you could even imagine a specific suite of tests which are known to be random, which could be run with different seeds (hence with increasing coverage over time), and where failures wouldn't mean the same thing as deterministic CI systems (i.e., runs aren't associated 1:1 with code changes and so you don't point a finger at a particular change when things fail).
There is a lot to be said for randomized tests, especially well written ones, so don't be too quick to dismiss them!
If you are doing TDD then I would argue that random data is an excellent approach. If your test is written with constants, then you can only guarantee your code works for the specific value. If your test is randomly failing the build server there is likely a problem with how the test was written.
Random data will help ensure any future refactoring will not rely on a magic constant. After all, if your tests are your documentation, then doesn't the presence of constants imply it only needs to work for those constants?
I am exaggerating, of course; I prefer to inject random data into my test as a sign that "the value of this variable should not affect the outcome of this test".
I will say, though, that if you use a random variable and then fork your test based on that variable, that is a smell.
In the book Beautiful Code, there is a chapter called "Beautiful Tests", where the author goes through a testing strategy for the binary search algorithm. One paragraph is called "Random Acts of Testing", in which he creates random arrays to thoroughly test the algorithm. You can read some of this online at Google Books, page 95, but it's a great book worth having.
So basically this just shows that generating random data for testing is a viable option.
Your co-worker is doing fuzz-testing, although he doesn't know about it. They are especially valuable in server systems.
One advantage for someone looking at the tests is that arbitrary data is clearly not important. I've seen too many tests that involved dozens of pieces of data and it can be difficult to tell what needs to be that way and what just happens to be that way. E.g. If an address validation method is tested with a specific zip code and all other data is random then you can be pretty sure the zip code is the only important part.
if it's a random value and the test fails, we need to a) fix the object and b) force ourselves to test for that value every time, so we know it works, but since it's random we don't know what the value was
If your test case does not accurately record what it is testing, perhaps you need to recode the test case. I always want to have logs that I can refer back to for test cases so that I know exactly what caused it to fail whether using static or random data.
You should ask yourselves what is the goal of your test.
Unit tests are about verifying logic, code flow and object interactions. Using random values tries to achieve a different goal and thus reduces test focus and simplicity. It is acceptable for readability reasons (generating UUIDs, ids, keys, etc.).
Specifically for unit tests, I cannot recall even one case where this method was successful in finding problems. But I have seen many determinism problems (in the tests) caused by trying to be clever with random values, mainly with random dates.
Fuzz testing is a valid approach for integration tests and end-to-end tests.
Can you generate some random data once (I mean exactly once, not once per test run), then use it in all tests thereafter?
I can definitely see the value in creating random data to test those cases that you haven't thought of, but you're right, having unit tests that can randomly pass or fail is a bad thing.
If you're using random input for your tests you need to log the inputs so you can see what the values are. This way if there is some edge case you come across, you can write the test to reproduce it. I've heard the same reasons from people for not using random input, but once you have insight into the actual values used for a particular test run then it isn't as much of an issue.
The notion of "arbitrary" data is also very useful as a way of signifying something that is not important. We have some acceptance tests that come to mind where there is a lot of noise data that is no relevance to the test at hand.
I think the problem here is that the purpose of unit tests is not catching bugs. The purpose is being able to change the code without breaking it. So how are you going to know that you have broken your code when your random unit tests are green in your pipeline, just because they didn't touch the right path?
Doing this seems insane to me. A different situation would be running them as integration tests or e2e tests, not as part of the build, and only for some specific things, because in some situations you will need a mirror of your code in your asserts to test that way.
And having a test suite as complex as your real code is like not having tests at all, because who is going to test your suite then? :p
A unit test is there to ensure the correct behaviour in response to particular inputs, in particular all code paths/logic should be covered. There is no need to use random data to achieve this. If you don't have 100% code coverage with your unit tests, then fuzz testing by the back door is not going to achieve this, and it may even mean you occasionally don't achieve your desired code coverage. It may (pardon the pun) give you a 'fuzzy' feeling that you're getting to more code paths, but there may not be much science behind this. People often check code coverage when they run their unit tests for the first time and then forget about it (unless enforced by CI), so do you really want to be checking coverage against every run as a result of using random input data? It's just another thing to potentially neglect.
Also, programmers tend to take the easy path, and they make mistakes. They make just as many mistakes in unit tests as they do in the code under test. It's way too easy for someone to introduce random data, and then tailor the asserts to the output order in a single run. Admit it, we've all done this. When the data changes, the order can change and the asserts fail, so a portion of the executions fail. This portion needn't be 1/2; I've seen exactly this result in failures 10% of the time. It takes a long time to track down problems like this, and if your CI doesn't record enough data about enough of the runs, then it can be even worse.
Whilst there's an argument for saying "just don't make these mistakes", in a typical commercial programming setup there'll be a mix of abilities, sometimes relatively junior people reviewing code for other junior people. You can write literally dozens more tests in the time it takes to debug one non-deterministic test and fix it, so make sure you don't have any. Don't use random data.
In my experience unit tests and randomized tests should be separated. Unit tests serve to give a certainty of the correctness of some cases, not only to catch obscure bugs.
All that said, randomized testing is useful and should be done, separately from unit tests, but it should test a series of randomized values.
I can't help thinking that testing one random value with every run is just not enough, neither as a sufficient randomized test nor as a truly useful unit test.
Another aspect is validating the test results. If you have random inputs, you have to calculate the expected output for it inside the test. This will at some level duplicate the tested logic, making the test only a mirror of the tested code itself. This will not sufficiently test the code, since the test might contain the same errors the original code does.
This is an old question, but I wanted to mention a library I created that generates objects filled with random data. It supports reproducing the same data if a test fails by supplying a seed. It also supports JUnit 5 via an extension.
Example usage:
Person person = Instancio.create(Person.class);
Or a builder API for customising generation parameters:
Person person = Instancio.of(Person.class)
    .generate(field("age"), gen -> gen.ints.min(18).max(65))
    .create();
Github link has more examples: https://github.com/instancio/instancio
You can find the library on maven central:
<dependency>
    <groupId>org.instancio</groupId>
    <artifactId>instancio-junit</artifactId>
    <version>LATEST</version>
</dependency>
Depending on your object/app, random data would have a place in load testing. I think more important would be to use data that explicitly tests the boundary conditions of the data.
We just ran into this today. I wanted pseudo-random data (so it would look like compressed audio data in terms of size). I TODO'd that I also wanted it deterministic. rand() was different on OSX than on Linux. And unless I re-seeded, it could change at any time. So we changed it to be deterministic but still pseudo-random: the test is repeatable, as much as using canned data (but more conveniently written).
This was NOT testing by some random brute force through code paths. That's the difference: still deterministic, still repeatable, still using data that looks like real input to run a set of interesting checks on edge cases in complex logic. Still unit tests.
Does that still qualify as random? Let's talk over beer. :-)
I can envisage three solutions to the test data problem:
Test with fixed data
Test with random data
Generate random data once, then use it as your fixed data
I would recommend doing all of the above. That is, write repeatable unit tests with both some edge cases worked out using your brain, and some randomised data which you generate only once. Then write a set of randomised tests that you run as well.
The randomised tests should never be expected to catch something your repeatable tests miss. You should aim to cover everything with repeatable tests, and consider the randomised tests a bonus. If they find something, it should be something that you couldn't have reasonably predicted; a real oddball.
How can your guy run the test again after it has failed, to see if he has fixed it? I.e., he loses repeatability of tests.
While I think there is probably some value in flinging a load of random data at tests, as mentioned in other replies it falls more under the heading of load testing than anything else. It is pretty much a "testing-by-hope" practice. I think that, in reality, your guy is simply not thinking about what he is trying to test, and making up for that lack of thought by hoping randomness will eventually trap some mysterious error.
So the argument I would use with him is that he is being lazy. Or, to put it another way, if he doesn't take the time to understand what he is trying to test it probably shows he doesn't really understand the code he is writing.