I would like to try out test-driven development, but the project I am working on involves a lot of randomness and I am very unsure about how I can test it. Here is a toy example of the kind of algorithm I may want to write:
Write a function taking no argument and returning a list of random integers satisfying the following properties
Each integer is between 0 and 10
The same number doesn’t appear twice
The list is of length 3 90% of the time, and of length 4 10% of the time
There is a 50% chance for the number 3 to appear
I do not need to test exact statistical distribution, but obviously I would like tests that will fail if someone completely removes the corresponding code.
I am using an external RNG that you can assume is correct, and I am pretty free in how to structure the code, so I can use dependency injection to have tests use a fake RNG instead, but I still don’t really see how that would help. For instance even if I always use the same seed for the tests, as soon as I refactor the algorithm to pick random numbers in a different order, all the tests become meaningless.
I guess that the first two points could be tested by generating many cases and checking that the constraints are satisfied, but that doesn’t really feel like TDD.
For the last two points, I’m thinking of having tests with different configurations, where for instance the 90% is either 100% or 0%, and then I can test if the length of the list is indeed 3 or 4. I guess it would work, but it seems maybe a bit weak.
Are there any guidelines or other techniques to use when applying TDD to algorithms involving randomness?
There are several ways you can go about a problem like this, and I may add another answer in the future, but the approach that I immediately found most compelling would be to combine test-driven development (TDD) with property-based testing.
You can do this in many languages, with various frameworks. Here, I'm going to use the original property-based testing library, QuickCheck.
The first two requirements translate directly to predicates that QuickCheck can exercise. The latter two translate into distribution tests - a more advanced feature of QuickCheck that John Hughes explains in this presentation.
Let's take each one in turn.
Preliminaries
Before writing the first test, you're going to set up tests and import the appropriate libraries:
module RintsProperties where
import Test.Framework (Test)
import Test.Framework.Providers.QuickCheck2
import Test.QuickCheck
import Q72168364
where the System Under Test (SUT) is defined in the Q72168364 library. The SUT itself is an action called rints (for Random INTS):
rints :: IO [Int]
Since it's going to generate random numbers, it'll have to run in IO.
Image
The first requirement says something about the image of the SUT. This is easily expressed as a property:
testProperty "Each integer is between 0 and 10" $ \() -> ioProperty $ do
actual <- rints
return $
counterexample ("actual: " ++ show actual) $
all (\i -> 0 <= i && i <= 10) actual
If you ignore some of the ceremony involved with producing a useful assertion message and such, the central assertion is this:
all (\i -> 0 <= i && i <= 10) actual
which verifies that all integers i in actual are between 0 and 10.
In true TDD fashion, the simplest implementation that passes the test is this:
rints :: IO [Int]
rints = return []
Always return an empty list. While degenerate, it fulfils the requirement.
No duplicates
The next requirement also translates easily to a predicate:
testProperty "The same number does not appear twice" $ \() -> ioProperty $ do
actual <- rints
return $ nub actual === actual
nub removes duplicates, so this assertion states that nub actual (actual where duplicates are removed) should be equal to actual. This is only going to be the case if there are no duplicates in actual.
In TDD fashion, the implementation unfortunately doesn't change:
rints :: IO [Int]
rints = return []
In fact, when I wrote this property, it passed right away. If you follow the red-green-refactor checklist, this isn't allowed. You should start each cycle by writing a red test, but this one was immediately green.
The proper reaction should be to discard (or stash) that test and instead write another one - perhaps taking a cue from the Transformation Priority Premise to pick a good next test.
For instructional reasons, however, I will stick with the order of requirements as they are stated in the OP. Instead of following the red-green-refactor checklist, I modified rints in various ways to assure myself that the assertion works as intended.
Length distribution
The next requirement isn't a simple predicate, but rather a statement about the distribution of outcomes. QuickCheck's cover function enables that - a feature that I haven't seen in other property-based testing libraries:
testProperty "Length is and distribution is correct" $ \() -> ioProperty $ do
actual <- rints
let l = length actual
return $
checkCoverage $
cover 90 (l == 3) "Length 3" $
cover 10 (l == 4) "Length 4"
True -- Base property, but really, the distribution is the test
The way that cover works, it needs to have a 'base property', but here I simply return True - the base property always passes, meaning that the distribution is the actual test.
The two instances of cover state the percentage with which each predicate (l == 3 and l == 4) should appear.
Running the tests with the degenerate implementation produces this test failure:
Length is and distribution is correct: [Failed]
*** Failed! Insufficient coverage (after 100 tests):
Only 0% Length 3, but expected 90%
As the message states, it expected 90% of Length 3 cases, but got 0%.
Again, following TDD, one can attempt to address the immediate error:
rints :: IO [Int]
rints = return [1,2,3]
This, however, now produces this test failure:
Length is and distribution is correct: [Failed]
*** Failed! Insufficient coverage (after 400 tests):
100.0% Length 3
Only 0.0% Length 4, but expected 10.0%
The property expects 10% Length 4 cases, but got 0%.
Perhaps the following is the simplest thing that could possibly work?
import System.Random.Stateful
rints :: IO [Int]
rints = do
  p <- uniformRM (1 :: Int, 100) globalStdGen
  if 10 < p then return [1,2,3] else return [1,2,3,4]
Perhaps not quite as random as you'd expect, but it passes all tests.
More threes
The final (explicit) requirement is that 3 should appear 50% of the times. That's another distribution property:
testProperty "3 appears 50% of the times" $ \() -> ioProperty $ do
actual <- rints
return $
checkCoverage $
cover 50 (3 `elem` actual) "3 present" $
cover 50 (3 `notElem` actual) "3 absent"
True -- Base property, but really, the distribution is the test
Running all tests produces this test failure:
3 appears 50% of the times: [Failed]
*** Failed! Insufficient coverage (after 100 tests):
100% 3 present
Only 0% 3 absent, but expected 50%
Not surprisingly, it says that the 3 present case happens 100% of the time.
In the spirit of TDD (perhaps a little undisciplined, but it illustrates what's going on), you may attempt to modify rints like this:
rints :: IO [Int]
rints = do
  p <- uniformRM (1 :: Int, 100) globalStdGen
  if 10 < p then return [1,2,3] else return [1,2,4,5]
This, however, doesn't work because the distribution is still wrong:
3 appears 50% of the times: [Failed]
*** Failed! Insufficient coverage (after 100 tests):
89% 3 present
11% 3 absent
Only 11% 3 absent, but expected 50%
Perhaps the following is the simplest thing that works. It's what I went with, at least:
rints :: IO [Int]
rints = do
  p <- uniformRM (1 :: Int, 100) globalStdGen
  includeThree <- uniformM globalStdGen
  if 10 < p
    then if includeThree then return [1,2,3] else return [1,2,4]
    else if includeThree then return [1,2,3,4] else return [1,2,4,5]
Not elegant, and it still doesn't produce random numbers, but it passes all tests.
Random numbers
While the above covers all explicitly stated requirements, it's clearly unsatisfactory, as it doesn't really produce random numbers between 0 and 10.
This is typical of the TDD process. As you write tests and SUT and let the two interact, you discover that more tests are required than you originally thought.
To be honest, I wasn't sure what the best approach would be to 'force' generation of all numbers between 0 and 10. Now that I had the hammer of distribution tests, I wrote the following:
testProperty "All numbers are represented" $ \() -> ioProperty $ do
actual <- rints
return $
checkCoverage $
cover 5 ( 0 `elem` actual) " 0 present" $
cover 5 ( 1 `elem` actual) " 1 present" $
cover 5 ( 2 `elem` actual) " 2 present" $
cover 5 ( 3 `elem` actual) " 3 present" $
cover 5 ( 4 `elem` actual) " 4 present" $
cover 5 ( 5 `elem` actual) " 5 present" $
cover 5 ( 6 `elem` actual) " 6 present" $
cover 5 ( 7 `elem` actual) " 7 present" $
cover 5 ( 8 `elem` actual) " 8 present" $
cover 5 ( 9 `elem` actual) " 9 present" $
cover 5 (10 `elem` actual) "10 present"
True -- Base property, but really, the distribution is the test
I admit that I'm not entirely happy with this, as it doesn't seem to 'scale' to problems where the function image is much larger. I'm open to better alternatives.
I also didn't want to be too specific about the exact distribution of each number. After all, 3 is going to appear more frequently than the others. For that reason, I just picked a small percentage (5%) to indicate that each number should appear not too rarely.
The implementation of rints so far failed this new test in the same manner as the other distribution tests.
Crudely, I changed the implementation to this:
rints :: IO [Int]
rints = do
  p <- uniformRM (1 :: Int, 100) globalStdGen
  let l = if 10 < p then 3 else 4
  ns <- shuffle $ [0..2] ++ [4..10]
  includeThree <- uniformM globalStdGen
  if includeThree
    then do
      let ns' = take (l - 1) ns
      shuffle $ 3 : ns'
    else
      return $ take l ns
While I feel that there's room for improvement, it passes all tests and actually produces random numbers:
ghci> rints
[5,2,1]
ghci> rints
[9,2,10]
ghci> rints
[8,1,3]
ghci> rints
[0,9,8]
ghci> rints
[0,10,3,6]
This example used QuickCheck with Haskell, but most of the ideas translate to other languages. QuickCheck's cover function may be an exception to that rule, since I'm not aware that it's been ported to common language implementations, but perhaps I'm just behind the curve.
In situations where something like cover isn't available, you'd have to write a test that loops through enough randomly generated test cases to verify that the distribution is as required. A little more work, but not impossible.
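A rough sketch of such a loop, in Python, might look like the following, where rints stands in for a port of the SUT and the sample size and tolerance are picked arbitrarily:

def test_length_distribution(samples=10_000, tolerance=0.03):
    # Sketch only: call the SUT many times and compare the observed
    # frequencies against the required 90%/10% split, within a tolerance.
    results = [rints() for _ in range(samples)]  # rints: assumed port of the SUT
    length3 = sum(1 for r in results if len(r) == 3) / samples
    length4 = sum(1 for r in results if len(r) == 4) / samples
    assert all(len(r) in (3, 4) for r in results)
    assert abs(length3 - 0.90) < tolerance
    assert abs(length4 - 0.10) < tolerance

Because the test is statistical, the tolerance has to be wide enough that spurious failures stay rare.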
Since Nikos Baxevanis asked, here's the shuffle implementation:
shuffle :: [a] -> IO [a]
shuffle xs = do
    ar <- newArray l xs
    forM [1..l] $ \i -> do
      j <- uniformRM (i, l) globalStdGen
      vi <- readArray ar i
      vj <- readArray ar j
      writeArray ar j vi
      return vj
  where
    l = length xs
    newArray :: Int -> [a] -> IO (IOArray Int a)
    newArray n = newListArray (1, n)
I lifted it from https://wiki.haskell.org/Random_shuffle and perhaps edited a bit.
I would like to try out test-driven development, but the project I am working on involves a lot of randomness
You should be aware that "randomness" hits TDD in a rather awkward spot, so it isn't the most straightforward "try-it-out" project.
There are two concerns - one that "randomness" is a very expensive assertion to make:
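For instance, take a deliberately degenerate "generator" - a hypothetical sketch in the spirit of the well-known xkcd joke:

def random_number():
    # chosen by fair dice roll; guaranteed to be random
    return 4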
How would you reliably distinguish between this implementation and a "real" random number generator that just happens to be emitting a finite sequence of 4s before changing to some other number?
So we get to choose between stable tests that don't actually express all of our constraints or more precise tests that occasionally report incorrect results.
One design approach here is to lean into "testability" - behind the facade of our interface will be an implementation that combines a general-purpose source of random bits with a deterministic function that maps a bit sequence to some result.
def randomListOfIntegers():
    seed_bits = generalPurpose.random_bits()
    return deterministicFunction(seed_bits)

def deterministicFunction(seed_bits):
    ???
The claim being that randomListOfIntegers is "so simple that there are obviously no deficiencies", so we can establish its correctness by inspection, and concentrate our effort on the design of deterministicFunction.
Now, we run into a second problem: the mapping of seed_bits to some observable result is arbitrary. Most business domain problems (ex: a payroll system) have a single expected output for any given input, but in random systems you still have some extra degrees of freedom. If you write a function that produces an acceptable answer given any sequence of bits, then my function, which reverses the bits then calls your function, will also produce acceptable answers -- even though my answers and your answers are different.
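To make that concrete, a hypothetical sketch (both function names are made up):

def my_function(seed_bits):
    # also maps bit sequences to acceptable answers - just different ones,
    # because it hands your function the same bits in reverse order
    return your_function(seed_bits[::-1])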
In effect, if we want a suite of tests that alert when a code change causes a variation in behavior, then we have to invent the specification of the behavior that we want to lock in.
And unless we have a good guess as to which arbitrary behaviors will support a clean implementation, that can be pretty painful.
(Alternatively, we just lean on our pool of "acceptance" tests, which ignore code changes that switch to a different arbitrary behavior -- it's trade offs all the way down).
One of the simpler implementations we might consider is to treat the seed_bits as an index into a sequence of candidate responses
def deterministicFunction(seed_bits):
    choices = ???
    n = computeIndex(seed_bits, len(choices))
    return choices[n]
This exposes yet another concern: k seed_bits means 2^k degrees of freedom; unless len(choices) happens to be a power of 2 and not bigger than 2^k, there's going to be some bias in choosing. You can make the bias error arbitrarily small by choosing a large enough value for k, but you can't eliminate it with a finite number of bits.
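A minimal sketch of computeIndex along those lines (assuming seed_bits arrives as a byte string) makes the bias easy to see:

def computeIndex(seed_bits, size):
    # Interpret the bits as an unsigned integer and reduce it modulo size.
    # With k bits there are 2**k equally likely values; unless size divides
    # 2**k evenly, the smaller indices come up slightly more often.
    return int.from_bytes(seed_bits, "big") % size

This is just the familiar modulo bias.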
Breaking down the problem further, we can split the work into two elements, one responsible for producing the pool of candidates, another for actually choosing one of them.
def deterministicFunction(seed_bits):
    return choose(seed_bits, weighted_candidates())

def choose(seed_bits, weighted_candidates):
    choices = []
    # note: the order of elements in this enumeration
    # is still arbitrary
    for candidate, weight in weighted_candidates:
        for _ in range(weight):
            choices.append(candidate)
    # technically, this is also still arbitrary
    n = computeIndex(seed_bits, len(choices))
    return choices[n]
At this point, we can decide to use "simplest thing that could possibly work" to implement computeIndex (test first, if you like), and this new weighted_candidates() function is also easy to test, since each test of it is just "count the candidates and make sure that the problem constraints are satisfied by the population as a whole". choose can be tested using much simpler populations as inputs.
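As a sketch of that kind of test - weighted_candidates itself is still left abstract here, and the exact-ratio assertions assume the weights are exact integer counts:

def test_weighted_candidates_population():
    population = []
    for candidate, weight in weighted_candidates():
        population.extend([candidate] * weight)

    total = len(population)
    # the constraints hold for every member of the population
    assert all(0 <= n <= 10 for c in population for n in c)
    assert all(len(set(c)) == len(c) for c in population)
    # the weighted population as a whole has the required distribution
    assert sum(1 for c in population if len(c) == 3) * 10 == total * 9  # 90%
    assert sum(1 for c in population if 3 in c) * 2 == total            # 50%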
This kind of implementation could be unsatisfactory - after all, we're building this data structure of candidates, and then another of choices, only to pick a single one. That may be the best we can do. Often, however, a different implementation is possible.
The problem specification, in effect, defines for us the size of the (weighted) population of responses. In other words, len(choices) is really some constant L.
choices = [ generate(n) for n in range(L)]
n = computeIndex(seed_bits, L)
return choices[n]
which in turn can be simplified to
n = computeIndex(seed_bits, L)
return generate(n)
Which is to say, we don't need to pass around a bunch of data structures if we can calculate which response is in the nth place.
Notice that while generate(n) still has arbitrary behavior, there are definitive assertions we can make about the data structure [generate(n) for n in range(L)].
Refactoring a bit to clean things up, we might have
def randomListOfIntegers():
    seed_bits = generalPurpose.random_bits()
    n = computeIndex(seed_bits, L)
    return generateListOfIntegers(n)
Note that this skeleton hasn't "emerged" from writing out a bunch of tests and refactoring, but instead from thinking about the problem and the choices that we need to consider in order to "control the gap between decision and feedback".
It's probably fair to call this a "spike" - a sandbox exercise that we use to better understand the problem we are trying to solve.
The same number doesn’t appear twice
An awareness of combinatorics is going to help here.
Basic idea: we can compute the set of all possible arrangements of 4 unique elements of the set [0,1,2,3,4,5,6,7,8,9,10], and we can use a technique called squashed ordering to produce a specific subset of them.
Here, we'd probably want to handle the special case of 3 a bit more carefully. The rough skeleton is going to look something like
def generateListOfIntegers(n):
    other_numbers = [0,1,2,4,5,6,7,8,9,10]
    has3, hasLength3, k = decode(n)

    if has3:
        if hasLength3:
            # 45 distinct candidates
            assert 0 <= k < 45
            return [3] + choose(other_numbers, 2, k)
        else:
            # 120 distinct candidates
            assert 0 <= k < 120
            return [3] + choose(other_numbers, 3, k)
    else:
        if hasLength3:
            # 120 distinct candidates
            assert 0 <= k < 120
            return choose(other_numbers, 3, k)
        else:
            # 210 distinct candidates
            assert 0 <= k < 210
            return choose(other_numbers, 4, k)
Where choose(other_numbers, j, k) returns the kth subset of other_numbers with j total elements, and decode(n) has the logic necessary to ensure that the population weights come out right.
The behavior of choose is arbitrary, but there is a "natural" order to the progression of subsets (ie, you can "sort" them), so it's reasonable to arbitrarily use the sorted order.
It's probably also worth noticing that choose is very general purpose - the list we pass in could be just about anything, and it really doesn't care what you do with the answer. Compare that with decode, where our definition of the "right" behavior is tightly coupled to its consumption by generateListOfIntegers.
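choose is left abstract here; one possible sketch of "the kth j-element subset in sorted (squashed) order", using standard combinatorial unranking, could look like this (the 0-based convention for k is an assumption):

from math import comb

def choose(items, j, k):
    # pick the kth j-element subset of items in sorted ("squashed") order,
    # assuming 0 <= k < comb(len(items), j)
    result, start = [], 0
    while j > 0:
        for i in range(start, len(items)):
            block = comb(len(items) - i - 1, j - 1)  # subsets starting at items[i]
            if k < block:
                result.append(items[i])
                start, j = i + 1, j - 1
                break
            k -= block
    return result

For example, choose([0,1,2,4,5,6,7,8,9,10], 2, 0) yields [0, 1], and choose([0,1,2,4,5,6,7,8,9,10], 2, 44) yields [9, 10].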
You may want to review Peter Seibel's Fischer Chess Exercise, to see the different approaches people were taking when TDD was new. Warning: the threading is horribly broken now, so you may need to sift through multiple threads to find all the good bits.
First of all, there is more than one approach to TDD, so there's no single right answer. But here's my take on this:
You mentioned that you don't need to test the exact statistical distribution, but I think that you must. Otherwise, writing the simplest code that satisfies your tests will result in a completely deterministic, non-random solution. (If you look at your requirements without thinking about randomness, you'll find that you can satisfy them using a simple loop and a few if statements.) But apparently, that's not what you really want. Therefore, in your case, you must write a test that checks the statistical distribution of your algorithm.
Such tests need to gather many results from your function under test and therefore may be slow, so some people will consider them a bad practice, but IMHO this is your only way to actually test what you really care about.
Assuming that this is not only a theoretical exercise, you may also want to save the results to a file that you can later examine manually (e.g. using Excel), check additional statistical properties of the results, and potentially add or adjust your tests accordingly.
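A rough sketch of that idea, assuming a Python port named random_list_of_integers and an arbitrary sample size and tolerance:

import csv

def test_three_appears_half_the_time(samples=10_000, tolerance=0.03):
    results = [random_list_of_integers() for _ in range(samples)]
    # keep the raw data around for manual inspection (e.g. in Excel)
    with open("random_list_samples.csv", "w", newline="") as f:
        csv.writer(f).writerows(results)
    observed = sum(1 for r in results if 3 in r) / samples
    assert abs(observed - 0.5) < tolerance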
I'm trying to replace some old unit tests with property-based testing (PBT), concretely with Scala and ScalaTest/ScalaCheck, but I think the problem is more general. The simplified situation is: if I have a method I want to test:
def upcaseReverse(s:String) = s.toUpperCase.reverse
Normally, I would have written unit tests like:
assertEquals("GNIRTS", upcaseReverse("string"))
assertEquals("", upcaseReverse(""))
// ... corner cases I could think of
So, for each test, I write the output I expect, no problem. Now, with PBT, it'd be like:
property("strings are reversed and upper-cased") {
forAll { (s: String) =>
assert ( upcaseReverse(s) == ???) //this is the problem right here!
}
}
As I try to write a test that will be true for all String inputs, I find myself having to write the logic of the method again in the test. In this case the test would look like:
assert ( upcaseReverse(s) == s.toUpperCase.reverse)
That is, I had to write the implementation in the test to make sure the output is correct.
Is there a way out of this? Am I misunderstanding PBT, and should I be testing other properties instead, like :
"strings should have the same length as the original"
"strings should contain all the characters of the original"
"strings should not contain lower case characters"
...
That is also plausible, but it sounds rather contrived and less clear. Can anybody with more experience in PBT shed some light here?
EDIT: following @Eric's sources I got to this post, and there's an example of exactly what I mean (at "Applying the categories one more time"): to test the method Times (in F#):
type Dollar(amount:int) =
    member val Amount = amount
    member this.Add add =
        Dollar (amount + add)
    member this.Times multiplier =
        Dollar (amount * multiplier)
    static member Create amount =
        Dollar amount
the author ends up writing a test that goes like:
let ``create then times should be same as times then create`` start multiplier =
    let d0 = Dollar.Create start
    let d1 = d0.Times(multiplier)
    let d2 = Dollar.Create (start * multiplier) // This one duplicates the code of Times!
    d1 = d2
So, in order to test a method, the code of the method is duplicated in the test. In this case it's something as trivial as multiplying, but I think it extrapolates to more complex cases.
This presentation gives some clues about the kind of properties you can write for your code without duplicating it.
In general it is useful to think about what happens when you compose the method you want to test with other methods on that class:
size
++
reverse
toUpperCase
contains
For example:
upcaseReverse(y) ++ upcaseReverse(x) == upcaseReverse(x ++ y)
Then think about what would break if the implementation was broken. Would the property fail if:
size was not preserved?
not all characters were uppercased?
the string was not properly reversed?
1. is actually implied by 3. and I think that the property above would break for 3. However it would not break for 2 (if there was no uppercasing at all for example). Can we enhance it? What about:
upcaseReverse(y) ++ x.reverse.toUpperCase == upcaseReverse(x ++ y)
I think this one is ok but don't believe me and run the tests!
Anyway I hope you get the idea:
compose with other methods
see if there are equalities which seem to hold (things like "round-tripping" or "idempotency" or "model-checking" in the presentation)
check if your property will break when the code is wrong
Note that 1. and 2. are implemented by a library named QuickSpec and 3. is "mutation testing".
Addendum
About your Edit: the Times operation is just a wrapper around * so there's not much to test. However in a more complex case you might want to check that the operation:
has a unit element
is associative
is commutative
is distributive with the addition
If any of these properties fails, this would be a big surprise. If you encode those properties as generic properties for any binary relation T x T -> T you should be able to reuse them very easily in all sorts of contexts (see the Scalaz Monoid "laws").
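Sketched in Python for illustration (the names are assumptions), such reusable law checks might look like this:

def check_binary_operation_laws(op, unit, samples):
    # Reusable law checks for a binary operation T x T -> T (sketch only).
    # Distributivity would need a second operation, so it is omitted here.
    for a in samples:
        assert op(unit, a) == a and op(a, unit) == a        # unit element
        for b in samples:
            assert op(a, b) == op(b, a)                     # commutativity
            for c in samples:
                assert op(op(a, b), c) == op(a, op(b, c))   # associativity

# e.g. check_binary_operation_laws(lambda x, y: x * y, 1, range(-5, 6))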
Coming back to your upperCaseReverse example I would actually write 2 separate properties:
"upperCaseReverse must uppercase the string" >> forAll { s: String =>
upperCaseReverse(s).forall(_.isUpper)
}
"upperCaseReverse reverses the string regardless of case" >> forAll { s: String =>
upperCaseReverse(s).toLowerCase === s.reverse.toLowerCase
}
This doesn't duplicate the code and states 2 different things which can break if your code is wrong.
In conclusion, I had the same question as you before and felt pretty frustrated about it, but after a while I found more and more cases where I was not duplicating my code in properties, especially when I started thinking about
combining the tested function with other functions (.isUpper in the first property)
comparing the tested function with a simpler "model" of computation ("reverse regardless of case" in the second property)
I have called this problem "convergent testing", but I can't figure out why or where the term comes from, so take it with a grain of salt.
For any test you run the risk of the complexity of the test code approaching the complexity of the code under test.
In your case, the test code and the production code wind up being basically the same, which means writing the same code twice. Sometimes there is value in that. For example, if you are writing code to keep someone in intensive care alive, you could write it twice to be safe. I wouldn't fault you for that abundance of caution.
For other cases there comes a point where the likelihood of the test breaking invalidates the benefit of the test catching real issues. For that reason, even if it is against best practice in other ways (enumerating things that should be calculated, not writing DRY code) I try to write test code that is in some way simpler than the production code, so it is less likely to fail.
If I cannot find a way to write test code that is simpler than the production code and still maintainable (read: "that I also like"), I move that test to a "higher" level (for example, unit test -> functional test).
I just started playing with property based testing but from what I can tell it is hard to make it work with many unit tests. For complex units, it can work, but I find it more helpful at functional testing so far.
For functional testing you can often write the rule a function has to satisfy much more simply than you can write a function that satisfies the rule. This feels to me a lot like the P vs NP problem, where you can write a program to VALIDATE a solution in polynomial time, but all known programs to FIND a solution take much longer. That seems like a wonderful case for property testing.
Suppose I have the following function (pseudocode):
bool checkObjects(a, b)
{
    if ((a.isValid() && a.hasValue()) ||
        (b.isValid() && b.hasValue()))
    {
        return true;
    }
    return false;
}
Which tests should I write to be able to claim that it's 100% covered?
There are 16 possible input combinations in total. Should I write 16 test cases, or should I try to act smart and omit some test cases?
For example, should I write test for
[a valid and has value, b valid and has value]
if I tested that it returns what expected for
[a valid and has value, b invalid and has value]
and
[a invalid and has value, b valid and has value]
?
Thanks!
P.S.: Maybe someone can suggest good reading on unit testing approaches?
Test Driven Development by Kent Beck is well-done and is becoming a classic (http://www.amazon.com/Test-Driven-Development-Kent-Beck/dp/0321146530)
If you wanted to be thorough to the max then yes 16 checks would be worthwhile.
It depends. Speaking personally, I'd be satisfied to test all boundary conditions. So both cases where it is true, but making one item false would make the overall result false, and all 4 false cases where making one item true would make the overall result true. But that is a judgment call, and I wouldn't fault someone who did all 16 cases.
Incidentally if you unit tested one true case and one false one, code coverage tools would say that you have 100% coverage.
If you are worried about writing 16 test cases, you can try features like NUnit's TestCase or MbUnit's RowTest. Other languages/frameworks should have similar features.
This would allow you to test all 16 conditions with a single (and small) test case.
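For example, in Python the same idea can be sketched with pytest's parametrize; FakeObject and checkObjects are assumed names here:

import itertools
import pytest

class FakeObject:
    # minimal stand-in exposing the two flags the function under test reads
    def __init__(self, valid, has_value):
        self._valid, self._has_value = valid, has_value
    def isValid(self):
        return self._valid
    def hasValue(self):
        return self._has_value

@pytest.mark.parametrize("av, ah, bv, bh",
                         list(itertools.product([True, False], repeat=4)))
def test_check_objects_all_16_combinations(av, ah, bv, bh):
    expected = (av and ah) or (bv and bh)
    assert checkObjects(FakeObject(av, ah), FakeObject(bv, bh)) == expected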
If testing seems hard, think about refactoring. I can see several approaches here. First merge isValid() and hasValue() into one method and test it separately. And why have checkObjects(a, b) testing two unrelated objects? Why can't you have checkObject(a) and checkObject(b), decreasing the exponential growth of possibilities further? Just a hint.
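A sketch of that hint, with hypothetical names:

def checkObject(a):
    # one object, one check - only 4 input combinations to cover
    return a.isValid() and a.hasValue()

def checkObjects(a, b):
    return checkObject(a) or checkObject(b)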
If you really want to test all 16 possibilities, consider some more table-ish tools, like Fitnesse (see http://fitnesse.org/FitNesse.UserGuide.FitTableStyles). Also check Parameterized JUnit runner and TestNG.
Consider the following code (from a requirement that says that 3 is special for some reason):
bool IsSpecial(int value)
    if (value == 3)
        return true
    else
        return false
I would unit test this with a couple of functions - one called TEST(3IsSpecial) that asserts that when passed 3 the function returns true and another that passes some random value other than 3 and asserts that the function returns false.
When the requirement changes and say it now becomes 3 and 20 are special, I would write another test that verifies that when called with 20 this function returns true as well. That test would fail and I would then go and update the if condition in the function.
Now, what if there are people on my team who do not believe in unit testing and they make this change? They will go and change the code directly, and since my second unit test might not test for 20 (it could be picking a random int or have some other int hardcoded), my tests are now out of sync with the code. How do I ensure that when they change the code, some unit test or other fails?
I could be doing something grossly wrong here so any other techniques to get around this are also welcome.
That's a good question. As you note a Not3IsNotSpecial test picking a random non-3 value would be the traditional approach. This wouldn't catch a change in the definition of "special".
In a .NET environment you can use the new code contracts capability to write the test predicate (the postcondition) directly in the method. The static analyzer would catch the defect you proposed. For example:
Contract.Ensures(value == 3 || Contract.Result<Boolean>() == false); // i.e. a non-3 value must yield false
I think anybody that's a TDD fan is experimenting with contracts now to see use patterns. The idea that you have tools to prove correctness is very powerful. You can even specify these predicates for an interface.
The only testing approach I've seen that would address this is Model Based Testing. The idea is similar to the contracts approach. You set up the Not3IsNotSpecial condition abstractly (e.g., IsSpecial(x => x != 3) == false) and let a model execution environment generate concrete tests. I'm not sure, but I think these environments do static analysis as well. Anyway, you let the model execution environment run continuously against your SUT. I've never used such an environment, but the concept is interesting.
Unfortunately, that specific scenario is something that is difficult to guard against. With a function like IsSpecial, it's unrealistic to test all four billion negative test cases, so, no, you're not doing something grossly wrong.
Here's what comes to me off the top of my head. Many repositories have hooks that allow you to run some process on each check-in, such as running the unit tests. It's possible to set a criterion that newly checked in code must reach some threshold of code coverage under unit tests. If the commit does not meet certain metrics, it is rejected.
I've never had to set one of these systems up, so I don't know what is involved, but I do know it's possible.
And believe me, I feel your pain. I work with people who are similarly resistant to unit testing.
One thing you need to think about is why 3 is a special value and the others are not. If it is defining some aspect of your application, you can take that aspect out and make an enum out of it.
Then IsSpecial just checks whether the value exists in the enum, and for the enum you write a test covering the possible values. If a new possible value is added, that test should fail.
So your method will become:
bool IsSpecial(int value)
    if (SpecialValues.has(value))
        return true
    else
        return false
and your SpecialValues will be an enum like:
enum SpecialValues {
    Three(3), Twenty(20);

    public final int value;

    SpecialValues(int value) { this.value = value; }
}
Now you should write tests for the enum's possible values. A simple test can check the total number of values, and another can check the values themselves.
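A sketch of those two tests, rendered with Python's enum module for brevity (the Java version would be analogous):

from enum import Enum

class SpecialValues(Enum):
    THREE = 3
    TWENTY = 20

def test_number_of_special_values():
    assert len(SpecialValues) == 2

def test_special_values_members():
    assert {v.value for v in SpecialValues} == {3, 20}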
The other point to make is that in a less contrived example:
20 might have been some valid condition to test for based on knowledge of the business domain. Writing tests in a BDD style based on knowledge of the business problem might have helped you explicitly catch it.
4 might have been a good value to test for due to its status as a boundary condition. This may have been more likely to change in the real world so would more likely show up in a full test case.
I have installed C++Test with only the Unit Test license, as a Visual Studio 2005 plugin (cpptest_7.2.11.35_win32_vs2005_plugin.exe).
I have a sample similar to the following:
bool MyFunction(... parameters... )
{
    bool bRet = true;
    // do something
    if( some_condition )
    {
        // do something
        bRet = CallToAFunctionThatCanReturnBothTrueAndFalse....
    }
    else
    {
        bRet = false;
        // do something
    }
    if(bRet == false)
    {
        // do something
    }
    return bRet;
}
In my case after running the coverage tool I have the following results (for a function similar to the previously mentioned):
[LC=100 BC=100 PC=75 DC=100 SCC=100 MCDC=50 (%)]
I really don't understand why I don't have 100% coverage when it comes to PathCoverage (PC).
Also if someone who has experience with C++Test Parasoft could explain the low MCDC coverage for me that would be great.
What should I do to increase coverage? I'm out of ideas in this case.
Directions to (some parts of) the documentation are welcome.
Thank you,
Iulian
I can't help with the specific tool you're using, but the general idea with path coverage is that each possible path through the code should be executed.
If you draw a flowchart of the program, branching at each if/break/continue, etc., you should see which paths your tests are taking through the program. To get 100% (which isn't totally necessary, nor does it assure a perfect test suite) your tests will have to take every possible path through the code, not just execute every line.
Hope that helps.
This is a good reference on the various types of code coverage: http://www.bullseye.com/coverage.html.
MCDC: To improve MCDC coverage you'll need to look at some_condition. Assuming it's a complex boolean expression, you'll need to look at whether you're exercising the necessary combinations of values. Specifically, each boolean sub-expression needs to be exercised true and false. For example, if some_condition were a && b, MC/DC would call for cases such as (true, true), (false, true), and (true, false), so that each operand is shown to independently affect the outcome.
Path: One of the things mentioned in the link above as being a disadvantage of path coverage is that many paths are impossible to exercise. That may be the case with your example.
You need at least two test cases to get 100% branch coverage: one where some_condition is true and one where it is not. If you have those, you should get 100% coverage.
That said, you shouldn't treat 100% coverage as perfect. You would need 3 tests in this case so that all feasible combinations are exercised. Look up cyclomatic complexity to learn more about that.
There are four hypothetical paths through that function. Each if-clause doubles the number of paths. Each if-statement is a branch where you can go two different ways. So whenever your tool encounters an "if", it assumes the code can either take the "true" branch or the "false" branch. However, this is not always possible. Consider:
bool x = true;
if (x) {
    do_something();
}
The "false" branch of the if-statement is unreachable. This is an obvious example, but when you factor in several if-statements it becomes increasingly difficult to see whether a path is possible or not.
There are only three feasible paths in your code. The path that takes the "false" branch of the first if statement and then skips the body of the second if is unreachable, because bRet is guaranteed to be false at that point.
Your tool is not smart enough to realize that.
That being said, even if the tool were perfect, obtaining 100% path coverage is probably unlikely in a real application. However, very low path coverage is a sure sign that your method has too high cyclomatic complexity.
Personally, I think it's bad form to start ANY function with
bool retCode = true;
You're making an explicit assumption that it'll succeed by default, and then fail under certain conditions.
Programmers coming after you will not make this same assumption.
Fail fast, fail early.
And as others have said, if you want to test failure cases, you have to code tests that fail.