How to guarantee FsCheck reproducibility - unit-testing

We want to use FsCheck as part of our unit testing in continuous integration. As such, deterministic and reproducible behaviour is very important for us.
FsCheck, being a random testing framework, can generate test cases that occasionally break a test. The key point is that we do not only use properties that must hold for every input, such as List.rev >> List.rev === id. We also do some numerics, and some generated test cases can break the test because they are badly conditioned.
The question is: how can we guarantee that once the test succeeds, it will always succeed?
So far I see the following options:
hard code the seed, e.g. 0. This would be the easiest solution.
make very specific custom generators which avoid bad examples. Certainly possible, but could turn out pretty hard, especially if there are many objects to generate.
live with it, that in some cases the build might be red due to pathological cases and simply re-run.
What is the idiomatic way of using FsCheck in such a setting?

some test cases can cause the test to break because of being badly conditioned.
That sounds like you need a Conditional Property:
let isOk x =
    match x with
    | 42 -> false
    | _ -> true
let MyProperty (x:int) = isOk x ==> // check x here...
(assuming that you don't like the number 42.)

(I started writing a comment but it got so long I guess it deserved its own answer).
It's very common to test properties with FsCheck that don't hold for every input. For example, FsCheck will trivially refute your List.rev example if you run it for list<float>.
Numerical stability is a tricky problem in itself - there isn't any non-determinism in FsCheck to blame here (FsCheck is totally deterministic, it's just an input generator...). The "non-determinism" you're referring to may be things like bugs in floating point operations in certain processors and so on. But even in that case, wouldn't you like to know about them? And if your algorithm is numerically unstable for a class of inputs, wouldn't you like to know about it? If you don't, it seems to me like you're setting yourself up for some real non-determinism... in production.
The idiomatic way to write properties that don't hold for all inputs of a given type in FsCheck is to write a generator & shrinker. You can use ==> as a step up to that, but it doesn't scale up well to complex preconditions. You say this could turn out pretty hard - that's true in the sense that I guarantee you'll learn something about your code. A good thing!
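To make the difference between filtering (what ==> does) and a dedicated generator concrete, here is a minimal sketch in Python rather than F#, using only the standard library. The names well_conditioned, gen_by_filtering and gen_by_construction and the threshold 1e-3 are made up for illustration; FsCheck's own generator API looks different.

```python
import random

THRESHOLD = 1e-3  # hypothetical cut-off: divisors closer to zero count as badly conditioned


def well_conditioned(x):
    """Precondition: the divisor is not too close to zero."""
    return abs(x) >= THRESHOLD


def gen_by_filtering(rng, max_tries=1000):
    """Filter style: draw arbitrary values and discard those failing the precondition."""
    for _ in range(max_tries):
        x = rng.uniform(-1.0, 1.0)
        if well_conditioned(x):
            return x
    raise RuntimeError("precondition too strict: every candidate was discarded")


def gen_by_construction(rng):
    """Generator style: build a value that satisfies the precondition by design."""
    magnitude = rng.uniform(THRESHOLD, 1.0)
    sign = rng.choice([-1.0, 1.0])
    return sign * magnitude


def prop_division_round_trips(x):
    """Property under test: (1 / x) * x is (almost) 1 for well-conditioned x."""
    return abs((1.0 / x) * x - 1.0) < 1e-9


rng = random.Random(2024)
filtered = [gen_by_filtering(rng) for _ in range(200)]
constructed = [gen_by_construction(rng) for _ in range(200)]
```

The filtering version throws away more and more candidates as the precondition gets stricter, while the constructed generator never discards anything, which is the scaling problem mentioned above.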
Fixing the seed is a bad idea, except for reproducing a previously discovered bug. I mean, in practice what would you do: keep re-running the test until it passes, then fix the seed and declare "job done"?
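As a rough illustration of "fix the seed only to reproduce a failure", here is a small Python sketch (standard library only, not FsCheck). The runner run_property and the deliberately false property are hypothetical: the point is only that a normal run picks a fresh seed and reports it, and a replay passes that seed back in.

```python
import random


def run_property(prop, gen, seed=None, trials=100):
    """Check prop on generated inputs; return (passed, seed, failing_input)."""
    if seed is None:
        # Normal CI run: pick a fresh seed so different inputs get explored.
        seed = random.SystemRandom().randrange(2**32)
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen(rng)
        if not prop(value):
            return False, seed, value
    return True, seed, None


def gen_small_int(rng):
    return rng.randint(-50, 50)


def prop_square_at_least_100(x):
    """Deliberately false property, so that there is something to reproduce."""
    return x * x >= 100


first_run = run_property(prop_square_at_least_100, gen_small_int)
passed, seed, witness = first_run
if not passed:
    print(f"Falsified by {witness}; replay with seed={seed}")

# Replaying with the reported seed regenerates exactly the same inputs.
replay = run_property(prop_square_at_least_100, gen_small_int, seed=seed)
```

The seed from the failing run goes into the bug report or a regression test; the regular build keeps exploring new inputs.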

Related

Should function composition and piping be tested?

In F# (and most of the functional languages) some codes are extremely short as follows:
let f = getNames
        >> Observable.flatmap ObservableJson.jsonArrayToObservableObjects<string>
or :
let jsonArrayToObservableObjects<'t> =
    JsonConvert.DeserializeObject<'t[]>
    >> Observable.ToObservable
And the simplest property-based test I ended up for the latter function is :
testList "ObservableJson" [
    testProperty "Should convert an Observable of `json` array to Observable of single F# objects" <| fun _ ->
        //--Arrange--
        let (sArray, jsonArray) = createAJsonArrayOfString stringArray
        //--Act--
        let actual =
            jsonArray
            |> ObservableJson.jsonArrayToObservableObjects<string>
            |> Observable.ToArray
            |> Observable.Wait
        //--Assert--
        Expect.sequenceEqual actual sArray
]
Even ignoring the arrange part, the test is longer than the function under test, so it's harder to read than the function under test!
What would be the value of testing when it's harder to read than the production code?
On the other hand:
I wonder whether functions that are merely a composition of other functions are safe to leave untested?
Should they be tested at integration and acceptance level?
And what if they are short but do complex operations?
That depends on your definition of 'functional programming'. Or, more precisely, on how close you want to stay to the origin of functional programming: mathematics, in both the broad and the narrow sense.
Let's take something relevant to programming, say the theory of mappings. Your question could be translated like this: given a bijection from A to B and a bijection from B to C, should I prove that the composition of the two is a bijection as well? The answer is twofold: you definitely should, and you do it only once, because your proof is generic enough to cover all possible cases.
Coming back to programming, this means that pipelining has to be tested (proved) only once, and I guess that was done before it was deployed to production. After that, it is your job as a programmer to create functions (mappings) of such quality that the desired properties are preserved when they are composed with a pipeline operator or anything else. Once again, it's better to stick with generic arguments than to write tons of similar tests.
So, finally, we come down to a much more valuable question: how can one guarantee that some operation preserves some property? It turns out that the easiest way to establish such a fact is to work with types like Monoid from Haskell: a Monoid represents any associative binary operation A -> A -> A together with an identity element of type A. Having such generic abstractions is extremely valuable, and it is the best known way of being explicit about what your code is designed to do and how.
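To show what "stick with generic arguments" could look like in a test suite, here is a small Python sketch of a reusable monoid-law check. It is only an illustration: check_monoid_laws is a made-up helper, and checking the laws on a handful of samples is evidence, not a proof.

```python
def check_monoid_laws(op, identity, samples):
    """Return descriptions of violated monoid laws (empty list if none were found)."""
    violations = []
    # Identity law: combining with the identity element changes nothing.
    for a in samples:
        if op(identity, a) != a or op(a, identity) != a:
            violations.append(f"identity fails for {a!r}")
    # Associativity law: grouping does not matter.
    for a in samples:
        for b in samples:
            for c in samples:
                if op(op(a, b), c) != op(a, op(b, c)):
                    violations.append(f"associativity fails for {a!r}, {b!r}, {c!r}")
    return violations


ints = [-3, 0, 1, 2, 7]
strings = ["", "a", "bc", "xyz"]

addition_violations = check_monoid_laws(lambda a, b: a + b, 0, ints)
concat_violations = check_monoid_laws(lambda a, b: a + b, "", strings)
subtraction_violations = check_monoid_laws(lambda a, b: a - b, 0, ints)
```

The same helper is written once and reused for integer addition and string concatenation; subtraction with 0 is included as an operation that is not a monoid, so the check has something to reject.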
Personally I would NOT test it.
In fact having less need for testing and instead relying more on stricter compiler rules, side effect free functions, immutability etc. is one major reason why I prefer F# over C#.
Of course I continue (unit) testing "custom logic", e.g. algorithmic code.

Does property based testing make you duplicate code?

I'm trying to replace some old unit tests with property based testing (PBT), concretely with Scala and ScalaTest/ScalaCheck, but I think the problem is more general. The simplified situation is this: I have a method I want to test:
def upcaseReverse(s:String) = s.toUpperCase.reverse
Normally, I would have written unit tests like:
assertEquals("GNIRTS", upcaseReverse("string"))
assertEquals("", upcaseReverse(""))
// ... corner cases I could think of
So, for each test, I write the output I expect, no problem. Now, with PBT, it'd be like :
property("strings are reversed and upper-cased") {
  forAll { (s: String) =>
    assert ( upcaseReverse(s) == ???) //this is the problem right here!
  }
}
As I try to write a test that will be true for all String inputs, I find myself having to write the logic of the method again in the tests. In this case the test would look like :
assert ( upcaseReverse(s) == s.toUpperCase.reverse)
That is, I had to write the implementation in the test to make sure the output is correct.
Is there a way out of this? Am I misunderstanding PBT, and should I be testing other properties instead, like :
"strings should have the same length as the original"
"strings should contain all the characters of the original"
"strings should not contain lower case characters"
...
That is also plausible, but it sounds much more contrived and less clear. Can anybody with more experience in PBT shed some light here?
EDIT: following @Eric's sources I got to this post, and there is exactly an example of what I mean (at Applying the categories one more time): to test the method Times (in F#):
type Dollar(amount:int) =
    member val Amount = amount
    member this.Add add =
        Dollar (amount + add)
    member this.Times multiplier =
        Dollar (amount * multiplier)
    static member Create amount =
        Dollar amount
the author ends up writing a test that goes like:
let ``create then times should be same as times then create`` start multiplier =
    let d0 = Dollar.Create start
    let d1 = d0.Times(multiplier)
    let d2 = Dollar.Create (start * multiplier) // This one duplicates the code of Times!
    d1 = d2
So, in order to test a method, the code of the method is duplicated in the test. In this case it is something as trivial as multiplying, but I think it extrapolates to more complex cases.
This presentation gives some clues about the kind of properties you can write for your code without duplicating it.
In general it is useful to think about what happens when you compose the method you want to test with other methods on that class:
size
++
reverse
toUpperCase
contains
For example:
upcaseReverse(y) ++ upcaseReverse(x) == upcaseReverse(x ++ y)
Then think about what would break if the implementation was broken. Would the property fail if:
1. size was not preserved?
2. not all characters were uppercased?
3. the string was not properly reversed?
1. is actually implied by 3. and I think that the property above would break for 3. However, it would not break for 2. (if there was no uppercasing at all, for example). Can we enhance it? What about:
upcaseReverse(y) ++ x.reverse.toUpperCase == upcaseReverse(x ++ y)
I think this one is ok but don't believe me and run the tests!
Anyway I hope you get the idea:
1. compose with other methods
2. see if there are equalities which seem to hold (things like "round-tripping" or "idempotency" or "model-checking" in the presentation)
3. check if your property will break when the code is wrong
Note that 1. and 2. are implemented by a library named QuickSpec and 3. is "mutation testing".
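Following the "don't believe me and run the tests" advice, here is a hand-rolled Python sketch of the two concatenation properties together with a deliberately broken "mutant" that forgets to upper-case. It uses plain random ASCII strings instead of ScalaCheck, and all helper names are made up for the example.

```python
import random
import string


def upcase_reverse(s):
    return s.upper()[::-1]


def broken_upcase_reverse(s):
    """Deliberately broken 'mutant': reverses but forgets to upper-case."""
    return s[::-1]


def prop_concat(f, x, y):
    """f(y) ++ f(x) == f(x ++ y)"""
    return f(y) + f(x) == f(x + y)


def prop_concat_enhanced(f, x, y):
    """f(y) ++ x.reverse.toUpperCase == f(x ++ y)"""
    return f(y) + x[::-1].upper() == f(x + y)


def holds_for_all(prop, f, trials=200, seed=0):
    """Check prop on a few fixed pairs and then on random ASCII strings."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    pairs = [("ab", "cd"), ("", "")]
    for _ in range(trials):
        x = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))
        y = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))
        pairs.append((x, y))
    return all(prop(f, x, y) for x, y in pairs)


first_catches_mutant = not holds_for_all(prop_concat, broken_upcase_reverse)
enhanced_catches_mutant = not holds_for_all(prop_concat_enhanced, broken_upcase_reverse)
```

Both properties hold for the real function, but only the enhanced one rejects the mutant, which is the "would the property break if the code were wrong?" check done by hand.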
Addendum
About your Edit: the Times operation is just a wrapper around * so there's not much to test. However in a more complex case you might want to check that the operation:
has a unit element
is associative
is commutative
is distributive with the addition
If any of these properties fails, this would be a big surprise. If you encode those properties as generic properties for any binary operation T x T -> T you should be able to reuse them very easily in all sorts of contexts (see the Scalaz Monoid "laws").
Coming back to your upperCaseReverse example I would actually write 2 separate properties:
"upperCaseReverse must uppercase the string" >> forAll { s: String =>
upperCaseReverse(s).forall(_.isUpper)
}
"upperCaseReverse reverses the string regardless of case" >> forAll { s: String =>
upperCaseReverse(s).toLowerCase === s.reverse.toLowerCase
}
This doesn't duplicate the code and states 2 different things which can break if your code is wrong.
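For comparison, here is roughly how the two properties could be tried out in Python with a hand-rolled generator. One adaptation is mine and not from the answer: the first property is phrased as "no lowercase character is left", because a literal "every character is uppercase" check would reject inputs that contain digits or spaces. The helper names are hypothetical.

```python
import random
import string


def upcase_reverse(s):
    return s.upper()[::-1]


def prop_no_lowercase_left(f, s):
    """Uppercasing: no lowercase character survives (digits etc. are allowed)."""
    return not any(c.islower() for c in f(s))


def prop_reversed_ignoring_case(f, s):
    """Reversal: compare against plain reversal, ignoring case."""
    return f(s).lower() == s[::-1].lower()


def counterexample(prop, f, trials=200, seed=0):
    """Return the first input that falsifies prop, or None if none was found."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + " -_"
    inputs = ["", "aB1"]  # a couple of fixed cases, then random ones
    inputs += ["".join(rng.choice(alphabet) for _ in range(rng.randint(0, 10)))
               for _ in range(trials)]
    for s in inputs:
        if not prop(f, s):
            return s
    return None


# Two wrong implementations, each caught by a different property.
missed_uppercase = counterexample(prop_no_lowercase_left, lambda s: s[::-1])
missed_reverse = counterexample(prop_reversed_ignoring_case, str.upper)
```

Each property targets one way the implementation can go wrong, and neither restates upper-then-reverse as a whole.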
In conclusion, I had the same question as you before and felt pretty frustrated about it, but after a while I found more and more cases where I was not duplicating my code in properties, especially when I started thinking about
combining the tested function with other functions (.isUpper in the first property)
comparing the tested function with a simpler "model" of computation ("reverse regardless of case" in the second property)
I have called this problem "convergent testing" but I can't figure out why or where the term comes from, so take it with a grain of salt.
For any test you run the risk of the complexity of the test code approaching the complexity of the code under test.
In your case, the code winds up being basically the same, which is just writing the same code twice. Sometimes there is value in that. For example, if you are writing code to keep someone in intensive care alive, you could write it twice to be safe. I wouldn't fault you for the abundance of caution.
For other cases there comes a point where the likelihood of the test breaking invalidates the benefit of the test catching real issues. For that reason, even if it is against best practice in other ways (enumerating things that should be calculated, not writing DRY code) I try to write test code that is in some way simpler than the production code, so it is less likely to fail.
If I cannot find a way to write test code that is simpler than the production code and also maintainable (read: "that I also like"), I move that test to a "higher" level (for example unit test -> functional test).
I just started playing with property based testing but from what I can tell it is hard to make it work with many unit tests. For complex units, it can work, but I find it more helpful at functional testing so far.
For functional testing you can often write the rule a function has to satisfy much more simply than you can write a function that satisfies the rule. This feels to me a lot like the P vs NP problem. Where you can write a program to VALIDATE a solution in linear time, but all known programs to FIND a solution take much longer. That seems like a wonderful case for property testing.
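A small Python sketch of "the rule is simpler than the function": an integer square root found by binary search, checked against a one-line rule. Both functions are invented for this illustration and are not taken from any library.

```python
import random


def integer_sqrt(n):
    """Function under test: floor of the square root, found by binary search."""
    lo, hi = 0, n + 1  # invariant: lo*lo <= n < hi*hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo


def is_integer_sqrt(n, r):
    """The rule: r is the largest non-negative integer whose square does not exceed n."""
    return r >= 0 and r * r <= n < (r + 1) * (r + 1)


rng = random.Random(11)
samples = [0, 1, 2, 3, 4, 8, 9] + [rng.randrange(10**12) for _ in range(500)]
all_ok = all(is_integer_sqrt(n, integer_sqrt(n)) for n in samples)
```

Validating a candidate answer takes two multiplications and two comparisons, whereas finding it needs a loop with an invariant that is easy to get subtly wrong.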

Unit testing with random data

I've read that generating random data in unit tests is generally a bad idea (and I do understand why), but testing on random data and then constructing a fixed unit test case from random tests which uncovered bugs seems nice. However I don't understand how to organize it nicely. My question is not related to a specific programming language or to a specific unit test framework actually, so I'll use python and some pseudo unit test framework. Here's how I see coding it:
def random_test_cases():
    datasets = [
        dataset1,
        dataset2,
        ...
        datasetn
    ]
    for dataset in datasets:
        assertTrue(...)
        assertEquals(...)
        assertRaises(...)
        # and so on
The problem is: when this test case fails I can't figure out which dataset caused failure. I see two ways of solving it:
Create a single test case per dataset — the problem is load of test cases and code duplication.
Usually the test framework lets us pass a message to assert functions (in my example I could do something like assertTrue(..., message = str(dataset))). The problem is that I would have to pass such a message to each assert, which does not look elegant either.
Is there a simpler way of doing it?
I still think it's a bad idea.
Unit tests need to be straightforward. Given the same piece of code and the same unit test, you should be able to run it infinitely and never get a different response unless there's an external factor coming in to play. A goal contrary to this will increase maintenance cost of your automation, which defeats the purpose.
Outside of the maintenance aspect, to me it seems lazy. If you put thought into your functionality and understand the positive as well as the negative test cases, developing unit tests is straightforward.
I also disagree with the user who shows how to do multiple test cases inside of the same test case. When a test fails, you should be able to tell immediately which test failed and know why it failed. Tests should be as simple as you can make them and as concise/relevant to the code under test as possible.
You could define tests by extension instead of enumeration, or you could call multiple test cases from a single case.
calling multiple test cases from a single test case:
MyTest()
{
    MyTest(1, "A")
    MyTest(1, "B")
    MyTest(2, "A")
    MyTest(2, "B")
    MyTest(3, "A")
    MyTest(3, "B")
}
And there are sometimes elegant ways to achieve this with some testing frameworks. Here is how to do it in NUnit:
[Test, Combinatorial]
public void MyTest(
    [Values(1,2,3)] int x,
    [Values("A","B")] string s)
{
    ...
}
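Since the question uses Python, here is a sketch of a comparable approach with the standard unittest module: itertools.product enumerates the combinations and subTest reports each failing combination separately, including its parameters. The function label and the helper run_label_tests are placeholders invented for the example.

```python
import itertools
import unittest


def label(x, s):
    """Hypothetical function under test."""
    return f"{s}{x}"


class LabelTest(unittest.TestCase):
    def test_all_combinations(self):
        for x, s in itertools.product([1, 2, 3], ["A", "B"]):
            # A failure inside subTest is reported with x and s in the message,
            # and the remaining combinations still run.
            with self.subTest(x=x, s=s):
                result = label(x, s)
                self.assertTrue(result.startswith(s))
                self.assertTrue(result.endswith(str(x)))


def run_label_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LabelTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

This keeps a single test method while still telling you which dataset failed, without passing a message to every assert.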
I also think it's a bad idea.
Mind you, not throwing random data at your code, but having unit tests doing that. It all boils down to why you unit test in the first place. The answer is "to drive the design of the code". Random data doesn't drive the design of the code, because it depends on a very rigid public interface. Mind you, you can find bugs with it, but that's not what unit tests are about. And let me note that I'm talking about unit tests, and not tests in general.
That being said, I strongly suggest taking a look at QuickCheck. It's Haskell, so it's a bit dodgy on presentation and a bit PhD-ish on documentation, but you should be able to figure it out. I'm going to summarize how it works, though.
After you pick the code you want to test (let's say the sort() function), you establish invariants which should hold. In this example, you can have the following invariants if result = sort(input):
Every element in result should be smaller than or equal to the next one.
Every element in input should be present in result the same number of times.
result and input should have the same length (this repeats the previous one, but let's have it for illustration).
You encode each invariant in a simple function that takes the input and the result and checks whether that invariant holds.
Then, you tell QuickCheck how to generate input. Since this is Haskell and the type system kicks ass, it can see that the function takes a list of integers and it knows how to generate those. It basically generates random lists of random integers and random length. Of course, it can be more fine-grained if you have a more complex data type (for example, only positive integers, only squares, etc.).
Finally, when you have those two, you just run QuickCheck. It generates all that stuff randomly and checks the invariants. If some fail, it will show you exactly which ones. It would also tell you the random seed, so you can rerun this exact failure if you need to. And as an extra bonus, whenever it gets a failed invariant, it will try to reduce the input to the smallest possible subset that fails the invariant (if you think of a tree structure, it will reduce it to the smallest subtree that fails the invariant).
And there you have it. In my opinion, this is how you should go about testing stuff with random data. It's definitely not unit tests and I even think you should run it differently (say, have CI run it every now and then, as opposed to running it on every change (since it will quickly get slow)). And let me repeat, it's a different benefit from unit testing - QuickCheck finds bugs, while unit testing drives design.
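The workflow just described can be imitated in a few lines of Python. This is only a toy sketch, not how QuickCheck itself is implemented: the generator, the greedy shrinker and the intentionally buggy sort are all invented for the illustration.

```python
import random
from collections import Counter


def is_ordered(result):
    return all(a <= b for a, b in zip(result, result[1:]))


def same_elements(inp, result):
    return Counter(inp) == Counter(result)


def sort_invariants_hold(sort, inp):
    result = sort(list(inp))
    return is_ordered(result) and same_elements(inp, result)


def shrink(sort, failing):
    """Greedy shrinking: keep dropping single elements while the input still fails."""
    current = list(failing)
    shrunk = True
    while shrunk:
        shrunk = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if not sort_invariants_hold(sort, candidate):
                current, shrunk = candidate, True
                break
    return current


def quick_check(sort, seed, trials=100):
    """Return None if all trials pass, else (seed, shrunk counterexample)."""
    rng = random.Random(seed)
    for _ in range(trials):
        inp = [rng.randint(-20, 20) for _ in range(rng.randint(0, 12))]
        if not sort_invariants_hold(sort, inp):
            return seed, shrink(sort, inp)
    return None


def buggy_sort(xs):
    """Deliberately wrong: going through a set silently drops duplicates."""
    return sorted(set(xs))
```

Calling quick_check(buggy_sort, seed=1) returns the seed together with a shrunk counterexample, and shrink(buggy_sort, [3, 1, 4, 1, 5]) reduces a known failing input to a two-element list containing only the duplicate.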
Usually the unit test frameworks support 'informative failures' as long as you pick the right assertion method.
However, if everything else doesn't work, you could easily trace the dataset to the console/output file. Low tech but should work.
[TestCaseSource("GetDatasets")]
public Test.. (Dataset d)
{
    Console.WriteLine(PrettyPrintDataset(d));
    // proceed with checks
    Console.WriteLine("Worked!");
}
In quickcheck for R we tried to solve this problem as follows
the tests are actually pseudo-random (the seed is fixed) so you can always reproduce your tests results (barring external factors, of course)
the test function returns enough data to reproduce the error, including the assertion that failed and the data that made it fail. A convenience function, repro, called on the return value of test will land you in the debugger at the beginning of the failing assertion, with arguments set to the witnesses of the failure. If the tests are executed in batch mode, equivalent information is stored in a file and the command to retrieve it is printed in stderr. Then you can call repro as before. Whether or not you program in R, I would love to know if this starts to address your requirements. Some aspects of this solution may be hard to implement in languages that are less dynamic or don't have first class functions.

how do I avoid re-implementing the code being tested when I write tests?

(I'm using rspec in RoR, but I believe this question is relevant to any testing system.)
I often find myself doing this kind of thing in my test suite:
actual_value = object_being_tested.tested_method(args)
expected_value = compute_expected_value(args)
assert(actual_value == expected_value, ...)
The problem is that my implementation of compute_expected_value() often ends up mimicking object_being_tested.tested_method(), so it's really not a good test because they may have identical bugs.
This is a rather open-ended question, but what techniques do people use to avoid this trap? (Points awarded for pointers to good treatises on the topic...)
Usually (for manually written unit tests) you would not compute the expected value. Rather, you would just assert against what you expect to be the result from the tested method for the given args. That is, you would have something like this:
actual_value = object_being_tested.tested_method(args)
expected_value = what_you_expect_to_be_the_result
assert(actual_value == expected_value, ...)
In other testing scenarios where the arguments (or even test methods being executed) are generated automatically, you need to devise a simple oracle which will give you the expected result (or an invariant that should hold for the expected result). Tools like Pex, Randoop, ASTGen, and UDITA enable such testing.
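One common shape for such an oracle is a slow but obviously correct implementation that the optimized code is compared against on generated inputs. The sketch below is in Python rather than Ruby and uses the maximum-subarray problem purely as a stand-in; it is not tied to any of the tools named above.

```python
import random


def max_subarray_fast(xs):
    """Code under test: Kadane's algorithm, O(n). The empty subarray (sum 0) is allowed."""
    best = current = 0
    for x in xs:
        current = max(0, current + x)
        best = max(best, current)
    return best


def max_subarray_oracle(xs):
    """Oracle: try every contiguous slice. Slow, but hard to get wrong."""
    best = 0
    for i in range(len(xs)):
        for j in range(i, len(xs) + 1):
            best = max(best, sum(xs[i:j]))
    return best


def find_disagreement(trials=300, seed=3):
    """Return the first generated input on which the two implementations differ, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(0, 15))]
        if max_subarray_fast(xs) != max_subarray_oracle(xs):
            return xs
    return None
```

Because the oracle is written in a completely different (brute-force) style, it is unlikely to share a bug with the fast version, which is what makes it more useful than re-implementing the same logic in the test.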
Well here are my two cents
a) If the calculation of the expected value is simple and does not encode any business rules or conditions beyond the specific test case it produces the expected result for, then it should be good enough... remember that your actual code will be as generic as possible.
There are cases where you will run into issues in the expected-value method, but you can easily pinpoint the cause of the failure and fix it.
b) There are cases where the expected value cannot be easily calculated. In that case, keep flat files with the results, or use some kind of constant expected value, which is what you would naturally want anyway.
There are also tests where you just want to verify whether a particular method was called or not, and then you are done testing that unit. Remember to use all these different paradigms while testing, and always remember: KEEP IT SIMPLE.
You would not do that.
You do not compute the expected value; you know it already. It should be a constant value defined in your test (or be constructed from other functions that have already been tested).

Automated testing feels a lot like duplicating the tested logic, am I doing it right?

I'm implementing automated testing with CppUTest in C++.
I realize I end up almost copying and pasting the logic to be tested on the tests themselves, so I can check the expected outcomes.
Am I doing it right? should it be otherwise?
edit: I'll try to explain better:
The unit being tested takes input A, makes some processing and returns output B
So apart from making some black box checks, like checking that the output lies in an expected range, I would also like to see if the output B that I got is the right outcome for input A, i.e. if the logic is working as expected.
So for example if the unit just makes A times 2 to yield B, then in the test I have no other way of checking than making again the calculation of A times 2 to check against B to be sure it went alright.
That's the duplication I'm talking about.
// Actual function being tested:
int times2( int a )
{
    return a * 2;
}
.
// Test:
int test_a = 4; // any concrete input value; it was left uninitialized before
int expected_b = test_a * 2; // here I'm duplicating times2()'s logic
int actual_b = times2( test_a );
CHECK( actual_b == expected_b );
.
PS: I think I will reformulate this in another question with my actual source code.
If your goal is to build automated tests for your existing code, you're probably doing it wrong. Hopefully you know what the result of frobozz.Gonkulate() should be for various inputs and can write tests to check that Gonkulate() is returning the right thing. If you have to copy Gonkulate()'s convoluted logic to figure out the answer, you might want to ask yourself how well you understand the logic to begin with.
If you're trying to do test-driven development, you're definitely doing it wrong. TDD consists of many quick cycles of:
1. Writing a test
2. Watching it fail
3. Making it pass
4. Refactoring as necessary to improve the overall design
Step 1 - writing the test first - is an essential part of TDD. I infer from your question that you're writing the code first and the tests later.
So for example if the unit just makes A times 2 to yield B, then in
the test I have no other way of checking than making again the
calculation of A times 2 to check against B to be sure it went
alright.
Yes you do! You know how to calculate A times two, so you don't need to do this in code. If A is 4 then you know the answer is 8. So you can just use it as the expected value.
CHECK( actual_b == 8 )
If you are worried about magic numbers, don't be. Nobody will be confused about the meaning of the hard-coded numbers in the following line:
CHECK( times2(4) == 8 )
If you don't know what the result should be then your unit test is useless. If you need to calculate the expected result, then you are either using the same logic as the function, or using an alternate algorithm to work out the result. In the first case, if the logic that you duplicate is incorrect, your test will still pass! In the second case, you are introducing another place for a bug to occur. If a test fails, you will need to work out whether it failed because the function under test has a bug, or because your test method has a bug.
I think this one is a tough nut to crack because it's essentially a mentality shift. It was somewhat hard for me.
The thing about tests is to have your expectations nailed down and to check whether your code really does what you think it does. Think of ways of exercising it, not checking its logic so directly, but as a whole. If that's too hard, maybe your function/method just does too much.
Try to think of your tests as working examples of what your code can do, not as a mathematical proof.
The programming language shouldn't matter.
var ANY_NUMBER = 4;
Assert.That(times_2(ANY_NUMBER), Is.EqualTo(ANY_NUMBER*2));
In this case, I wouldn't mind duplicating the logic. First, the expected value is more readable than a bare 8. Second, this logic doesn't look like a change-magnet; it is relatively static.
For cases where the logic is more involved (chunky) and prone to change, duplicating the logic in the test is definitely not recommended. Duplication is evil. Any change to the logic would ripple changes to the test. In that case, I'd use hardcoded input-expected output pairs with some readable pair-names.
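A minimal sketch of what "hardcoded input-expected output pairs with readable names" might look like, written in Python for brevity (the thread itself uses C++ and C#). The case names and the helper failing_cases are invented for the example.

```python
def times2(a):
    return a * 2


# Hard-coded (name, input, expected) triples. The expected values are written
# down by hand, not computed from the input.
CASES = [
    ("zero stays zero", 0, 0),
    ("small positive", 4, 8),
    ("negative", -3, -6),
    ("large value", 1_000_000, 2_000_000),
]


def failing_cases(func, cases):
    """Return the names of all cases whose actual output differs from the expected one."""
    return [name for name, given, expected in cases if func(given) != expected]


failures = failing_cases(times2, CASES)
```

If the logic of times2 changes, only the table needs to be reviewed, and a failure report names the case that broke instead of pointing at a recomputed value.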