Number of test-cases for a boolean function

Number of test-cases for a boolean function - unit-testing

I'm confused about the number of test cases used for a boolean function. Say I'm writing a function to check whether the sale price of something is over $60 dollars.
function checkSalePrice(price) {
return (price > 60)
}
In my Advance Placement course, they ask the minimum # of test include boundary values. So in this case, the an example set of tests are [30, 60, 90]. This course I'm taking says to only test two values, lower and higher, eg (30, 90)
Which is correct? (I know this is pondering the depth of a cup of water, but I'd like to get a few more samples as I'm new to programming)

Kent Beck wrote
I get paid for code that works, not for tests, so my philosophy is to test as little as possible to reach a given level of confidence (I suspect this level of confidence is high compared to industry standards, but that could just be hubris). If I don't typically make a kind of mistake (like setting the wrong variables in a constructor), I don't test for it. I do tend to make sense of test errors, so I'm extra careful when I have logic with complicated conditionals. When coding on a team, I modify my strategy to carefully test code that we, collectively, tend to get wrong.
Me? I make fence post errors. So I would absolutely want to be sure that my test suite would catch the following incorrect implementation of checkSalePrice
function checkSalePrice(price) {
return (price >= 60)
}
If I were writing checkSalePrice using test-driven-development, then I would want to calibrate my tests by ensuring that they fail before I make them pass. Since in my programming environment a trivial boolean function returns false, my flow would look like
assert checkSalePrice(61)
This would fail, because the method by default returns false. Then I would implement
function checkSalePrice(price) {
return true
}
Now my first check passes, so I know what this boundary case is correctly covered. I would then add a new check
assert ! checkSalePrice(60)
which would fail. Providing the corrected implementation would pass the check, and now I can confidently refactor the method as necessary.
Adding a third check here for an arbitrary value isn't going to provide additional safety when changing the code, nor is it going to make the life of the next maintainer any easier, so I would settle for two cases here.
Note that the heuristic I'm using is not related to the complexity of the returned value, but the complexity of the method
Complexity of the predicate might include covering various problems reading the input. For instance, if we were passing a collection, what cases do we want to make sure are covered? J. B. Rainsberger suggested the following mnemonic
zero
one
many
lots
oops
Bruce Dawson points out that there are only 4 billion floats, so maybe you should [test them all].
Do note, though, that those extra 4 billion minus two checks aren't adding a lot of design value, so we've probably crossed from TDD into a different realm.

You stumbled into on of the big problems with testing in general - how many tests are good enough?!
There are basically three ways to look at this:
black box testing: you do not care about the internals of your MuT (method under test). You only focus on the contract of the method. In your case: should return return true when price > 60. When you think about this for while, you would find tests 30 and 90 ... and maybe 60 as well. It is always good practice to test corner cases. So the answer would be: 3
white box testing: you do coverage measurements of your tests - and you strive for example to hit all paths at least once. In this case, you could go with 30 and 90 - which would be resulting in 100% coverage: So the answer here: 2
randomized testing, as guided by QuickCheck. This approach is very much different: you don't specify test cases at all. Instead you step back and identify rules that should hold true about your MuT. And then the framework creates random input and invokes your MuT using that - trying to find examples where the aforementioned rules break.
In your case, such a rule could be that: when checkSalePrice(a) and checkSalePrice(b) then checkSalePrice(a+b). This approach feels unusual first, but as soon as start exploring its possibilities, you can find very interesting things in it. Especially when you understand that your code can provide the required "creator" functions to the framework. That allows you to use this approach to even test much more complicated, "object oriented" stuff. It is just great to watch the framework find a flaw - and to then realize that the framework will even find the "minimum" example data required to break a rule that you specified.

Related

What to consider when looking at multiple methods to achieve the same result while coding?

I am currently coding in C++, creating an all rounded calculator that, when finished, will be capable of handling all major and common mathematical procedures.
The current wall I am hitting is from the fact I am still learning about to profession we call being a programmer.
I have several ways of achieving a single result. I am curious as to whether I should pick the method that has a clear breakdown of how it got to that point in the code; or the method that is much shorter - while not sacrificing any of the redability.
Below I have posted snippets from my class showing what I mean.
This function uses if statements to determine whether or not a common denominator is even needed, but is several lines long.
Fraction Fraction::addFraction(Fraction &AddInput)
{
Fraction output;
if (m_denominator != AddInput.m_denominator)
{
getCommonDenominator(AddInput);
output.setWhole(m_whole + AddInput.m_whole);
output.setNumerator((m_numerator * firstchange) + (AddInput.m_numerator * secondchange));
output.setDenominator(commondenominator);
}
else
{
output.setWhole(m_whole + AddInput.m_whole);
output.setNumerator(m_numerator + AddInput.m_numerator);
output.setDenominator(m_denominator);
}
output.simplify();
return output;
}
This function below, gets a common denominator; repeats the steps on the numerators; then simplifies to the lowest terms.
Fraction Fraction::addFraction(Fraction &AddInput)
{
getCommonDenominator(AddInput);
Fraction output(m_whole + AddInput.m_whole, (m_numerator * firstchange) + (AddInput.m_numerator * secondchange), commondenominator);
output.simplify();
return output;
}
Both functions have been tested and always return the accurate result. When it comes to coding standards... do we pick longer and asy to follow? or shorter and easy to understand?

Your first priority with your code should be that it's correct.
Your second priority with code should be "If someone who's never seen this before is going to make a tiny change, which one is he less likely to break?
There's actually a lot that goes into this. How difficult is it to understand at a high level? How abstracted out are arcane details? Are there any surprises? What quirks do you have to know about? Are there edge cases that have to be handled?
The reasons that this second priority is important are:
it's key to preventing you from writing bugs in the first place
it's easier to find bugs later
it's easier to fix bugs later
despite whatever you think, you won't remember the details in 6 months.
Both implementations appear about equally difficult in complexity per branch, but the first one has branches, so I'd lean toward the second for understandability. Details seem abstracted out in both, and if there's surprises or quirks, I don't immediately see them (but that's sort of the point, that they can be easily overlooked). I don't see any special handling for edge cases, so if edge cases exist in either, comments would be good.
Unrelated to picking, but while on the topic of reviewing code, It's unclear how either handles fractions that have no fractional part, but that might be part of the full class documentation, which would be fine. Both codepaths take AddArgument by mutable reference, which is bad, and require this to be mutable as well, which is also bad. Both have methods named get*() that appear to modify (getCommonDenominator), which is bad. The code appears to be using variables that are external (firstchange? secondchange?) which is a major strike against preventing bugs.

Use of Literals, yay/nay in C++

I've recently heard that in some cases, programmers believe that you should never use literals in your code. I understand that in some cases, assigning a variable name to a given number can be helpful (especially in terms of maintenance if that number is used elsewhere). However, consider the following case studies:
Case Study 1: Use of Literals for "special" byte codes.
Say you have an if statement that checks for a specific value stored in (for the sake of argument) a uint16_t. Here are the two code samples:
Version 1:
// Descriptive comment as to why I'm using 0xBEEF goes here
if (my_var == 0xBEEF) {
//do something
}
Version 2:
const uint16_t kSuperDescriptiveVarName = 0xBEEF;
if (my_var == kSuperDescriptiveVarName) {
// do something
}
Which is the "preferred" method in terms of good coding practice? I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once. Also, does the compiler do any optimizations to make both versions effectively the same executable code? That is, are there any performance implications here?
Case Study 2: Use of sizeof
I fully understand that using sizeof versus a raw literal is preferred for portability and also readability concerns. Take the two code examples into account. The scenario is that you are computing the offset into a packet buffer (an array of uint8_t) where the first part of the packet is stored as my_packet_header, which let's say is a uint32_t.
Version 1:
const int offset = sizeof(my_packet_header);
Version 2:
const int offset = 4; // good comment telling reader where 4 came from
Clearly, version 1 is preferred, but what about for cases where you have multiple data fields to skip over? What if you have the following instead:
Version 1:
const int offset = sizeof(my_packet_header) + sizeof(data_field1) + sizeof(data_field2) + ... + sizeof(data_fieldn);
Version 2:
const int offset = 47;
Which is preferred in this case? Does is still make sense to show all the steps involved with computing the offset or does the literal usage make sense here?
Thanks for the help in advance as I attempt to better my code practices.

Which is the "preferred" method in terms of good coding practice? I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once.
Sounds like you understand the main point... factoring values (and their comments) that are used in multiple places. Further, it can sometimes help to have a group of constants in one place - so their values can be inspected, verified, modified etc. without concern for where they're used in the code. Other times, there are many constants used in proximity and the comments needed to properly explain them would obfuscate the code in which they're used.
Countering that, having a const variable means all the programmers studying the code will be wondering whether it's used anywhere else, keeping it in mind as they inspect the rest of the scope in which it's declared etc. - the less unnecessary things to remember the surer the understanding of important parts of the code will be.
Like so many things in programming, it's "an art" balancing the pros and cons of each approach, and best guided by experience and knowledge of the way the code's likely to be studied, maintained, and evolved.
Also, does the compiler do any optimizations to make both versions effectively the same executable code? That is, are there any performance implications here?
There's no performance implications in optimised code.
I fully understand that using sizeof versus a raw literal is preferred for portability and also readability concerns.
And other reasons too. A big factor in good programming is reducing the points of maintenance when changes are done. If you can modify the type of a variable and know that all the places using that variable will adjust accordingly, that's great - saves time and potential errors. Using sizeof helps with that.
Which is preferred [for calculating offsets in a struct]? Does is still make sense to show all the steps involved with computing the offset or does the literal usage make sense here?
The offsetof macro (#include <cstddef>) is better for this... again reducing maintenance burden. With the this + that approach you illustrate, if the compiler decides to use any padding your offset will be wrong, and further you have to fix it every time you add or remove a field.
Ignoring the offsetof issues and just considering your this + that example as an illustration of a more complex value to assign, again it's a balancing act. You'd definitely want some explanation/comment/documentation re intent here (are you working out the binary size of earlier fields? calculating the offset of the next field?, deliberately missing some fields that might not be needed for the intended use or was that accidental?...). Still, a named constant might be enough documentation, so it's likely unimportant which way you lean....

In every example you list, I would go with the name.
In your first example, you almost certainly used that special 0xBEEF number at least twice - once to write it and once to do your comparison. If you didn't write it, that number is still part of a contract with someone else (perhaps a file format definition).
In the last example, it is especially useful to show the computation that yielded the value. That way, if you encounter trouble down the line, you can easily see either that the number is trustworthy, or what you missed and fix it.
There are some cases where I prefer literals over named constants though. These are always cases where a name is no more meaningful than the number. For example, you have a game program that plays a dice game (perhaps Yahtzee), where there are specific rules for specific die rolls. You could define constants for One = 1, Two = 2, etc. But why bother?

Generally it is better to use a name instead of a value. After all, if you need to change it later, you can find it more easily. Also it is not always clear why this particular number is used, when you read the code, so having a meaningful name assigned to it, makes this immediately clear to a programmer.
Performance-wise there is no difference, because the optimizers should take care of it. And it is rather unlikely, even if there would be an extra instruction generated, that this would cause you troubles. If your code would be that tight, you probably shouldn't rely on an optimizer effect anyway.

I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once.
I think kSuperDescriptiveVarName will definitely be used more than once. One for check and at least one for assignment, maybe in different part of your program.
There will be no difference in performance, since an optimization called Constant Propagation exists in almost all compilers. Just enable optimization for your compiler.

Testing a function: what more should be tested?

I am writing a function that takes three integer inputs and based on a relation between the three, it returns a value or error. To test this, I have written some test cases which include testing illegal values, boundary conditions for integers including overflows and some positive tests too. I am wondering what else should be tested for this simple function?
Can testing on different platforms make sense as a test case for such a small function?
Also, testing execution times is another thing that I wanted to add as a test case.
Can doing static and dynamic analysis be a part of the test cases?
Anything else that should be tested?
int foo(int a, int b, int c) {
return a value based on a, b, and c.
}

The way you ask your question it seems you are doing a black box test, i.e. you only know about the relation between input and output, and not about the implementation. In that case your test case should depend on what you know about the relation, and I think you have tested these things (you didn't give us details on the relation).
From that it doesn't look as if you need to test for platform independence, but if you have an automated test suite, it is for sure not a bad idea to test it on different platforms.
Now if you have the code available, you could go for white box tests. Typically you would do this by looking at your code structure first, i.e. you could try to have 100% branching coverage, i.e. every branch in your code is at least run once during the tests. In that way, static and dynamic analysis could help you to find different coverage measures.
I wouldn't go for a platform independency test if there is no platform dependent code in your function.

sizeof(int) must be tested for the particular compiler. Although this seems trivial and C standard specifies the size for an int, its always better to know if the compiler being used is a 16 bit standard-noncomformant compiler. Just another test case.

What type of errors could my code still contain even if I have 100% code coverage?

What type of errors could my code still contain even if I have 100% code coverage? I'm looking for concrete examples or links to concrete examples of such errors.

Having 100% code coverage is not that great as one may think of it. Consider a trival example:
double Foo(double a, double b)
{
return a / b;
}
Even a single unit test will raise code coverage of this method to 100%, but the said unit test will not tell us what code is working and what code is not. This might be a perfectly valid code, but without testing edge conditions (such as when b is 0.0) unit test is inconclusive at best.
Code coverage only tells us what was executed by our unit tests, not whether it was executed correctly. This is an important distinction to make. Just because a line of code is executed by a unit test, does not necessarily mean that that line of code is working as intended.
Listen to this for an interesting discussion.

Code coverage doesn't mean that your code is bug free in any way. It's an estimate on how well you're test cases cover your source code base. 100% code coverage would imply that every line of code is tested but every state of your program certainly is not. There's research being done in this area, I think it's referred to as finite state modeling but it's really a brute force way of trying to explore every state of a program.
A more elegant way of doing the same thing is something referred to as abstract interpretation. MSR (Microsoft Research) have released something called CodeContracts based on abstract interpretation. Check out Pex as well, they really emphasis cleaver methods of testing application run-time behavior.
I could write a really good test which would give me good coverage, but there's no guarantees that that test will explore all the states that my program might have. This is the problem of writing really good tests, which is hard.
Code coverage does not imply good tests

Uh? Any kind of ordinary logic bug, I guess? Memory corruption, buffer overrun, plain old wrong code, assignment-instead-of-test, the list goes on. Coverage is only that, it lets you know that all code paths are executed, not that they are correct.

As I haven't seen it mentioned yet, I'd like to add this thread that code coverage does not tell you what part of your code is bugfree.
It only tells you what parts of your code is guaranteed to be untested.

1. "Data space" problems
Your (bad) code:
void f(int n, int increment)
{
while(n < 500)
{
cout << n;
n += increment;
}
}
Your test:
f(200,100);
Bug in real world use:
f(200,0);
My point: Your test may cover 100% of the lines of your code but it will not (typically) cover all your possible input data space, i.e. the set of all possible values of inputs.
2. Testing against your own mistake
Another classical example is when you just take a bad decision in design, and test your code against your own bad decision.
E.g. The specs document says "print all prime numbers up to n" and you print all prime numbers up to n but excluding n. And your unit tests test your wrong idea.
3. Undefined behaviour
Use the value of uninitialized variables, cause an invalid memory access, etc. and your code has undefined behaviour (in C++ or any other language that contemplates "undefined behaviour"). Sometimes it will pass your tests, but it will crash in the real world.
...

There can always be runtime exceptions: memory filling up, database or other connections not being closed etc...

Consider the following code:
int add(int a, int b)
{
return a + b;
}
This code could fail to implement some necessary functionality (i.e. not meet an end-user requirements): "100% coverage" doesn't necessarily test/detect functionality which ought to be implemented but which isn't.
This code could work for some but not all input data ranges (e.g. when a and b are both very large).

Code coverage doesn't mean anything, if your tests contain bugs, or you are testing the wrong thing.
As a related tangent; I'd like to remind to you that I can trivially construct an O(1) method that satisfies the following pseudo-code test:
sorted = sort(2,1,6,4,3,1,6,2);
for element in sorted {
if (is_defined(previousElement)) {
assert(element >= previousElement);
}
previousElement = element;
}
bonus karma to Jon Skeet, who pointed out the loophole I was thinking about

Code coverage usually only tells you how many of the branches within a function are covered. It doesn't usually report the various paths that could be taken between function calls. Many errors in programs happen because the handoff from one method to another is wrong, not because the methods themselves contain errors. All bugs of this form could still exist in 100% code coverage.

In a recent IEEE Software paper "Two Mistakes and Error-Free Software: A Confession", Robert Glass argued that in the "real world" there are more bugs caused by what he calls missing logic or combinatorics (which can't be guarded against with code coverage tools) than by logic errors (which can).
In other words, even with 100% code coverage you still run the risk of encountering these kinds of errors. And the best thing you can do is--you guessed it--do more code reviews.
The reference to the paper is here and I found a rough summary here.

works on my machine
Many things work well on local machine and we cannot assure that to work on Staging/Production. Code Coverage may not cover this.

Errors in tests :)

Well if your tests don't test the thing which happens in the code covered.
If you have this method which adds a number to properties for example:
public void AddTo(int i)
{
NumberA += i;
NumberB -= i;
}
If your test only checks the NumberA property, but not NumberB, then you will have 100% coverage, the test passes, but NumberB will still contain an error.
Conclusion: a unit test with 100% will not guarantee that the code is bug-free.

Argument validation, aka. Null Checks. If you take any external inputs and pass them into functions but never check if they are valid/null, then you can achieve 100% coverage, but you will still get a NullReferenceException if you somehow pass null into the function because that's what your database gives you.
also, arithmetic overflow, like
int result = int.MAXVALUE + int.MAXVALUE;
Code Coverage only covers existing code, it will not be able to point out where you should add more code.

I don't know about anyone else, but we don't get anywhere near 100% coverage. None of our "This should never happen" CATCHes get exercised in our tests (well, sometimes they do, but then the code gets fixed so they don't any more!). I'm afraid I don't worry that there might be a Syntax/Logic error in a never-happen-CATCH

Your product might be technically correct, but not fulfil the needs of the customer.

FYI, Microsoft Pex attempts to help out by exploring your code and finding "edge" cases, like divide by zero, overflow, etc.
This tool is part of VS2010, though you can install a tech preview version in VS2008. It's pretty remarkable that the tool finds the stuff it finds, though, IME, it's still not going to get you all the way to "bulletproof".

Code Coverage doesn't mean much. What matters is whether all (or most) of the argument values that affect the behavior are covered.
For eg consider a typical compareTo method (in java, but applies in most languages):
//Return Negative, 0 or positive depending on x is <, = or > y
int compareTo(int x, int y) {
return x-y;
}
As long as you have a test for compareTo(0,0), you get code coverage. However, you need at least 3 testcases here (for the 3 outcomes). Still it is not bug free. It also pays to add tests to cover exceptional/error conditions. In the above case, If you try compareTo(10, Integer.MAX_INT), it is going to fail.
Bottomline: Try to partition your input to disjoint sets based on behavior, have a test for at least one input from each set. This will add more coverage in true sense.
Also check for tools like QuickCheck (If available for your language).

Perform a 100% code coverage, i.e., 100% instructions, 100% input and output domains, 100% paths, 100% whatever you think of, and you still may have bugs in your code: missing features.

As mentioned in many of the answers here, you could have 100% code coverage and still have bugs.
On top of that, you can have 0 bugs but the logic in your code may be incorrect (not matching the requirements). Code coverage, or being 100% bug-free can't help you with that at all.
A typical corporate software development practice could be as follows:
Have a clearly written functional specification
Have a test plan that's written against (1) and have it peer reviewed
Have test cases written against (2) and have them peer reviewed
Write code against the functional specification and have it peer reviewed
Test your code against the test cases
Do code coverage analysis and write more test cases to achieve good coverage.
Note that I said "good" and not "100%". 100% coverage may not always be feasible to achieve -- in which case your energy is best spent on achieving correctness of code, rather than the coverage of some obscure branches. Different sorts of things can go wrong in any of the steps 1 through 5 above: wrong idea, wrong specification, wrong tests, wrong code, wrong test execution... The bottom line is, step 6 alone is not the most important step in the process.
Concrete example of wrong code that doesn't have any bugs and has 100% coverage:
/**
* Returns the duration in milliseconds
*/
int getDuration() {
return end - start;
}
// test:
start = 0;
end = 1;
assertEquals(1, getDuration()); // yay!
// but the requirement was:
// The interface should have a method for returning the duration in *seconds*.

Almost anything.
Have you read Code Complete? (Because StackOverflow says you really should.) In Chapter 22 it says "100% statement coverage is a good start, but it is hardly sufficient". The rest of the chapter explains how to determine which additional tests to add. Here's a brief taster.
Structured basis testing and data flow testing means testing each logic path through the program. There are four paths through the contrived code below, depending on the values of A and B. 100% statement coverage could be achieved by testing only two of the four paths, perhaps f=1:g=1/f and f=0:g=f+1. But f=0:g=1/f will give a divide by zero error. You have to consider if statements and while and for loops (the loop body might never be executed) and every branch of a select or switch statement.
If A Then
f = 1
Else
f = 0
End If
If B Then
g = f + 1
Else
g = f / 0
End If
Error guessing - informed guesses about types of input that often cause errors. For instance boundary conditions (off by one errors), invalid data, very large values, very small values, zeros, nulls, empty collections.
And even so there can be errors in your requirements, errors in your tests, etc - as others have mentioned.

Unit testing a method that can have random behaviour

I ran across this situation this afternoon, so I thought I'd ask what you guys do.
We have a randomized password generator for user password resets and while fixing a problem with it, I decided to move the routine into my (slowly growing) test harness.
I want to test that passwords generated conform to the rules we've set out, but of course the results of the function will be randomized (or, well, pseudo-randomized).
What would you guys do in the unit test? Generate a bunch of passwords, check they all pass and consider that good enough?

A unit test should do the same thing every time that it runs, otherwise you may run into a situation where the unit test only fails occasionally, and that could be a real pain to debug.
Try seeding your pseudo-randomizer with the same seed every time (in the test, that is--not in production code). That way your test will generate the same set of inputs every time.
If you can't control the seed and there is no way to prevent the function you are testing from being randomized, then I guess you are stuck with an unpredictable unit test. :(

The function is a hypothesis that for all inputs, the output conforms to the specifications. The unit test is an attempt to falsify that hypothesis. So yes, the best you can do in this case is to generate a large amount of outputs. If they all pass your specification, then you can be reasonably sure that your function works as specified.
Consider putting the random number generator outside this function and passing a random number to it, making the function deterministic, instead of having it access the random number generator directly. This way, you can generate a large number of random inputs in your test harness, pass them all to your function, and test the outputs. If one fails, record what that value is so that you have a documented test case.

In addition to testing a few to make sure that they pass, I'd write a test to make sure that passwords that break the rules fail.
Is there anything in the codebase that's checking the passwords generated to make sure they're random enough? If not, I may look at creating the logic to check the generated passwords, testing that, and then you can state that the random password generator is working (as "bad" ones won't get out).
Once you've got that logic you can probably write an integration type test that would generate boatloads of passwords and pass it through the logic, at which point you'd get an idea of how "good" your random password generate is.

Well, considering they are random, there is no really way to make sure, but testing for 100 000 password should clear most doubts :)

You could seed your random number generator with a constant value in order to get non-random results and test those results.

I'm assuming that the user-entered passwords conform to the same restrictions as the random generated ones. So you probably want to have a set of static passwords for checking known conditions, and then you'll have a loop that does the dynamic password checks. The size of the loop isn't too important, but it should be large enough that you get that warm fuzzy feeling from your generator, but not so large that your tests take forever to run. If anything crops up over time, you can add those cases to your static list.
In the long run though, a weak password isn't going to break your program, and password security falls in the hands of the user. So your priority would be to make sure that the dynamic generation and strength-check doesn't break the system.

Without knowing what your rules are it's hard to say for sure, but assuming they are something like "the password must be at least 8 characters with at least one upper case letter, one lower case letter, one number and one special character" then it's impossible even with brute force to check sufficient quantities of generated passwords to prove the algorithm is correct (as that would require somewhere over 8^70 = 1.63x10^63 checks depending on how many special characters you designate for use, which would take a very, very long time to complete).
Ultimately all you can do is test as many passwords as is feasible, and if any break the rules then you know the algorithm is incorrect. Probably the best thing to do is leave it running overnight, and if all is well in the morning you're likely to be OK.
If you want to be doubly sure in production, then implement an outer function that calls the password generation function in a loop and checks it against the rules. If it fails then log an error indicating this (so you know you need to fix it) and generate another password. Continue until you get one that meets the rules.

You can also look into mutation testing (Jester for Java, Heckle for Ruby)

In my humble opinion you do not want a test that sometimes pass and sometimes fails. Some people may even consider that this kind of test is not a unit test. But the main idea is be sure that the function is OK when you see the green bar.
With this principle in mind you may try to execute it a reasonable number of times so that the chance of having a false correct is almost cero. However, une single failure of the test will force you to make more extensive tests apart from debbuging the failure.

Either use fixed random seed or make it reproducible (i.e.: derive from the current day)

Firstly, use a seed for your PRNG. Your input is no longer random and gets rid of the problem of unpredictable output - i.e. now your unit test is deterministic.
This doesn't however solve the problem of testing the implementation, but here is an example of how a typical method that relies upon randomness can be tested.
Imagine we've implemented a function that takes a collection of red and blue marbles and picks one at random, but a weighting can be assigned to the probability, i.e. weights of 2 and 1 would mean red marbles are twice as likely to be picked as blue marbles.
We can test this by setting the weight of one choice to zero and verifying that in all cases (in practice, for a large amount of test input) we always get e.g. blue marbles. Reversing the weights should then give the opposite result (all red marbles).
This doesn't guarantee our function is behaving as intended (if we pass in an equal number of red and blue marbles and have equal weights do we always get a 50/50 distribution over a large number of trials?) but in practice it is often sufficient.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js