What can be alternative metrics to code coverage? - unit-testing

Code coverage is probably the most controversial code metric. Some say you have to reach 80% code coverage; others say it's superficial and does not say anything about your testing quality. (See Jon Limjap's good answer on "What is a reasonable code coverage % for unit tests (and why)?".)
People tend to measure everything. They need comparisons, benchmarks, etc.
Project teams need some indication of how good their testing is.
So what are alternatives to code coverage? What would be a good metric that says more than "I touched this line of code"?
Are there real alternatives?

If you are looking for some useful metrics that tell you about the quality (or lack thereof) of your code, you should look into the following metrics:
Cyclomatic Complexity
This is a measure of how complex a method is.
Usually 10 and lower is good, 11-25 is poor, higher is terrible.
Nesting Depth
This is a measure of how many nested scopes are in a method.
Usually 4 and lower is good, 5-8 is poor, higher is terrible.
Relational Cohesion
This is a measure of how well related the types in a package or assembly are.
Relational cohesion is somewhat of a relative metric, but useful nonetheless.
Acceptable levels depend on the formula. Given the following:
R: number of relationships in package/assembly
N: number of types in package/assembly
H: Cohesion of relationship between types
Formula: H = (R+1)/N
Given the above formula, acceptable range is 1.5 - 4.0
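A quick worked example of the formula (the numbers are made up):

# Hypothetical package: 8 types with 20 relationships between them
R, N = 20, 8
H = (R + 1) / N   # 2.625, comfortably inside the acceptable 1.5 - 4.0 range
print(H)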
Lack of Cohesion of Methods (LCOM)
This is a measure of how cohesive a class is.
Cohesion of a class is a measure of how many fields each method references.
Good indication of whether your class meets the Principle of Single Responsibility.
Formula: LCOM = 1 - (sum(MF)/(M*F))
M: number of methods in class
F: number of instance fields in class
MF: number of methods in class accessing a particular instance field
sum(MF): the sum of MF over all instance fields
A class that is totally cohesive will have an LCOM of 0.
A class that is completely non-cohesive will have an LCOM of 1.
The closer you get to 0, the more cohesive, and maintainable, your class is.
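To make the formula concrete, here is a rough Python sketch with a made-up field-access map; the method and field names are purely illustrative:

# Hypothetical map: method name -> instance fields it reads or writes
field_access = {
    "deposit":  {"balance"},
    "withdraw": {"balance"},
    "audit":    {"log"},
}
fields = {"balance", "log"}

M = len(field_access)   # number of methods
F = len(fields)         # number of instance fields
# sum(MF): for each field, count the methods that access it
sum_mf = sum(sum(1 for used in field_access.values() if f in used) for f in fields)
lcom = 1 - (sum_mf / (M * F))   # 1 - (3 / 6) = 0.5 for this example
print(lcom)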
These are just some of the key metrics that NDepend, a .NET metrics and dependency mapping utility, can provide for you. I recently did a lot of work with code metrics, and these 4 metrics are the core ones that we have found to be most useful. NDepend offers several other useful metrics, however, including Efferent & Afferent coupling and Abstractness & Instability, which combined provide a good measure of how maintainable your code will be (and whether or not you're in what NDepend calls the Zone of Pain or the Zone of Uselessness).
Even if you are not working with the .NET platform, I recommend taking a look at the NDepend metrics page. There is a lot of useful information there that you might be able to use to calculate these metrics on whatever platform you develop on.

Crap4j is one fairly good metric that I'm aware of...
It's a Java implementation of the Change Risk Analysis and Predictions software metric, which combines cyclomatic complexity and code coverage from automated tests.
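The commonly cited CRAP formula combines those two inputs; the sketch below is only an illustration of that formula, not the Crap4j source:

def crap_score(complexity, coverage_pct):
    # Commonly cited formula: CRAP(m) = comp(m)^2 * (1 - cov(m))^3 + comp(m)
    cov = coverage_pct / 100.0
    return complexity ** 2 * (1 - cov) ** 3 + complexity

print(crap_score(10, 0))    # 110.0 -> complex and untested: risky
print(crap_score(10, 100))  # 10.0  -> complex but fully covered: acceptable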

Bug metrics are also important:
Number of bugs coming in
Number of bugs resolved
To detect, for instance, whether bugs are not being resolved as fast as new ones come in.

What about watching the trend of code coverage during your project?
As it is the case with many other metrics a single number does not say very much.
For example it is hard to tell whether there is a problem if "we have a Checkstyle rules compliance of 78.765432%". If yesterday's compliance was 100%, we are definitely in trouble. If it was 50% yesterday, we are probably doing a good job.
I always get nervous when code coverage has gotten lower and lower over time. There are cases when this is okay, so you cannot switch off your brain when looking at charts and numbers.
BTW, sonar (http://sonar.codehaus.org/) is a great tool for watching trends.

Using code coverage on its own is mostly pointless; it only gives you insight when you are looking for unnecessary code.
Using it together with unit tests and aiming for 100% coverage will tell you that all the 'tested' parts (assuming they all passed, too) work as specified in the unit tests.
Writing unit tests from a technical design/functional design, having 100% coverage and 100% successful tests will tell you that the program is working as described in the documentation.
Now the only thing you need is good documentation, especially the functional design; a programmer should not write that unless (s)he is an expert in that specific field.

Scenario coverage.
I don't think you really want to have 100% code coverage. Testing, say, simple getters and setters looks like a waste of time.
The code always runs in some context, so you may list as many scenarios as you can (depending on the problem complexity sometimes even all of them) and test them.
Example:
# Parses a line from an .ini configuration file,
# e.g. in the form of name=value1,value2
def parse_config(setting):
    name, values = setting.split('=', 1)
    return values.split(',')
Now, you have many scenarios to test. Some of them:
Passing correct value
Passing null
Passing empty string
Passing an ill-formatted parameter
Passing a string with a leading or trailing comma, e.g. name=value1, or name=,value2
Running just the first test may give you (depending on the code) 100% code coverage. But you haven't considered all the possibilities, so that metric by itself doesn't tell you much.
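For illustration, those scenarios could be written as parameterized tests against the Python sketch above (pytest assumed; the module name and the expected failure modes are my guesses at reasonable behaviour):

import pytest
from myconfig import parse_config   # hypothetical module holding the function above

@pytest.mark.parametrize("setting, expected", [
    ("name=value1,value2", ["value1", "value2"]),   # correct value
    ("name=value1,",       ["value1", ""]),         # trailing comma
    ("name=,value2",       ["", "value2"]),         # leading comma
])
def test_parse_config_valid(setting, expected):
    assert parse_config(setting) == expected

@pytest.mark.parametrize("setting", [None, "", "no-equals-sign"])
def test_parse_config_invalid(setting):
    # Assumed behaviour: malformed input should raise rather than return garbage
    with pytest.raises((ValueError, AttributeError)):
        parse_config(setting)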

Code Coverage is just an indicator and helps pointing out lines which are not executed at all in your tests, which is quite interesting. If you reach 80% code coverage or so, it starts making sense to look at the remaining 20% of lines to identify if you are missing some use case. If you see "aha, this line gets executed if I pass an empty vector" then you can actually write a test which passes an empty vector.
As an alternative I can think of, if you have a specs document with Use Cases and Functional Requirements, you should map the unit tests to them and see how many UC are covered by FR (of course it should be 100%) and how many FR are covered by UT (again, it should be 100%).
If you don't have specs, who cares? Anything that happens will be ok :)

How about (lines of code)/(number of test cases)? Not extremely meaningful (since it depends on LOC), but at least it's easy to calculate.
Another one could be (number of test cases)/(number of methods).

I wrote a blog post about why High Test Coverage Ratio is a Good Thing Anyway.
I agree that: when a portion of code is executed by tests, it doesn’t mean that the validity of the results produced by this portion of code is verified by tests.
But still, if you are heavily using contracts to check states validity during tests execution, high test coverage will mean a lot of verification anyway.

The value in code coverage is it gives you some idea of what has been exercised by tests.
The phrase "code coverage" is often used to mean statement coverage, e.g., "how much of my code (in lines) has been executed", but in fact there are over a hundred varieties of "coverage". These other versions of coverage try to provide a more sophisticated view what it means to exercise code.
For example, condition coverage measures how many of the separate elements of conditional expressions have been exercised. This is different than statement coverage. MC/DC
"modified condition/decision coverage" determines whether the elements of all conditional expressions have all been demonstrated to control the outcome of the conditional, and is required by the FAA for aircraft software. Path coverage meaures how many of the possible execution paths through your code have been exercised. This is a better measure than statement coverage, in that paths essentially represent different cases in the code. Which of these measures is best to use depends on how concerned you are about the effectiveness of your tests.
Wikipedia discusses many variations of test coverage reasonably well.
http://en.wikipedia.org/wiki/Code_coverage
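A tiny illustration of why the variety matters (hypothetical Python function): a single test can reach 100% statement coverage while leaving a whole branch unexercised.

def safe_ratio(a, b):
    result = 0
    if b != 0:
        result = a / b
    return result

# A single call such as safe_ratio(6, 3) executes every statement (100%
# statement coverage), but the b == 0 case, where the if-body is skipped,
# is never exercised. Branch coverage would flag that; statement coverage
# would not.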

As a rule of thumb, defect injection rates proportionally trail code yield and they both typically follow a Rayleigh distribution curve.
At some point your defect detection rate will peak and then start to diminish.
This apex represents 40% of discovered defects.
Moving forward with simple regression analysis you can estimate how many defects remain in your product at any point following the peak.
This is one component of Lawrence Putnam's model.

This hasn't been mentioned, but the amount of change in a given file of code or method (by looking at version control history) is interesting particularly when you're building up a test suite for poorly tested code. Focus your testing on the parts of the code you change a lot. Leave the ones you don't for later.
Watch out for a reversal of cause and effect. You might avoid changing untested code and you might tend to change tested code more.

SQLite is an extremely well-tested library, and you can extract all kinds of metrics from it.
As of version 3.6.14 (all statistics in the report are against that release of SQLite), the SQLite library consists of approximately 63.2 KSLOC of C code. (KSLOC means thousands of "Source Lines Of Code" or, in other words, lines of code excluding blank lines and comments.) By comparison, the project has 715 times as much test code and test scripts - 45261.5 KSLOC.
In the end, what always strikes me as the most significant is none of those possible metrics seem to be as important as the simple statement, "it meets all the requirements." (So don't lose sight of that goal in the process of achieving it.)
If you want something to judge a team's progress, then you have to lay down individual requirements. This gives you something to point to and say "this one's done, this one isn't". It's not linear (solving each requirement will require varying work), and the only way you can linearize it is if the problem has already been solved elsewhere (and thus you can quantize work per requirement).

I like revenue, sales numbers, profit. They are pretty good metrics of a code base.

Perhaps measure not only the code covered (touched) by the unit tests but also how good the assertions are.
One metric that is easy to implement is to measure the size of the objects passed to Assert.AreEqual.
You can create your own Assert implementation that calls Assert.AreEqual and measures the size of the object passed as the second parameter.
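A rough sketch of that idea translated into Python terms (the original suggestion is about Assert.AreEqual in .NET; every name here is made up):

import unittest

assertion_sizes = []   # collected during the run, reported afterwards

class MeasuringTestCase(unittest.TestCase):
    def assert_equal_measured(self, expected, actual):
        # Record a crude "size" of the expected object before delegating
        # to the normal equality assertion.
        try:
            size = len(expected)
        except TypeError:
            size = 1
        assertion_sizes.append(size)
        self.assertEqual(expected, actual)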

Related

What is the rationale behind the choice of specific kinds of coverage criterias when looking at an algorithm?

So, I am unsure of what the pros and cons of du-path coverage (or, for instance, any data flow coverage criterion) are versus predicate criteria or branch/node criteria.
I can see that in certain cases there are clear advantages of one kind of coverage over the other. For instance, if a lot of my algorithm consists of something like the following example
void m();
void n();

void method(boolean b) {
    if (b) {
        m();
    } else {
        n();
    }
}
it is clear that using any kind of data flow coverage criterion will leave a lot of logic untested (which is something we'd like to avoid). For the given case, using a predicate/clause criterion would be much better.
Now, what I'd like to know is for the general case, what are the things you look for in an algorithm when deciding which kind of coverage criteria you'll follow, between
Graph
Data Flow
Logic
Input
Syntax
kinds of coverage criteria (basically, the ones found in Introduction to Software Testing). That is, I am basically looking for general heuristics to follow, for the general case.
Thanks
Admittedly, I've not done much research in this area and am not necessarily following you completely (mostly due to my lack of experience in this subject). But hopefully this is not too far off base.
My understanding of Code Coverage has always been that you want to cover every possible execution path. Now I know some paths are much more important than others (for example: the "happy path" is much more important than some obscure path to set some property), but regardless of whether or not you "cover" every path, you at least want to be aware of their existence and make somewhat of a conscious choice on what and what not to cover (via unit tests or manually).
You can start by eyeballing methods and writing test cases, but this quickly becomes unreliable since you're guaranteed NOT to see every possible execution path no matter the algorithm type. Even very small programs will produce bugs that were unaccounted for since the tester failed to think of "trying it" that way.
What you really need is a Code Coverage tool to tell you which execution paths were covered by your unit tests and which were not. At that point, you can make logical decisions on whether or not to cover those missing cases (since not all cases may be worth your time). Without such a tool, I'd think you'd spend countless man hours going line by line and keeping track of what test cases covered what... Even if the tool costs several hundred dollars (or even several thousand dollars), I think you'd quickly recoup those funds in time saved.
So my general heuristic for you is to use such a tool to track your testing coverage and make decisions to cover or not to cover based on its results.
Some code coverage tools (not comprehensive):
.NET - NCover
PHP - PHPCoverage
JAVA - EMMA
Ruby - CoverMe
JavaScript - JSCoverage

Embedded Software Defect Rate

What defect rate can I expect in a C++ codebase that is written for an embedded processor (DSP), given that there have been no unit tests, no code reviews, no static code analysis, and that compiling the project generates about 1500 warnings. Is 5 defects/100 lines of code a reasonable estimate?
Your question is "Is 5 defects/100 lines of code a reasonable estimate?" That question is extremely difficult to answer, and it's highly dependent on the codebase & code complexity.
You also mentioned in a comment "to show the management that there are probably lots of bugs in the codebase" -- that's great, kudos, right on.
In order to open management's figurative eyes, I'd suggest at least a 3-pronged approach:
take specific compiler warnings, and show how some of them can cause undefined / disastrous behavior. Not all warnings will be as weighty. For example, if you have someone using an uninitialized pointer, that's pure gold. If you have someone stuffing an unsigned 16-bit value into an unsigned 8-bit value, and it can be shown that the 16-bit value will always be <= 255, that one isn't gonna help make your case as strongly.
run a static analysis tool. PC-Lint (or Flexelint) is cheap & provides good "bang for the buck". It will almost certainly catch stuff the compiler won't, and it can also run across translation units (lint everything together, even with 2 or more passes) and find more subtle bugs. Again, use some of these as indications.
run a tool that will give other metrics on code complexity, another source of bugs. I'd recommend M Squared's Resource Standard Metrics (RSM) which will give you more information and metrics (including code complexity) than you could hope for. When you tell management that a complexity score over 50 is "basically untestable" and you have a score of 200 in one routine, that should open some eyes.
One other point: I require clean compiles in my groups, and clean Lint output too. Usually this can be accomplished solely by writing good code, but occasionally the compiler / lint warnings need to be tweaked to quiet the tool for things that aren't problems (use judiciously).
But the important point I want to make is this: be very careful when going in & fixing compiler & lint warnings. It's an admirable goal, but you can also inadvertently break working code, and/or uncover undefined behavior that accidentally worked in the "broken" code. Yes, this really does happen. So tread carefully.
Lastly, if you have a solid set of tests already in place, that will help you determine if you accidentally break something while refactoring.
Good luck!
Despite my scepticism of the validity of any estimate in this case, I have found some statistics that may be relevant.
In this article, the author cites figures from "a large body of empirical studies", published in Software Assessments, Benchmarks, and Best Practices (Jones, 2000). At SEI CMM Level 1, which sounds like the level of this code, one can expect a defect rate of 0.75 per function point. I'll leave it to you to determine how function points and LOC may relate in your code - you'll probably need a metrics tool to perform that analysis.
Steve McConnell in Code Complete cites a study of 11 projects developed by the same team, 5 without code reviews, 6 with code reviews. The defect rate for the non-reviewed code was 4.5 per 100 LOC, and for the reviewed it was 0.82. So on that basis, your estimate seems fair in the absence of any other information. However I have to assume a level of professionalism amongst this team (just from the fact that they felt the need to perform the study), and that they would have at least attended to the warnings; your defect rate could be much higher.
The point about warnings is that some are benign, and some are errors (i.e. will result in undesired behaviour of the software), if you ignore them on the assumption that they are all benign, you will introduce errors. Moreover some will become errors under maintenance when other conditions change, but if you have already chosen to accept a warning, you have no defence against introduction of such errors.
Take a look at the code quality. It would quickly give you an indication of the amount of problems hiding in the source. If the source is ugly and takes a long time to understand, there will be a lot of bugs in the code.
Well structured code with a consistent style that is easy to understand is going to contain fewer problems. Code shows how much effort and thought went into it.
My guess is if the source contains that many warnings there is going to be a lot of bugs hiding out in the code.
That also depends on who wrote the code (level of experience), and how big the code base is.
I would treat all warnings as errors.
How many errors do you get when you run a static analysis tool on the code?
EDIT
Run cccc and check McCabe's cyclomatic complexity. It should tell you how complex the code is.
http://sourceforge.net/projects/cccc/
Run other static analysis tools.
If you want to get an estimate of the number of defects, the usual way of statistical estimation is to subsample the data. I would pick three medium-sized subroutines at random, and check them carefully for bugs (eliminate compiler warnings, run a static analysis tool, etc). If you find three bugs in 100 total lines of code selected at random, it seems reasonable that a similar density of bugs is in the rest of the code.
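The extrapolation itself is back-of-the-envelope arithmetic; a sketch with illustrative numbers:

# Illustrative numbers: 3 bugs found in 100 sampled lines, 20,000 lines total
bugs_in_sample = 3
sampled_loc = 100
total_loc = 20_000

density = bugs_in_sample / sampled_loc      # 0.03 defects per line
estimated_defects = density * total_loc     # about 600 defects remaining
print(estimated_defects)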
The problem mentioned here of introducing new bugs is an important issue, but you don't need to check the modified code back into the production branch to run this test. I would suggest a thorough set of unit tests before modifying any subroutines, and cleaning up all the code followed by very thorough system testing before releasing new code to production.
If you want to demonstrate the benefits of unit tests, code reviews, static analysis tools, I suggest doing a pilot study.
Do some unit tests, code reviews, and run static analysis tools on a portion of the code. Show management how many bugs you find using those methods. Hopefully, the results speak for themselves.
The following article has some numbers based on real-life projects to which static analysis has been applied to: http://www.stsc.hill.af.mil/crosstalk/2003/11/0311German.html
Of course the criteria by which an anomaly is counted can affect the results dramatically, leading to the large variation in the figures shown in Table 1. In this table, the number of anomalies per thousand lines of code for C ranges from 500 (!) to about 10 (auto generated).

Unit Testing Machine Learning Code

I am writing a fairly complicated machine learning program for my thesis in computer vision. It's working fairly well, but I need to keep trying out new things out and adding new functionality. This is problematic because I sometimes introduce bugs when I am extending the code or trying to simplify an algorithm.
Clearly the correct thing to do is to add unit tests, but it is not clear how to do this. Many components of my program produce a somewhat subjective answer, and I cannot automate sanity checks.
For example, I had some code that approximated a curve with a lower-resolution curve, so that I could do computationally intensive work on the lower-resolution curve. I accidentally introduced a bug into this code, and only found it through a painstaking search when the results of my entire program got slightly worse.
But, when I tried to write a unit-test for it, it was unclear what I should do. If I make a simple curve that has a clearly correct lower-resolution version, then I'm not really testing out everything that could go wrong. If I make a simple curve and then perturb the points slightly, my code starts producing different answers, even though this particular piece of code really seems to work fine now.
You may not appreciate the irony, but basically what you have there is legacy code: a chunk of software without any unit tests. Naturally you don't know where to begin. So you may find it helpful to read up on handling legacy code.
The definitive thought on this is Michael Feathers' book, Working Effectively with Legacy Code. There used to be a helpful summary of that on the ObjectMentor site, but alas the website has gone the way of the company. However WELC has left a legacy in reviews and other articles. Check them out (or just buy the book), although the key lessons are the ones which S.Lott and tvanfosson cover in their replies.
2019 update: I have fixed the link to the WELC summary with a version from the Wayback Machine web archive (thanks #milia).
Also - and despite knowing that answers which comprise mainly links to other sites are low quality answers :) - here is a link to a new (2019 new) Google tutorial on Testing and Debugging ML code. I hope this will be of illumination to future Seekers who stumble across this answer.
"then I'm not really testing out everything that could go wrong."
Correct.
The job of unit tests is not to test everything that could go wrong.
The job of unit tests is to test that what you have does the right thing, given specific inputs and specific expected results. The important part here is the specific visible, external requirements are satisfied by specific test cases. Not that every possible thing that could go wrong is somehow prevented.
Nothing can test everything that could go wrong. You can write a proof, but you'll be hard-pressed to write tests for everything.
Choose your test cases wisely.
Further, the job of unit tests is to test that each small part of the overall application does the right thing -- in isolation.
Your "code that approximated a curve with a lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation. The integrated whole could also be tested to be sure that it works.
Your "computationally intensive work on the lower-resolution curve" for example, probably has several small parts that can be tested as separate units. In isolation.
That point of unit testing is to create small, correct units that are later assembled.
Without seeing your code, it's hard to tell, but I suspect that you are attempting to write tests at too high a level. You might want to think about breaking your methods down into smaller components that are deterministic and testing these. Then test the methods that use these methods by providing mock implementations that return predictable values from the underlying methods (which are probably located on a different object). Then you can write tests that cover the domain of the various methods, ensuring that you have coverage of the full range of possible outcomes. For the small methods you do so by providing values that represent the domain of inputs. For the methods that depend on these, by providing mock implementations that return the range of outcomes from the dependencies.
Your unit tests need to employ some kind of fuzz factor, either by accepting approximations, or using some kind of probabilistic checks.
For example, if you have some function that returns a floating point result, it is almost impossible to write a test that works correctly across all platforms. Your checks would need to perform the approximation.
TEST_ALMOST_EQ(result, 4.0);
Above TEST_ALMOST_EQ might verify that result is between 3.9 and 4.1 (for example).
Alternatively, if your machine learning algorithms are probabilistic, your tests will need to accommodate for it by taking the average of multiple runs and expecting it to be within some range.
x = 0;
for (100 times) {
    x += result_probabilistic_test();
}
avg = x / 100;
TEST_RANGE(avg, 10.0, 15.0);
Of course, the tests are non-deterministic, so you will need to tune them such that you can get non-flaky tests with a high probability (e.g., increase the number of trials, or increase the range of error).
You can also use mocks for this (e.g, a mock random number generator for your probabilistic algorithms), and they usually help for deterministically testing specific code paths, but they are a lot of effort to maintain. Ideally, you would use a combination of fuzzy testing and mocks.
HTH.
Generally, for statistical measures you would build in an epsilon for your answer, i.e., the mean square difference of your points would be < 0.01 or some such. Another option is to run several times and if it fails "too often" then you have an issue.
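For example, an epsilon check in Python might look like this (the numeric routine under test is a stand-in for whatever you are really computing):

import math

def estimated_pi(n=100_000):
    # Stand-in for a numeric routine under test (Leibniz series for pi)
    return 4 * sum((-1) ** k / (2 * k + 1) for k in range(n))

def test_result_within_epsilon():
    # Accept an approximate answer instead of exact floating-point equality
    assert math.isclose(estimated_pi(), math.pi, abs_tol=1e-3)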
Get an appropriate test dataset (maybe a subset of what you usually use)
Calculate some metric on this dataset (e.g. the accuracy)
Note down the value obtained (cross-validated)
This should give an indication of what to set the threshold for
Of course it can be that when making changes to your code the performance on the dataset will increase a little, but if it ever decreases by a large amount this would be an indication something is going wrong.
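A sketch of that kind of threshold test; the model, the dataset and the baseline number are placeholders:

BASELINE_ACCURACY = 0.87   # noted down from an earlier cross-validated run
TOLERANCE = 0.02           # small drops are tolerated, large ones fail the test

def accuracy(model, dataset):
    correct = sum(1 for features, label in dataset if model(features) == label)
    return correct / len(dataset)

def test_accuracy_has_not_regressed():
    # Placeholder data; in practice load your fixed evaluation subset here
    dataset = [((0,), 0), ((1,), 1), ((2,), 0), ((3,), 1)]
    model = lambda features: features[0] % 2
    assert accuracy(model, dataset) >= BASELINE_ACCURACY - TOLERANCE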

Testing with random inputs best practices

NOTE: I mention the next couple of paragraphs as background. If you just want a TL;DR, feel free to skip down to the numbered questions as they are only indirectly related to this info.
I'm currently writing a python script that does some stuff with POSIX dates (among other things). Unit testing these seems a little bit difficult though, since there's such a wide range of dates and times that can be encountered.
Of course, it's impractical for me to try to test every single date/time combination possible, so I think I'm going to try a unit test that randomizes the inputs and then reports what the inputs were if the test failed. Statistically speaking, I figure that I can achieve a bit more completeness of testing than I could if I tried to think of all potential problem areas (due to missing things) or testing all cases (due to sheer infeasibility), assuming that I run it enough times.
So here are a few questions (mainly indirectly related to the above):
What types of code are good candidates for randomized testing? What types of code aren't?
How do I go about determining the number of times to run the code with randomized inputs? I ask this because I want to have a large enough sample to determine any bugs, but don't want to wait a week to get my results.
Are these kinds of tests well suited for unit tests, or is there another kind of test that it works well with?
Are there any other best practices for doing this kind of thing?
Related topics:
Random data in unit tests?
I agree with Federico - randomised testing is counterproductive. If a test won't reliably pass or fail, it's very hard to fix it and know it's fixed. (This is also a problem when you introduce an unreliable dependency, of course.)
Instead, however, you might like to make sure you've got good data coverage in other ways. For instance:
Make sure you have tests for the start, middle and end of every month of every year between 1900 and 2100 (if those are suitable for your code, of course).
Use a variety of cultures, or "all of them" if that's known.
Try "day 0" and "one day after the end of each month" etc.
In short, still try a lot of values, but do so programmatically and repeatably. You don't need every value you try to be a literal in a test - it's fine to loop round all known values for one axis of your testing, etc.
You'll never get complete coverage, but it will at least be repeatable.
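For instance, the month-boundary idea can be covered programmatically like this (Python; the assertion is a placeholder for whatever property your date handling should guarantee):

import calendar
from datetime import date

def test_month_boundaries():
    for year in range(1900, 2101):
        for month in range(1, 13):
            last_day = calendar.monthrange(year, month)[1]
            for day in (1, 15, last_day):
                d = date(year, month, day)
                # Placeholder check: replace with the real property under test
                assert d.month == month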
EDIT: I'm sure there are places where random tests are useful, although probably not for unit tests. However, in this case I'd like to suggest something: use one RNG to create a random but known seed, and then seed a new RNG with that value - and log it. That way if something interesting happens you will be able to reproduce it by starting an RNG with the logged seed.
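A sketch of that seed-logging idea with Python's random module:

import random

def make_logged_rng():
    # Use one RNG to pick a seed, log it, then seed a fresh RNG with it so any
    # interesting failure can be reproduced by reusing the logged seed.
    seed = random.SystemRandom().randrange(2 ** 32)
    print("test RNG seed:", seed)   # or log it through your test framework
    return random.Random(seed)

rng = make_logged_rng()
value = rng.randint(0, 10 ** 6)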
With respect to the 3rd question, in my opinion random tests are not well suited for unit testing. If applied to the same piece of code, a unit test should succeed always, or fail always (i.e., wrong behavior due to bugs should be reproducible). You could however use random techniques to generate a large data set, then use that data set within your unit tests; there's nothing wrong with it.
Wow, great question! Some thoughts:
Random testing is always a good confidence building activity, though as you mentioned, it's best suited to certain types of code.
It's an excellent way to stress-test any code whose performance may be related to the number of times it's been executed, or to the sequence of inputs.
For fairly simple code, or code that expects a limited type of input, I'd prefer systematic test that explicitly cover all of the likely cases, samples of each unlikely or pathological case, and all the boundary conditions.
Q1) I found that distributed systems with lots of concurrency are good candidates for randomized testing. It is hard to create all possible scenarios for such applications, but random testing can expose problems that you never thought about.
Q2) I guess you could try to use statistics to build a confidence interval around having discovered all "bugs". But the practical answer is: run your randomized tests as many times as you can afford.
Q3) I have found that randomized testing is useful, but only after you have written the normal battery of unit, integration and regression tests. You should integrate your randomized tests as part of the normal test suite, though probably as a small run. If nothing else, you avoid bit rot in the tests themselves, and get some modicum of coverage as the team runs the tests with different random inputs.
Q4) When writing randomized tests, make sure you save the random seed with the results of the tests. There is nothing more frustrating than finding that your random tests caught a bug and not being able to run the test again with the same input. Make sure your test can also be executed with the saved seed.
A few things:
With random testing, you can't really tell how good a piece of code is, but you can tell how bad it is.
Random testing is better suited for things that have random inputs -- a prime example is anything that's exposed to users. So, for example, something that randomly clicks & types all over your app (or OS) is a good test of general robustness.
Similarly, developers count as users. So something that randomly assembles a GUI from your framework is another good candidate.
Again, you're not going to find all the bugs this way -- what you're looking for is "if I do a million whacky things, do ANY of them result in system corruption?" If not, you can feel some level of confidence that your app/OS/SDK/whatever might hold up to a few days' exposure to users.
...But, more importantly, if your random-beater-upper test app can crash your app/OS/SDK in about 5 minutes, that's about how long you'll have until the first fire-drill if you try to ship that sucker.
Also note: REPRODUCIBILITY IS IMPORTANT IN TESTING! Hence, have your test-tool log the random-seed that it used, and have a parameter to start with the same seed. In addition, have it either start from a known "base state" (i.e., reinstall everything from an image on a server & start there) or some recreatable base-state (i.e., reinstall from that image, then alter it according to some random-seed that the test tool takes as a parameter.)
Of course, the developers will appreciate if the tool has nice things like "save state every 20,000 events" and "stop right before event #" and "step forward 1/10/100 events." This will greatly aid them in reproducing the problem, finding and fixing it.
As someone else pointed out, servers are another thing exposed to users. Get yourself a list of 1,000,000 URLs (grep from server logs), then feed them to your random number generator.
And remember: "system went 24 hours of random pounding without errors" does not mean it's ready to ship, it just means it's stable enough to start some serious testing. Before it can do that, QA should feel free to say "look, your POS can't even last 24 hours under life-like random user simulation -- you fix that, I'm going to spend some time writing better tools."
Oh yeah, one last thing: in addition to the "pound it as fast & hard as you can" tests, have the ability to do "exactly what a real user [who was perhaps deranged, or a baby pounding on the keyboard/mouse] would do." That is, if you're doing random user events, do them at the speed that a very fast typist or very fast mouse user could do (with occasional delays, to simulate a SLOW person), in addition to "as fast as my program can spit out events." These are two very different types of tests, and will get very different reactions when bugs are found.
To make tests reproducible, simply use a fixed seed start value. That ensures the same data is used whenever the test runs. Tests will reliably pass or fail.
Good / bad candidates? Randomized tests are good at finding edge cases (exceptions). A problem is to define the correct result of a randomized input.
Determining the number of times to run the code: Simply try it out, if it takes too long reduce the iteration count. You may want to use a code coverage tool to find out what part of your application is actually tested.
Are these kinds of tests well suited for unit tests? Yes.
This might be slightly off-topic, but if you're using .net, there is Pex, which does something similar to randomized testing, but with more intuition by attempting to generate a "random" test case that exercises all of the paths through your code.
Here is my answer to a similar question: Is it a bad practice to randomly-generate test data?. Other answers may be useful as well.
Random testing is a bad practice as long as you don't have a solution for the oracle problem, i.e., determining which is the expected outcome of your software given its input.
If you solved the oracle problem, you can get one step further than simple random input generation. You can choose input distributions such that specific parts of your software get exercised more than with simple random.
You then switch from random testing to statistical testing.
if (a > 0)
    // Do Foo
else if (b < 0)
    // Do Bar
else
    // Do Foobar
If you select a and b randomly in int range, you exercise Foo 50% of the time, Bar 25% of the time and Foobar 25% of the time. It is likely that you will find more bugs in Foo than in Bar or Foobar.
If you select a such that it is negative 66.66% of the time, Bar and Foobar get exercised more than with your first distribution. Indeed the three branches get exercised each 33.33% of the time.
Of course, if your observed outcome is different than your expected outcome, you have to log everything that can be useful to reproduce the bug.
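A sketch of that skewed input generation in Python (the two-thirds split mirrors the example above; the bounds are arbitrary):

import random

rng = random.Random(42)   # fixed seed so the run is reproducible

def biased_a():
    # Negative or zero about two thirds of the time, positive otherwise,
    # so Bar and Foobar get exercised as often as Foo.
    if rng.random() < 2 / 3:
        return rng.randint(-10 ** 6, 0)
    return rng.randint(1, 10 ** 6)

def biased_input():
    return biased_a(), rng.randint(-10 ** 6, 10 ** 6)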
Random testing has the huge advantage that individual tests can be generated for extremely low cost. This is true even if you only have a partial oracle (for example, does the software crash?)
In a complex system, random testing will find bugs that are difficult to find by any other means. Think about what this means for security testing: even if you don't do random testing, the black hats will, and they will find bugs you missed.
A fascinating subfield of random testing is randomized differential testing, where two or more systems that are supposed to show the same behavior are stimulated with a common input. If their behavior differs, a bug (in one or both) has been found. This has been applied with great effect to testing of compilers, and invariably finds bugs in any compiler that has not been previously confronted with the technique. Even if you have only one compiler you can try it on different optimization settings to look for varying results, and of course crashes always mean bugs.
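A minimal sketch of randomized differential testing, with two sort implementations standing in for the systems under comparison:

import random

def reference_sort(xs):
    return sorted(xs)

def sort_under_test(xs):
    # Stand-in for the second implementation being compared
    result = list(xs)
    result.sort()
    return result

def differential_test(trials=1000, seed=1234):
    rng = random.Random(seed)   # keep the seed so failures can be reproduced
    for _ in range(trials):
        data = [rng.randint(-100, 100) for _ in range(rng.randint(0, 50))]
        assert reference_sort(data) == sort_under_test(data), data

differential_test()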

What is a reasonable code coverage % for unit tests (and why)? [closed]

If you were to mandate a minimum percentage code-coverage for unit tests, perhaps even as a requirement for committing to a repository, what would it be?
Please explain how you arrived at your answer (since if all you did was pick a number, then I could have done that all by myself ;)
This prose by Alberto Savoia answers precisely that question (in a nicely entertaining manner at that!):
http://www.artima.com/forums/flat.jsp?forum=106&thread=204677
Testivus On Test Coverage
Early one morning, a programmer asked the great master:
“I am ready to write some unit tests. What code coverage should I aim for?”
The great master replied:
“Don’t worry about coverage, just write some good tests.”
The programmer smiled, bowed, and left.
...
Later that day, a second programmer asked the same question.
The great master pointed at a pot of boiling water and said:
“How many grains of rice should I put in that pot?”
The programmer, looking puzzled, replied:
“How can I possibly tell you? It depends on how many people you need to feed, how hungry they are, what other food you are serving, how much rice you have available, and so on.”
“Exactly,” said the great master.
The second programmer smiled, bowed, and left.
...
Toward the end of the day, a third programmer came and asked the same question about code coverage.
“Eighty percent and no less!” replied the master in a stern voice, pounding his fist on the table.
The third programmer smiled, bowed, and left.
...
After this last reply, a young apprentice approached the great master:
“Great master, today I overheard you answer the same question about code coverage with three different answers. Why?”
The great master stood up from his chair:
“Come get some fresh tea with me and let’s talk about it.”
After they filled their cups with smoking hot green tea, the great master began to answer:
“The first programmer is new and just getting started with testing. Right now he has a lot of code and no tests. He has a long way to go; focusing on code coverage at this time would be depressing and quite useless. He’s better off just getting used to writing and running some tests. He can worry about coverage later.”
“The second programmer, on the other hand, is quite experienced both at programming and testing. When I replied by asking her how many grains of rice I should put in a pot, I helped her realize that the amount of testing necessary depends on a number of factors, and she knows those factors better than I do – it’s her code after all. There is no single, simple, answer, and she’s smart enough to handle the truth and work with that.”
“I see,” said the young apprentice, “but if there is no single simple answer, then why did you answer the third programmer ‘Eighty percent and no less’?”
The great master laughed so hard and loud that his belly, evidence that he drank more than just green tea, flopped up and down.
“The third programmer wants only simple answers – even when there are no simple answers … and then does not follow them anyway.”
The young apprentice and the grizzled great master finished drinking their tea in contemplative silence.
Code Coverage is a misleading metric if 100% coverage is your goal (instead of 100% testing of all features).
You could get a 100% by hitting all the lines once. However you could still miss out testing a particular sequence (logical path) in which those lines are hit.
You could fall short of 100% but still have tested all of your frequently used code paths (the important 80%). Having tests that exercise every 'throw ExceptionTypeX' or similar defensive-programming guard you've put in is a 'nice to have', not a 'must have'.
So trust yourself or your developers to be thorough and cover every path through their code. Be pragmatic and don't chase the magical 100% coverage. If you TDD your code you should get 90%+ coverage as a bonus. Use code coverage to highlight chunks of code you have missed (this shouldn't happen if you TDD, though, since you write code only to make a test pass; no code can exist without its partner test).
Jon Limjap makes a good point - there is not a single number that is going to make sense as a standard for every project. There are projects that just don't need such a standard. Where the accepted answer falls short, in my opinion, is in describing how one might make that decision for a given project.
I will take a shot at doing so. I am not an expert in test engineering and would be happy to see a more informed answer.
When to set code coverage requirements
First, why would you want to impose such a standard in the first place? In general, when you want to introduce empirical confidence in your process. What do I mean by "empirical confidence"? Well, the real goal is correctness. For most software, we can't possibly know this across all inputs, so we settle for saying that code is well-tested. This is more knowable, but is still a subjective standard: it will always be open to debate whether or not you have met it. Those debates are useful and should occur, but they also expose uncertainty.
Code coverage is an objective measurement: once you see your coverage report, there is no ambiguity about whether the standard has been met. Does it prove correctness? Not at all, but it has a clear relationship to how well-tested the code is, which in turn is our best way to increase confidence in its correctness. Code coverage is a measurable approximation of immeasurable qualities we care about.
Some specific cases where having an empirical standard could add value:
To satisfy stakeholders. For many projects, there are various actors who have an interest in software quality who may not be involved in the day-to-day development of the software (managers, technical leads, etc.) Saying "we're going to write all the tests we really need" is not convincing: They either need to trust entirely, or verify with ongoing close oversight (assuming they even have the technical understanding to do so.) Providing measurable standards and explaining how they reasonably approximate actual goals is better.
To normalize team behavior. Stakeholders aside, if you are working on a team where multiple people are writing code and tests, there is room for ambiguity for what qualifies as "well-tested." Do all of your colleagues have the same idea of what level of testing is good enough? Probably not. How do you reconcile this? Find a metric you can all agree on and accept it as a reasonable approximation. This is especially (but not exclusively) useful in large teams, where leads may not have direct oversight over junior developers, for instance. Networks of trust matter as well, but without objective measurements, it is easy for group behavior to become inconsistent, even if everyone is acting in good faith.
To keep yourself honest. Even if you're the only developer and only stakeholder for your project, you might have certain qualities in mind for the software. Instead of making ongoing subjective assessments about how well-tested the software is (which takes work), you can use code coverage as a reasonable approximation, and let machines measure it for you.
Which metrics to use
Code coverage is not a single metric; there are several different ways of measuring coverage. Which one you might set a standard upon depends on what you're using that standard to satisfy.
I'll use two common metrics as examples of when you might use them to set standards:
Statement coverage: What percentage of statements have been executed during testing? Useful to get a sense of the physical coverage of your code: How much of the code that I have written have I actually tested?
This kind of coverage supports a weaker correctness argument, but is also easier to achieve. If you're just using code coverage to ensure that things get tested (and not as an indicator of test quality beyond that) then statement coverage is probably sufficient.
Branch coverage: When there is branching logic (e.g. an if), have both branches been evaluated? This gives a better sense of the logical coverage of your code: How many of the possible paths my code may take have I tested?
This kind of coverage is a much better indicator that a program has been tested across a comprehensive set of inputs. If you're using code coverage as your best empirical approximation for confidence in correctness, you should set standards based on branch coverage or similar.
There are many other metrics (line coverage is similar to statement coverage, but yields different numeric results for multi-line statements, for instance; conditional coverage and path coverage are similar to branch coverage, but reflect a more detailed view of the possible permutations of program execution you might encounter).
What percentage to require
Finally, back to the original question: If you set code coverage standards, what should that number be?
Hopefully it's clear at this point that we're talking about an approximation to begin with, so any number we pick is going to be inherently approximate.
Some numbers that one might choose:
100%. You might choose this because you want to be sure everything is tested. This doesn't give you any insight into test quality, but does tell you that some test of some quality has touched every statement (or branch, etc.) Again, this comes back to degree of confidence: If your coverage is below 100%, you know some subset of your code is untested.
Some might argue that this is silly, and you should only test the parts of your code that are really important. I would argue that you should also only maintain the parts of your code that are really important. Code coverage can be improved by removing untested code, too.
99% (or 95%, other numbers in the high nineties.) Appropriate in cases where you want to convey a level of confidence similar to 100%, but leave yourself some margin to not worry about the occasional hard-to-test corner of code.
80%. I've seen this number in use a few times, and don't entirely know where it originates. I think it might be a weird misappropriation of the 80-20 rule; generally, the intent here is to show that most of your code is tested. (Yes, 51% would also be "most", but 80% is more reflective of what most people mean by most.) This is appropriate for middle-ground cases where "well-tested" is not a high priority (you don't want to waste effort on low-value tests), but is enough of a priority that you'd still like to have some standard in place.
I haven't seen numbers below 80% in practice, and have a hard time imagining a case where one would set them. The role of these standards is to increase confidence in correctness, and numbers below 80% aren't particularly confidence-inspiring. (Yes, this is subjective, but again, the idea is to make the subjective choice once when you set the standard, and then use an objective measurement going forward.)
Other notes
The above assumes that correctness is the goal. Code coverage is just information; it may be relevant to other goals. For instance, if you're concerned about maintainability, you probably care about loose coupling, which can be demonstrated by testability, which in turn can be measured (in certain fashions) by code coverage. So your code coverage standard provides an empirical basis for approximating the quality of "maintainability" as well.
Code coverage is great, but functionality coverage is even better. I don't believe in covering every single line I write. But I do believe in writing 100% test coverage of all the functionality I want to provide (even for the extra cool features I came up with myself and which were not discussed during the meetings).
I don't care if I would have code which is not covered in tests, but I would care if I would refactor my code and end up having a different behaviour. Therefore, 100% functionality coverage is my only target.
My favorite code coverage is 100% with an asterisk. The asterisk comes because I prefer to use tools that allow me to mark certain lines as lines that "don't count". If I have covered 100% of the lines which "count", I am done.
The underlying process is:
I write my tests to exercise all the functionality and edge cases I can think of (usually working from the documentation).
I run the code coverage tools
I examine any lines or paths not covered, and any that I consider unimportant or unreachable (due to defensive programming) I mark as not counting
I write new tests to cover the missing lines and improve the documentation if those edge cases are not mentioned.
This way if I and my collaborators add new code or change the tests in the future, there is a bright line to tell us if we missed something important - the coverage dropped below 100%. However, it also provides the flexibility to deal with different testing priorities.
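With Python's coverage.py, for example, the "don't count" marking is an inline exclusion pragma (other tools have similar mechanisms):

def load_config(path):
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:   # pragma: no cover  - defensive path, excluded from the 100% goal
        return ""

Lines marked this way are excluded from the report, so the remaining 100% stays a meaningful bright line.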
I have another anecdote on test coverage I'd like to share.
We have a huge project wherein I noted, over Twitter, that with 700 unit tests we only have 20% code coverage.
Scott Hanselman replied with words of wisdom:
Is it the RIGHT 20%? Is it the 20% that represents the code your users hit the most? You might add 50 more tests and only add 2%.
Again, it goes back to my Testivus on Code Coverage Answer. How much rice should you put in the pot? It depends.
Many shops don't value tests, so if you are above zero at least there is some appreciation of worth - so arguably non-zero isn't bad as many are still zero.
In the .NET world people often quote 80% as reasonable. But they say this at solution level. I prefer to measure at project level: 30% might be fine for a UI project if you've got Selenium, etc. or manual tests, 20% for the data layer project might be fine, but 95%+ might be quite achievable for the business rules layer, if not wholly necessary. So the overall coverage may be, say, 60%, but the critical business logic may be much higher.
I've also heard this: aspire to 100% and you'll hit 80%; but aspire to 80% and you'll hit 40%.
Bottom line: Apply the 80:20 rule, and let your app's bug count guide you.
For a well designed system, where unit tests have driven the development from the start, I would say 85% is a quite low number. Small classes designed to be testable should not be hard to cover better than that.
It's easy to dismiss this question with something like:
Covered lines do not equal tested logic and one should not read too much into the percentage.
True, but there are some important points to be made about code coverage. In my experience this metric is actually quite useful, when used correctly. Having said that, I have not seen all systems and I'm sure there are tons of them where it's hard to see code coverage analysis adding any real value. Code can look so different and the scope of the available test framework can vary.
Also, my reasoning mainly concerns quite short test feedback loops. For the product that I'm developing the shortest feedback loop is quite flexible, covering everything from class tests to inter process signalling. Testing a deliverable sub-product typically takes 5 minutes and for such a short feedback loop it is indeed possible to use the test results (and specifically the code coverage metric that we are looking at here) to reject or accept commits in the repository.
When using the code coverage metric you should not just have a fixed (arbitrary) percentage which must be fulfilled. Doing this does not give you the real benefits of code coverage analysis in my opinion. Instead, define the following metrics:
Low Water Mark (LWM), the lowest number of uncovered lines ever seen in the system under test
High Water Mark (HWM), the highest code coverage percentage ever seen for the system under test
New code can only be added if we don't go above the LWM and we don't go below the HWM. In other words, code coverage is not allowed to decrease, and new code should be covered. Notice how I say should and not must (explained below).
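A sketch of how such a gate might look in the integration step; the watermark values and the coverage-report parsing are placeholders:

def check_watermarks(uncovered_lines, coverage_pct, lwm_uncovered, hwm_coverage):
    # Reject the commit if uncovered lines grew past the LWM
    # or coverage fell below the HWM; otherwise tighten the watermarks.
    if uncovered_lines > lwm_uncovered:
        raise SystemExit("rejected: uncovered lines rose above the low water mark")
    if coverage_pct < hwm_coverage:
        raise SystemExit("rejected: coverage fell below the high water mark")
    return min(lwm_uncovered, uncovered_lines), max(hwm_coverage, coverage_pct)

# Illustrative values; in practice they come from the coverage report and a stored file
new_lwm, new_hwm = check_watermarks(uncovered_lines=120, coverage_pct=83.4,
                                    lwm_uncovered=125, hwm_coverage=83.0)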
But doesn't this mean that it will be impossible to clean away old well-tested rubbish that you have no use for anymore? Yes, and that's why you have to be pragmatic about these things. There are situations when the rules have to be broken, but for your typical day-to-day integration my experience is that these metrics are quite useful. They give the following two implications.
Testable code is promoted.
When adding new code you really have to make an effort to make the code testable, because you will have to try and cover all of it with your test cases. Testable code is usually a good thing.
Test coverage for legacy code is increasing over time.
When adding new code and not being able to cover it with a test case, one can try to cover some legacy code instead to get around the LWM rule. This sometimes necessary cheating at least gives the positive side effect that the coverage of legacy code will increase over time, making the seemingly strict enforcement of these rules quite pragmatic in practice.
And again, if the feedback loop is too long it might be completely impractical to set up something like this in the integration process.
I would also like to mention two more general benefits of the code coverage metric.
Code coverage analysis is part of the dynamic code analysis (as opposed to the static one, i.e. Lint). Problems found during the dynamic code analysis (by tools such as the purify family, http://www-03.ibm.com/software/products/en/rational-purify-family) are things like uninitialized memory reads (UMR), memory leaks, etc. These problems can only be found if the code is covered by an executed test case. The code that is the hardest to cover in a test case is usually the abnormal cases in the system, but if you want the system to fail gracefully (i.e. error trace instead of crash) you might want to put some effort into covering the abnormal cases in the dynamic code analysis as well. With just a little bit of bad luck, a UMR can lead to a segfault or worse.
People take pride in keeping 100% for new code, and people discuss testing problems with a similar passion as other implementation problems. How can this function be written in a more testable manner? How would you go about trying to cover this abnormal case, etc.
And a negative, for completeness.
In a large project with many involved developers, everyone is not going to be a test-genius for sure. Some people tend to use the code coverage metric as proof that the code is tested and this is very far from the truth, as mentioned in many of the other answers to this question. It is ONE metric that can give you some nice benefits if used properly, but if it is misused it can in fact lead to bad testing. Aside from the very valuable side effects mentioned above a covered line only shows that the system under test can reach that line for some input data and that it can execute without hanging or crashing.
If this were a perfect world, 100% of code would be covered by unit tests. However, since this is NOT a perfect world, it's a matter of what you have time for. As a result, I recommend focusing less on a specific percentage, and focusing more on the critical areas. If your code is well-written (or at least a reasonable facsimile thereof) there should be several key points where APIs are exposed to other code.
Focus your testing efforts on these APIs. Make sure that the APIs are 1) well documented and 2) have test cases written that match the documentation. If the expected results don't match up with the docs, then you have a bug in either your code, documentation, or test cases. All of which are good to vet out.
Good luck!
Code coverage is just another metric. In and of itself, it can be very misleading (see www.thoughtworks.com/insights/blog/are-test-coverage-metrics-overrated). Your goal should therefore not be to achieve 100% code coverage but rather to ensure that you test all relevant scenarios of your application.
I prefer to do BDD, which uses a combination of automated acceptance tests, possibly other integration tests, and unit tests. The question for me is what the target coverage of the automated test suite as a whole should be.
That aside, the answer depends on your methodology, language, and testing and coverage tools. When doing TDD in Ruby or Python it's not hard to maintain 100% coverage, and it's well worth doing so. It's much easier to manage 100% coverage than 90-something percent coverage: it's much easier to fill coverage gaps as they appear (and when you do TDD well, coverage gaps are rare and usually worth the time to fill) than it is to manage a list of coverage gaps you haven't gotten around to, and to miss coverage regressions against a constant background of uncovered code.
The answer also depends on the history of your project. I've only found the above to be practical in projects managed that way from the start. I've greatly improved the coverage of large legacy projects, and it's been worth doing, but I've never found it practical to go back and fill every coverage gap, because old untested code is not well enough understood to do so correctly and quickly.
85% would be a good starting place for checkin criteria.
I'd probably choose a variety of higher bars for shipping criteria, depending on the criticality of the subsystems/components being tested.
Code coverage is great but only as long as the benefits that you get from it outweigh the cost/effort of achieving it.
We have been working to a standard of 80% for some time; however, we have just made the decision to abandon it and instead be more focused in our testing, concentrating on the complex business logic, etc.
This decision was taken due to the increasing amount of time we spent chasing code coverage and maintaining existing unit tests. We felt we had reached the point where the benefit we were getting from our code coverage was less than the effort we had to put in to achieve it.
I use cobertura, and whatever the percentage, I would recommend keeping the values in the cobertura-check task up-to-date. At the minimum, keep raising totallinerate and totalbranchrate to just below your current coverage, but never lower those values. Also tie in the Ant build failure property to this task. If the build fails because of lack of coverage, you know someone's added code but hasn't tested it. Example:
<!-- Per-class rates are left at 0; the aggregate rates act as the ratchet.
     Raise totallinerate and totalbranchrate as coverage improves; never lower them. -->
<cobertura-check linerate="0"
                 branchrate="0"
                 totallinerate="70"
                 totalbranchrate="90"
                 failureproperty="build.failed" />
<!-- Tie the failure property into the build so untested code breaks it. -->
<fail if="build.failed" message="Coverage fell below the configured thresholds." />
When I think my code isn't unit tested enough and I'm not sure what to test next, I use coverage to help me decide what to test.
If I increase coverage with a unit test, I know that unit test is worth something.
This goes for code that is not covered, 50% covered or 97% covered.
Short answer: 60-80%
Long answer:
I think it totally depends on the nature of your project. I typically start a project by unit testing every practical piece. By the first "release" of the project you should have a pretty good base percentage based on the type of programming you are doing. At that point you can start "enforcing" a minimum code coverage.
If you've been doing unit testing for a decent amount of time, I see no reason for it not to be approaching 95%+. However, at a minimum, I've always worked with 80%, even when new to testing.
This number should only include code written in the project (excluding frameworks, plugins, etc.) and maybe even exclude certain classes composed entirely of calls to outside code. That sort of call should be mocked/stubbed.
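For such classes, a stub keeps the test focused on your own logic rather than the external service. A minimal sketch using Python's standard unittest.mock; charge_customer and the gateway's charge call are invented for the example.

import unittest
from unittest import mock

def charge_customer(gateway, customer_id, amount):
    """Hypothetical wrapper whose only job is to call an external service."""
    response = gateway.charge(customer_id, amount)
    return response["status"] == "ok"

class ChargeCustomerTest(unittest.TestCase):
    def test_successful_charge(self):
        gateway = mock.Mock()
        gateway.charge.return_value = {"status": "ok"}

        self.assertTrue(charge_customer(gateway, "c-42", 10.0))
        gateway.charge.assert_called_once_with("c-42", 10.0)

if __name__ == "__main__":
    unittest.main()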
Generally speaking, from the several engineering-excellence best-practice papers I have read, 80% coverage for new code in unit tests is the point that yields the best return. Going above that percentage yields fewer additional defects found for the amount of effort exerted. This is a best practice used by many major corporations.
Unfortunately, most of these results are internal to companies, so there is no public literature I can point you to.
My answer to this conundrum is to have 100% line coverage of the code you can test and 0% line coverage of the code you can't test.
My current practice in Python is to divide my .py modules into two folders: app1/ and app2/ and when running unit tests calculate the coverage of those two folders and visually check (I must automate this someday) that app1 has 100% coverage and app2 has 0% coverage.
When/if I find that these numbers differ from the standard, I investigate and alter the design of the code so that coverage conforms to the standard.
This does mean that I can recommend achieving 100% line coverage of library code.
I also occasionally review app2/ to see if I could possibly test any code there, and if I can, I move it into app1/.
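The visual check could be automated along these lines with the coverage.py API. This is only a sketch under two assumptions: the test run already wrote a .coverage data file, and it was configured with source covering both folders (so that never-imported files in app2/ are recorded at 0% rather than omitted).

import sys
import coverage

cov = coverage.Coverage()
cov.load()   # read the .coverage data file written by the test run

# include= narrows the report to one folder; report() returns the total percentage.
app1_total = cov.report(include=["app1/*"])
app2_total = cov.report(include=["app2/*"])

if app1_total < 100.0 or app2_total > 0.0:
    sys.exit("Coverage does not conform to the 100%/0% standard -- review the design.")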
Now I'm not too worried about the aggregate coverage because that can vary wildly depending on the size of the project, but generally I've seen 70% to over 90%.
With Python, I should be able to devise a smoke test that automatically runs my app while measuring coverage, and hopefully gain an aggregate of 100% when combining the smoke test with the unittest figures.
Check out Crap4j. It's a slightly more sophisticated approach than straight code coverage. It combines code coverage measurements with complexity measurements, and then shows you what complex code isn't currently tested.
Viewing coverage from another perspective: Well-written code with a clear flow of control is the easiest to cover, the easiest to read, and usually the least buggy code. By writing code with clearness and coverability in mind, and by writing the unit tests in parallel with the code, you get the best results IMHO.
In my opinion, the answer is "It depends on how much time you have". I try to achieve 100% but I don't make a fuss if I don't get it with the time I have.
When I write unit tests, I wear a different hat compared to the hat I wear when developing production code. I think about what the tested code claims to do and what situations can possibly break it.
I usually follow the following criteria or rules:
The unit tests should be a form of documentation of the expected behavior of my code, i.e. the expected output given a certain input and the exceptions it may throw that clients may want to catch. (What should the users of my code know?)
The unit tests should help me discover the what-if conditions that I may not yet have thought of. (How do I make my code stable and robust? A small sketch of both rules follows below.)
If these two rules don't produce 100% coverage, then so be it. But once I have the time, I analyze the uncovered blocks and lines and determine whether there are still test cases missing or whether the code needs to be refactored to eliminate unnecessary code.
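As a small, made-up illustration of both rules in Python's unittest -- tests that document the expected output and the exceptions clients may catch, plus a what-if case -- divide() here is purely a placeholder for the code under test.

import unittest

def divide(numerator, denominator):
    """Hypothetical code under test."""
    if denominator == 0:
        raise ZeroDivisionError("denominator must not be zero")
    return numerator / denominator

class DivideTest(unittest.TestCase):
    def test_documents_expected_output(self):
        self.assertEqual(divide(10, 4), 2.5)

    def test_documents_exception_clients_may_catch(self):
        with self.assertRaises(ZeroDivisionError):
            divide(1, 0)

    def test_what_if_inputs_are_negative(self):
        # A "what if" condition I might not have thought about while coding.
        self.assertEqual(divide(-9, 3), -3.0)

if __name__ == "__main__":
    unittest.main()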
It depends greatly on your application. For example, some applications consist mostly of GUI code that cannot be unit tested.
I don't think there can be such a black-and-white rule.
Code should be reviewed, with particular attention to the critical details.
However, if it hasn't been tested, it has a bug!
Depending on the criticality of the code, anywhere from 75%-85% is a good rule of thumb.
Shipping code should definitely be tested more thoroughly than in-house utilities, etc.
This has to be dependent on what phase of your application development lifecycle you are in.
If you've been at development for a while, already have a lot of implemented code, and are just now realizing that you need to think about code coverage, then you have to check your current coverage (if it exists) and use that baseline to set milestones for each sprint (or an average rise over a period of sprints). That means taking on code debt while continuing to deliver end-user value (at least in my experience, the end user doesn't care one bit whether you've increased test coverage if they don't see new features).
Depending on your domain it's not unreasonable to shoot for 95%, but I'd say the typical case is going to be 85% to 90%.
I think the best sign of the right amount of code coverage is that the number of concrete problems the unit tests help to fix corresponds reasonably to the size of the unit test code you created.
I think that what may matter most is knowing what the coverage trend is over time and understanding the reasons for changes in the trend. Whether you view the changes in the trend as good or bad will depend upon your analysis of the reason.
We were targeting >80% until a few days back, but after we started using a lot of generated code we no longer care about the percentage; instead we let the reviewer make the call on the coverage required.
From the Testivus posting, I think the answer context should be the second programmer.
Having said this, from a practical point of view we need parameters/goals to strive for.
I consider that this can be "tested" in an Agile process by analyzing the code we have, the architecture, and the functionality (user stories), and then coming up with a number. Based on my experience in the telecom area, I would say that 60% is a good value to check.