Unit Testing Vocabulary: "coverage"

I'm preparing some educational/training material with respect to Unit Testing, and want to double check some vocabulary.
In the example I'm using, the developer has tested a Facade for all possible inputs but hasn't tested the more granular units 'behind' it.
Would one still say that the tests have "full coverage" - given they cover the entire range of inputs? I feel like "full coverage" is generally used to denote coverage of code/units... but there would certainly be full something when testing all possible inputs.
What's the other word I'm looking for?

If all possible inputs don't give you 100% code coverage, you have 100% scenario coverage, but not complete code coverage.
On that note, if you have 100% scenario coverage without full code coverage, you have dead code and you should think real hard about why it exists.
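To make the distinction concrete, here is a minimal Java sketch (hypothetical names): two tests cover the facade's entire input domain, yet a branch in the unit behind it is never executed.

    public final class GreetingFacade {
        // The facade's whole input domain is {true, false}.
        public String greet(boolean formal) {
            return format(formal ? "Good day" : "Hi");
        }

        // The more granular unit 'behind' the facade.
        static String format(String greeting) {
            if (greeting == null) {
                return "";   // unreachable through the facade
            }
            return greeting + "!";
        }
    }

    // greet(true) and greet(false) exercise every possible input (full scenario coverage),
    // but the null branch of format() never runs, so code coverage stays incomplete;
    // that uncovered branch is exactly the dead-code smell described above.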

If you decide to use 'full coverage' then you might run into trouble, because most literature that speaks about coverage (and indeed, the tools that measure coverage) talks about the lines of code that are executed in the code under test after all tests are run.
The test cases that you propose would be said to cover the domain of the function (and assuming a function that is at least 1-to-1, they will cover the range as well).

It is full code coverage of the involved classes, but clearly not of the full system source. Tools can report coverage at different levels.
Note that it doesn't guarantee the code is correct, as it might have missed a scenario that needs to be handled altogether (both in test and feature code). Additionally, the tests could trigger all code paths without asserting correctly.

Related

Code Coverage: Best practice on excluding non-testable code

We are currently discussing how to define our goals on code coverage in a C# project (this question is not limited to C# though). On the way, we found that we should exclude some code from being counted towards the code coverage. The most obvious are the tests themselves, as they have 100% coverage and should not influence the average. But there are also classes that are wrappers for system calls, which we need in order to be able to create mocks. They are untestable, as we don't want to test system libraries. That code is still counted towards the coverage, though, and makes it hard to move past the 90% mark.
We do not want to lie to ourselves by excluding every piece of code that is untested, which would make it fairly easy to walk towards 100%.
Is there any reference, article or discussion on this topic from people with experience in this area? We would like to explore the different views on this topic that may help us find and develop our definition of "testable code".
Code coverage isn't a terribly useful metric. It's helpful in giving management and onboarding developers a general idea of how much you value testing, but concerning yourself with it any more than that is pedantic and might actually serve to distract you from writing meaningful tests.
This is my opinion, but it is not an isolated one.
Don't worry about coverage, define a policy in terms of what your goals are, be clear and concise, make sure everyone is on the same page about the importance of unit testing, and have a code review process in place.
You don't need a tool, just some human intelligence.
Code coverage is not a metric for how complete your test suite is, but (if at all) for how incomplete it is. A high test coverage might be totally meaningless, depending on how your code coverage tool measures it.
To get a better picture which parts of your code need more tests, I would separate the code with the means of the programming language. Any sane code coverage tool should be able to show the result for these separately. Depending on how your application is structured, you can put the untestable code (i.e. the test suite itself, adapter code like "wrappers for system calls", controllers, UI elements, etc.) in separate namespaces or even separate assemblies.
The Java folks can replace "namespace" with "package" and "assembly" with "jar" in the paragraph above.
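As a rough illustration (hypothetical names), this is the kind of separation meant above: the logic-free wrapper lives in its own package or assembly, so any per-package coverage report shows it separately, and excluding it becomes one honest rule instead of a long list of exceptions.

    // Hypothetical layout: everything in com.example.myapp.adapters is a thin,
    // logic-free wrapper around a system call and is reported (or excluded) as a unit.
    package com.example.myapp.adapters;

    import java.time.Instant;

    public final class SystemClock {
        // The single untestable line, deliberately isolated here; production code
        // depends on this class and tests substitute a mock for it.
        public Instant now() {
            return Instant.now();
        }
    }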

Code coverage metrics when using groovy AST transforms

We use several AST transforms in our Groovy code, such as @ToString and @EqualsAndHashCode. We use these so we don't have to maintain and test them. The problem is that code coverage metrics (we use JaCoCo right now, but are open to change if it will help) don't know these are autogenerated methods, and they cause a lot of code to appear uncovered even though it's not actually code we're writing.
Is there a way to exclude these from coverage metrics in any tools?
I guess you could argue that since we're the ones putting the annotations on, we should still be testing the generated code; a unit test shouldn't care how these methods are created, only that they work.
I had a similar issue with @Log and the conditionals that it inserts into the code. That gets reported (by Cobertura) as a lack of branch coverage.
But as you said: it just reports it correctly. The code is not covered.
If you don't need the code, you should not have generated it. If you need it and aim for full test coverage, you must test it or at least "exercise" it, i.e. somehow use it from your test cases even without asserts.
From a test methodology standpoint, not covering generated code is just as questionable as using exclusion patterns. From a pragmatic standpoint, you may just want to live with it.
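If you choose the "exercise it" route, a minimal JUnit sketch could look like the following, assuming a hypothetical Groovy class Person annotated with @ToString and @EqualsAndHashCode; it makes the generated methods run without pretending to verify much.

    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertNotNull;

    import org.junit.Test;

    public class PersonGeneratedMethodsTest {
        @Test
        public void generatedMethodsAreExercised() {
            Person a = new Person("Ada");   // Person is the annotated Groovy class (assumed constructor)
            Person b = new Person("Ada");
            assertEquals(a, b);                        // generated by @EqualsAndHashCode
            assertEquals(a.hashCode(), b.hashCode());  // generated by @EqualsAndHashCode
            assertNotNull(a.toString());               // generated by @ToString
        }
    }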

Are there any tools which specifically measure useful code coverage?

It is possible, albeit counterproductive, to write a unit test which executes some code and asserts truth. This is a deliberately extreme and simplified example; I'm sure most people have come across tests which execute code without actually making use of it.
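A minimal JUnit sketch of such a test (hypothetical names): the production code runs and shows up as covered, but nothing about its result is ever checked.

    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    public class PriceCalculatorTest {
        @Test
        public void looksLikeATestButVerifiesNothing() {
            new PriceCalculator().totalWithTax(100.0);  // executes (and "covers") the code under test...
            assertTrue(true);                           // ...but the assertion says nothing about it
        }
    }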
Are there any code coverage tools which assess whether code covered is actually used as part of the assertions of the test?
What you want in essence is to compute the intersection between the covered code, and a backward code slice (minus the unit test itself) on the assertion in the covered test.
If that intersection is empty, the assertion doesn't test any part of the application.
Most code coverage tools don't compute slices, so I'd guess the answer to your question is "no".
You would need some kind of dependency analysis, i.e. to find out which statements the assertion depends on. Then this would have to be cross-checked with the coverage. CodeSurfer is one commercial tool that does dependency analysis for C.
Once again, I'd just investigate why this is happening. Tests added just for the sake of increasing coverage numbers are often a symptom of a more serious cause. Educating everyone instead of picking out specific offenders usually works out better.
For identifying mistakes that have already been made: You can use a static code analysis tool that checks that
each test has at least one assert. Of course this can also be defeated by rogue asserts.
there are no unused return values.
I saw the first one in something called TestLint from Roy Osherove; for the second, I can think of Parasoft or similar tools.
Still, this isn't a foolproof method, and it will sink significant time into poring through the issues the static analysis reports. But that's the best I can think of.

Pitfalls of code coverage [closed]

I'm looking for real world examples of some bad side effects of code coverage.
I noticed this happening at work recently because of a policy to achieve 100% code coverage. Code quality has been improving for sure but conversely the testers seem to be writing more lax test plans because 'well the code is fully unit tested'. Some logical bugs managed to slip through as a result. They were a REALLY BIG PAIN to debug because 'well the code is fully unit tested'.
I think that was partly because our tool did statement coverage only. Still, it could have been time better spent.
If anyone has other negative side effects of having a code coverage policy please share. I'd like to know what kind of other 'problems' are happening out there in the real-world.
Thanks in advance.
EDIT: Thanks for all the really good responses. There are a few which I would mark as the answer but I can only mark one unfortunately.
In a sentence: Code coverage tells you what you definitely haven't tested, not what you have.
Part of building a valuable unit test suite is finding the most important, high-risk code and asking hard questions of it. You want to make sure the tough stuff works as a priority. Coverage figures have no notion of the 'importance' of code, nor the quality of tests.
In my experience, many of the most important tests you will ever write are the tests that barely add any coverage at all (edge cases that add a few extra % here and there, but find loads of bugs).
The problem with setting hard (and potentially counter-productive) coverage targets is that developers may have to start bending over backwards to test their code. There's making code testable, and then there's just torture. If you hit 100% coverage with great tests then that's fantastic, but in most situations the extra effort is just not worth it.
Furthermore, people start obsessing/fiddling with numbers rather than focussing on the quality of the tests. I've seen badly written tests that have 90+% coverage, just as I've seen excellent tests that only have 60-70% coverage.
Again, I tend to look at coverage as an indicator of what definitely hasn't been tested.
Just because there's code coverage doesn't mean you're actually testing all paths through the function.
For example, this code has four paths:
if (A) { ... } else { ... }
if (B) { ... } else { ... }
However, just two tests (e.g. one with A and B true, one with A and B false) would give "100% code coverage."
This is a problem because the tendency is to stop testing once you've achieved the magic 100% number.
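Here is the same trap as a small JUnit sketch (hypothetical code): the two tests together execute every statement and branch, so the tool reports 100%, yet two of the four paths are never exercised.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class SettleTest {
        // Production code with four paths (inlined here to keep the sketch self-contained).
        static int settle(boolean premium, boolean overdue) {
            int fee = premium ? 0 : 10;   // decision A
            if (overdue) {
                fee += 5;                 // decision B
            }
            return fee;
        }

        // These two tests execute every statement, so the coverage report says 100%...
        @Test public void premiumAndOverdue()     { assertEquals(5, settle(true, true)); }
        @Test public void standardAndNotOverdue() { assertEquals(10, settle(false, false)); }
        // ...but the settle(true, false) and settle(false, true) paths are never run.
    }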
In my experience, the biggest problem with code coverage tools is the risk that somebody will fall victim to the belief that "high code coverage" equals "good testing." Most coverage tools just offer statement coverage metrics, as opposed to condition, data path or decision coverage. That means that it's possible to get 100% coverage on a bit of code like this:
for (int i = 0; i < MAX_RETRIES; ++i) {
    if (someFunction() == MAGIC_NUMBER) {
        break;
    }
}
... without ever testing the termination condition on the for loop.
Worse, it's possible to get very high "coverage" from a test that simply invokes your application, without bothering to validate the output, or validating it incorrectly.
Simply put, a low code coverage level is certainly an indication of insufficient testing, but a high coverage level is not an indication of sufficient or correct testing.
Sometimes corner cases are so rare they're not worth testing, yet a strict code-coverage rule requires you to test them anyway.
For example, in Java the MD5 algorithm is built in, but technically it's possible for an "unsupported algorithm" exception (NoSuchAlgorithmException) to be thrown. It is never thrown in practice, and your test would have to go through significant gyrations to exercise that path.
It would be a lot of work wasted.
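For reference, this is roughly what that looks like in Java: MessageDigest.getInstance declares a checked NoSuchAlgorithmException even though the platform is required to ship MD5, so the catch block is effectively unreachable and only hurts the coverage figure.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class Md5Util {
        public static byte[] md5(String input) {
            try {
                return MessageDigest.getInstance("MD5")
                                    .digest(input.getBytes(StandardCharsets.UTF_8));
            } catch (NoSuchAlgorithmException e) {
                // Effectively dead in practice: covering this branch would take
                // exactly the kind of gyrations described above.
                throw new IllegalStateException("MD5 not available", e);
            }
        }
    }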
In my opinion, the greatest danger a team runs from measuring code coverage is that it rewards large tests, and penalizes small ones. If you have the choice between writing a single test that covers a large portion of your application's functionality, and writing ten small tests which test a single method, only measuring code coverage implies that you should write the large test.
However, writing the set of 10 small tests will give you much less brittle tests, and will test your application much more thoroughly than the one large test will. Thus, by measuring code coverage, particularly in an organization with still evolving testing habits, you can often set up the wrong incentives.
I know this isn't a direct answer to your question, but...
Any testing, regardless of what type, is insufficient by itself. Unit testing/code coverage is for developers. QA still needs to test the system as a whole. Business users still need to test the system as a whole as well.
The converse, that QA tests the code completely so developers shouldn't test, is equally bad. Testing is complementary, and different tests provide different things. Each test type can miss things that another might find.
Just like the rest of development, don't take shortcuts with testing, it'll only let bugs through.
Writing overly targeted test cases.
Insufficient testing of input variability in the code.
A large number of artificial test cases being executed.
Not concentrating on the important test failures because of the noise.
Difficulty in assigning defects, because many conditions from many components must interact for a line to execute.
The worst side effect of having a 100% coverage goal is spending a large part of the testing development cycle (75%+) hitting corner cases. Another poor effect of such a policy is concentrating on hitting a particular line of code rather than on addressing the range of inputs. I don't really care that the strcpy function ran at least once. I really care that it ran against a wide variety of inputs. Having a policy is good, but an extremely draconian policy is bad. The 100% code coverage metric is neither necessary nor sufficient for code to be considered solid.
One of the largest pitfalls of code coverage is that people just talk about code coverage without actually specifying what type of code coverage they are talking about. The characteristics of C0, C1, C2 and even higher levels of code coverage are very different, so just talking about "code coverage" doesn't even make sense.
For example, achieving 100% full path coverage is pretty much impossible. If your program has n decision points, you need 2^n tests (and depending on the definition, every single bit in a value is a decision point, so to achieve 100% full path coverage for an extremely simple function that just adds two ints, you need 2^64 = 18446744073709551616 tests). If you have just one loop, you already need infinitely many tests.
OTOH, achieving 100% C0 coverage is trivial.
Another important thing to remember is that code coverage does not tell you what code was tested. It only tells you what code was run! You can try it out yourself: take a codebase that has 100% code coverage. Remove all the assertions from the tests. Now the codebase still has 100% coverage, but does not test a single thing! So, code coverage does not tell you what's tested, only what's not tested.
There are tools out there, Jumble for one, that go further than branch coverage by mutating your code to see if your tests fail for all the different permutations.
Directly from their website:
Jumble is a class level mutation testing tool that works in conjunction with JUnit. The purpose of mutation testing is to provide a measure of the effectiveness of test cases. A single mutation is performed on the code to be tested, the corresponding test cases are then executed. If the modified code fails the tests, then this increases confidence in the tests. Conversely, if the modified code passes the tests this indicates a testing deficiency.
Nothing wrong with code coverage - what I see wrong is the 100% figure. At some point the law of diminishing returns kicks in and it becomes more expensive to test the last 1% than the other 99%. Code coverage is a worthy goal but common sense goes a long way.
The idea that 100% code coverage means well-tested code is a complete myth. As developers we know the hard/complex/delicate parts of a system, and I would much rather see those areas properly tested and only get 50% coverage, rather than the meaningless figure that every line has been run at least once.
In terms of a real world example, the only team that I was on that had 100% coverage wrote some of the worst code I've ever seen. 100% coverage was used to replace code review - the result was predictably awful, to the extent that most code was thrown away, even though it passed the tests.
We have good tools for measuring code-coverage from unit tests. So it's tempting to rely on code-coverage of 100% to represent that you're "done testing." This is not true.
As other folks have mentioned, 100% code coverage doesn't prove that you have tested adequately, nor does 50% code coverage necessarily mean that you haven't tested adequately.
Measuring lines of code executed by tests is just one metric. You also have to test for a reasonable variety of function inputs, and also how the function or class behaves depending on some other external state. For example, some code functions differently based on the data in a database or in a file.
I've also blogged about this recently: http://karwin.blogspot.com/2009/02/unit-test-coverage.html
100% code coverage doesn't mean you're done with unit tests.
int divide(int a, int b) {
    return a / b;
}
With just 1 unit test, I get 100% code coverage for this function:
assert divide(4, 2) == 2;
Now, nobody would argue that this single test with 100% coverage indicates that the feature works just fine; divide(4, 0) still blows up with an ArithmeticException.
I think code coverage is a good element to know if you are missing any obvious code path, but I would use it carefully.

How do you measure the quality of your unit tests?

If you (or your organization) aspire to thoroughly unit test your code, how do you measure the success or quality of your efforts?
Do you use code coverage, and what percentage do you aim for?
Do you find that philosophies like TDD have a better impact than metrics?
My tip is not a way to determine whether you have good unit tests per se, but it's a way to grow a good test suite over time.
Whenever you encounter a bug, either in your development or reported by someone else, fix it twice. You first create a unit test that reproduces the problem. When you've got a failing test, then you go and fix the problem.
If a problem was there in the first place, it's a hint about a subtlety in the code or the domain. Adding a test for it lets you make sure it's never going to be reintroduced in the future.
Another interesting aspect about this approach is that it'll help you understand the problem from a higher level before you actually go and look at the intricacies of the code.
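A sketch of the first half of "fix it twice" (hypothetical names and bug): the test that reproduces the report is written and seen failing before the fix, and stays behind as a regression guard.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class InvoiceRoundingRegressionTest {
        // Reproduces the reported bug: totals were being rounded down instead of half-up.
        // This test fails against the buggy code; once it passes, the bug can't silently return.
        @Test
        public void totalIsRoundedHalfUp() {
            assertEquals(11, Invoice.roundTotal(10.5));  // Invoice.roundTotal is a hypothetical method
        }
    }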
Also, +1 for the value and pitfalls of test coverage already mentioned by others.
Code coverage is a useful metric but should be used carefully. Some people take code coverage, especially the percentage covered, a bit too seriously and see it as THE metric for good unit testing.
My experience tells me that rather than trying to get 100% coverage, which is not that easy, people should focus on checking that the critical sections are covered. But even then you may get false positives.
I am very much pro-TDD, but I don't place much importance in coverage stats. To me, the success and usefulness of unit tests is felt over a period of development time by the development team, as the tests (a) uncover bugs up front, (b) enable refactoring and change without regression, (c) help flesh out modular, decoupled design, (d) and whatnot.
Or, as Martin Fowler put it, the anecdotal evidence in support of unit tests and TDD is overwhelming, but you cannot measure productivity. Read more on his bliki here: http://www.martinfowler.com/bliki/CannotMeasureProductivity.html
To attain a full measure of confidence in your code you need different levels of testing: unit, integration and functional. I agree with the advice given above that states that testing should be automated (continuous integration) and that unit testing should cover all branches with a variety of edge case datasets. Code coverage tools (e.g. Cobertura, Clover, EMMA etc) can identify holes in your branches, but not in the quality of your test datasets. Static code analysis such as FindBugs, PMD, CPD can identify problem areas in your code before they become an issue and go a long way towards promoting better development practices.
Testing should attempt to replicate the overall environment that the application will be running in as much as possible. It should start from the simplest possible case (unit) to the most complex (functional). In the case of a web application, getting an automated process to run through all the use cases of your website with a variety of browsers is a must so something like SeleniumRC should be in your toolkit.
However, software exists to meet a business need so there is also testing against requirements. This tends to be more of a manual process based on functional (web) tests. Essentially, you'll need to build a traceability matrix against each requirement in the specification and the corresponding functional test. As functional tests are created they are matched up against one or more requirements (e.g. Login as Fred, update account details for password, logout again). This addresses the issue of whether or not the deliverable matches the needs of the business.
Overall, I would advocate a test driven development approach based on some flavour of automated unit testing (JUnit, NUnit, etc.). For integration testing I would recommend having a test database that is automatically populated at each build with a known dataset that illustrates common use cases but allows for other tests to build on. For functional testing you'll need some kind of user interface robot (SeleniumRC for web, Abbot for Swing, etc.). Metrics about each can easily be gathered during the build process and displayed on the CI server (e.g. Hudson) for all developers to see.
If it can break, it should be tested. If it can be tested, it should be automated.
If your primary way of measuring test quality is some automated metric, you've already failed.
Metrics can be misleading, and they can be gamed. And if the metric is the primary (or worse yet, only) means of judging quality they will be gamed (perhaps unintentionally).
Code coverage, for example, is deeply misleading because 100% code coverage is nowhere near complete test coverage. Also, a figure like "80% code coverage" is just as misleading without context. If that coverage is in the most complex bits of code and just misses the code which is so simple it's easy to verify by eye then that's significantly better than if that coverage is biased in the reverse way.
Also, it's important to distinguish between the test-domain of a test (its feature set, essentially) and its quality. Test quality is not determined by how much it tests, just as code quality isn't determined by a laundry list of features. Test quality is determined by how well a test does its job in testing. That's actually very difficult to sum up in an automated metric.
The next time you go to write a unit test, try this experiment. See how many different ways you can write it such that it has the same code coverage and tests the same code. See whether it's possible to write a very poor test that meets these criteria and a very good test as well. I think you may be surprised at the results.
Ultimately there's no substitute for experience and judgment. A human eye, hopefully several eyes, needs to look at the test code and decide if it's good or not.
Code coverage is to testing as testing is to programming. It can only tell you when there is a problem, it can't tell you when everything works. You should have 100% code coverage and beyond. Branches of code logic should be tested with several input values, fully exercising normal, edge, and corner cases.
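As a small illustration of "several input values" (hypothetical code): even a trivial branchy function deserves normal, edge, and out-of-range cases rather than a single pass-through call.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    public class ClampTest {
        // Function under test, inlined to keep the sketch self-contained.
        static int clamp(int value, int min, int max) {
            return Math.max(min, Math.min(max, value));
        }

        @Test public void normalValueIsUnchanged()   { assertEquals(5, clamp(5, 0, 10)); }
        @Test public void lowerEdgeIsKept()          { assertEquals(0, clamp(0, 0, 10)); }
        @Test public void upperEdgeIsKept()          { assertEquals(10, clamp(10, 0, 10)); }
        @Test public void valueBelowRangeIsClamped() { assertEquals(0, clamp(-3, 0, 10)); }
        @Test public void valueAboveRangeIsClamped() { assertEquals(10, clamp(99, 0, 10)); }
    }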
I normally do TDD, so I write the tests first, which helps me see how I want to be able to use the objects.
Then, when I'm writing the classes, for the most part I can spot common pitfalls (i.e. assumptions that I'm making, e.g. a variable being of a particular type, or range of values) and when these come up I write a specific test for that specific case.
Aside from that, and getting as good code coverage as possible (sometimes it's not possible to get 100%), you're more or less done. Then, if any bugs do come up in the future, you just make sure you write a test case that exposes the bug first, and will pass when it's fixed. Then fix as per normal.
Monitoring code coverage rates can be useful, but instead of focusing on an arbitrary target rate (80%, 90%, 100%?) I have found it useful to aim for a positive trend over time.
I think some best practices for unit tests are:
They must be self-contained, i.e. not require too much configuration and external dependencies to run. Let tests build their own dependencies like files and Web sites required for the tests to run.
Use unit tests to reproduce bugs before fixing them. This helps prevent the bugs from surfacing again in the future.
Use a code coverage tool to spot critical code that is not exercised by any unit tests.
Integrate unit tests with nightly builds and release builds.
Publish test result reports and code coverage reports to a Web site where everyone in the team can browse them. The publishing should ideally be automated and integrated into the build system.
Do not expect to reach 100% code coverage unless you develop mission critical software. It can be very costly to reach this level and will for most projects not be worth the effort.
The concept of mutation testing seems promising as a way to measure (test?) the quality of your unit tests. Mutation testing basically means making small "mutations" to your production code and then seeing if any unit test fails. Small mutations typically mean changing 'and' to 'or', or '<' to '<='. If one or more unit tests fail, it means the "mutant" was caught. If the mutant survives your unit test suite, it means you missed a test. When I apply mutation testing to code with 100% line and branch coverage, it typically finds a few spots where I missed tests.
See https://en.wikipedia.org/wiki/Mutation_testing for a description of the concept and links to tools.
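A tiny hand-worked example of the idea (hypothetical code): a mutation tool flips an operator and re-runs the suite; a surviving mutant points straight at the boundary test you forgot.

    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    public class WithdrawalMutationExample {
        // Production code under test.
        static boolean canWithdraw(int balance, int amount) {
            return amount <= balance;
        }

        // A mutation tool would generate a mutant such as "amount < balance".
        // This test alone lets that mutant survive, because 50 <= 100 and 50 < 100 both hold:
        @Test public void allowsWithdrawalBelowBalance() {
            assertTrue(canWithdraw(100, 50));
        }

        // The boundary test kills the mutant: 100 <= 100 is true, but 100 < 100 is false.
        @Test public void allowsWithdrawalOfExactBalance() {
            assertTrue(canWithdraw(100, 100));
        }
    }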
An additional technique I try to use is to partition your code into two parts. I've recently blogged about it here. The short description is to maintain your production code in two sets of libraries where one set (hopefully the larger set) has 100% line coverage (or better if you can measure it) and the other set (hopefully a tiny amount of code) has 0% coverage, yes zero percent coverage.
Your designs should allow this partitioning. This should make it easy to see the code that is not covered. Over time you may have ideas about how to move code from the smaller set to the larger set.
In addition to TDD, I have found myself writing saner tests since adopting BDD (e.g. http://rspec.info/).
The everlasting discussion is always to mock or not to mock. Some mocks might become more complex than the code they are testing (which usually points to bad separation of concerns).
Therefore I like the idea of having a metric like test complexity per code complexity, or, simplified: the number of test lines per line of code.
Mutation testing is the foolproof way; learn more about it at https://pitest.org/ and https://github.com/vmzakharov/mutate-test-kata/blob/master/MutationTestKata.pdf