I'm using functions instead of classes, and I find that I can't tell whether a function that another function relies on is a dependency that should be individually unit-tested or an internal implementation detail that should not be. How can you tell which one it is?
A little context: I'm writing a very simple Lisp interpreter which has an eval() function. It's going to have a lot of responsibilities, too many actually, such as evaluating symbols differently from lists (everything else evaluates to itself). When evaluating symbols, it has its own complex workflow (environment lookup), and when evaluating lists, it's even more complicated, since the list can be a macro, function, or special form, each of which has its own complex workflow and set of responsibilities.
I can't tell if my eval_symbol() and eval_list() functions should be considered internal implementation details of eval() which should be tested through eval()'s own unit tests, or genuine dependencies in their own right which should be unit-tested independently of eval()'s unit tests.
A significant motivation for the "unit test" concept is to control the combinatorial explosion of required test cases. Let's look at the examples of eval, eval_symbol and eval_list.
In the case of eval_symbol, we will want to test contingencies where the symbol's binding is:
missing (i.e. the symbol is unbound)
in the global environment
directly in the current environment
inherited from a containing environment
shadowing another binding
... and so on
In the case of eval_list, we will want to test (among other things) what happens when the list's function position contains a symbol with:
no function or macro binding
a function binding
a macro binding
eval_list will invoke eval_symbol whenever it needs a symbol's binding (assuming a LISP-1, that is). Let's say that there are S test cases for eval_symbol and L symbol-related test cases for eval_list. If we test each of these functions separately, we could get away with roughly S + L symbol-related test cases. However, if we wish to treat eval_list as a black box and to test it exhaustively without any knowledge that it uses eval_symbol internally, then we are faced with S x L symbol-related test cases (e.g. global function binding, global macro binding, local function binding, local macro binding, inherited function binding, inherited macro binding, and so on). That's a lot more cases. eval is even worse: as a black box the number of combinations can become incredibly large -- hence the term combinatorial explosion.
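To make that arithmetic concrete, here is a rough Python sketch of the S side of the split; the toy environment (make_env) and the UnboundSymbolError name are purely illustrative stand-ins for whatever your interpreter actually uses:

# Minimal sketch: a toy environment plus eval_symbol, standing in for the
# hypothetical interpreter under discussion, so the test structure is concrete.
import pytest

class UnboundSymbolError(Exception):
    pass

def make_env(bindings=None, parent=None):
    return {"bindings": dict(bindings or {}), "parent": parent}

def eval_symbol(sym, env):
    while env is not None:
        if sym in env["bindings"]:
            return env["bindings"][sym]
        env = env["parent"]
    raise UnboundSymbolError(sym)

# The S symbol-related cases live here, once each...
def test_unbound_symbol_raises():
    with pytest.raises(UnboundSymbolError):
        eval_symbol("x", make_env())

def test_inner_binding_shadows_outer():
    outer = make_env({"x": 1})
    inner = make_env({"x": 2}, parent=outer)
    assert eval_symbol("x", inner) == 2

# ...so the tests for eval_list (not shown) only need to cover what eval_list
# adds -- function vs. macro vs. no binding in call position -- and can treat
# symbol lookup as already proven, giving roughly S + L cases rather than S x L.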
So, we are faced with a choice of theoretical purity versus actual practicality. There is no doubt that a comprehensive set of test cases that exercises only the "public API" (in this case, eval) gives the greatest confidence that there are no bugs. After all, by exercising every possible combination we may turn up subtle integration bugs. However, the number of such combinations may be so prohibitively large as to preclude such testing. Not to mention that the programmer will probably make mistakes (or go insane) reviewing vast numbers of test cases that only differ in subtle ways. By unit-testing the smaller internal components, one can vastly reduce the number of required test cases while still retaining a high level of confidence in the results -- a practical solution.
So, I think the guideline for identifying the granularity of unit testing is this: if the number of test cases is uncomfortably large, start looking for smaller units to test.
In the case at hand, I would absolutely advocate testing eval, eval_list and eval_symbol as separate units precisely because of the combinatorial explosion. When writing the tests for eval_list, you can rely upon eval_symbol being rock solid and confine your attention to the functionality that eval_list adds in its own right. There are likely other testable units within eval_list as well, such as eval_function, eval_macro, eval_lambda, eval_arglist and so on.
My advice is quite simple: "Start somewhere!"
If you see the name of some def (or defun) that looks like it might be fragile, well, you probably want to test it, don't you?
If you're having some trouble trying to figure out how your client code can interface with some other code unit, well, you probably want to write some tests somewhere that let you create examples of how to properly use that function.
If some function seems sensitive to data values, well, you might want to write some tests that not only verify it can handle any reasonable inputs properly, but also specifically exercise boundary conditions and odd or unusual data inputs (there's a small sketch of this after the list below).
Whatever seems bug-prone should have tests.
Whatever seems unclear should have tests.
Whatever seems complicated should have tests.
Whatever seems important should have tests.
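As a hedged illustration of the boundary-condition advice above, here is a minimal Python sketch; clamp is a made-up function used purely so the tests have something concrete to exercise:

# Illustrative only: a tiny data-sensitive function with tests that hit the
# boundaries and odd inputs, not just the happy path.
import pytest

def clamp(value, low, high):
    """Constrain value to the inclusive range [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))

@pytest.mark.parametrize("value, expected", [
    (-1, 0),   # just below the lower bound
    (0, 0),    # exactly on the lower bound
    (5, 5),    # comfortably inside
    (10, 10),  # exactly on the upper bound
    (11, 10),  # just above the upper bound
])
def test_clamp_boundaries(value, expected):
    assert clamp(value, 0, 10) == expected

def test_clamp_rejects_inverted_range():
    with pytest.raises(ValueError):
        clamp(5, 10, 0)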
Later, you can go about increasing your coverage to 100%. But you'll find that you will probably get 80% of your real results from the first 20% of your unit test coding (Inverted "Law of the Critical Few").
So, to review the main point of my humble approach, "Start somewhere!"
Regarding the last part of your question, I would recommend you think about any possible recursion or any additional possible reuse by "client" functions that you or subsequent developers might create in the future that would also call eval_symbol() or eval_list().
Regarding recursion, the functional programming style uses it a lot and it can be difficult to get right, especially for those of us who come from procedural or object-oriented programming, where recursion seems rarely encountered. The best way to get recursion right is to precisely target any recursive features with unit tests to make certain all possible recursive use cases are validated.
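For instance, a recursion-focused test group might look like the following Python sketch; count_atoms is a made-up helper, and the point is only that each recursive shape (empty, flat, nested, deep) gets its own targeted case:

# Sketch: targeting the recursive paths of a hypothetical helper directly.
def count_atoms(form):
    """Count non-list leaves in an arbitrarily nested list structure."""
    if not isinstance(form, list):
        return 1
    return sum(count_atoms(item) for item in form)

def test_empty_list_has_no_atoms():
    assert count_atoms([]) == 0

def test_flat_list():
    assert count_atoms(["a", "b", "c"]) == 3

def test_nested_lists_recurse():
    assert count_atoms(["a", ["b", ["c", "d"]], []]) == 4

def test_deep_nesting_terminates():
    deep = "x"
    for _ in range(50):
        deep = [deep]
    assert count_atoms(deep) == 1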
Regarding reuse, if your functions are likely to be invoked by anything other than a single use by your eval() function, they should probably be treated as genuine dependencies that deserve independent unit tests.
As a final hint, the term "unit" has a technical definition in the domain of unit testing: "the smallest piece of software that can be tested in isolation." That is a very old, fundamental definition that may quickly clarify your situation for you.
This is somewhat orthogonal to the content of your question, but directly addresses the question posed in the title.
Idiomatic functional programming involves mostly side effect-free pieces of code, which makes unit testing easier in general. Defining a unit test typically involves asserting a logical property about the function under test, rather than building large amounts of fragile scaffolding just to establish a suitable test environment.
As an example, let's say we're testing extendEnv and lookupEnv functions as part of an interpreter. A good unit test for these functions would check that if we extend an environment twice with the same variable bound to different values, only the most recent value is returned by lookupEnv.
In Haskell, a test for this property might look like:
test =
  let env = extendEnv "x" 5 (extendEnv "x" 6 emptyEnv)
  in lookupEnv env "x" == Just 5
This test gives us some assurance, and doesn't require any setup or teardown other than creating the env value that we're interested in testing. However, the values under test are very specific. This only tests one particular environment, so a subtle bug could easily slip by. We'd rather make a more general statement: for all variables x and values v and w, if an environment env is extended with x bound to v after x was bound to w, then lookupEnv env x == Just v.
In general, we need a formal proof (perhaps mechanized with a proof assistant like Coq, Agda, or Isabelle) in order to show that a property like this holds. However, we can get much closer than specifying test values by using QuickCheck, a library available for most functional languages that generates large amounts of arbitrary test input for properties we define as boolean functions:
prop_test x v w env' =
  let env = extendEnv x v (extendEnv x w env')
  in lookupEnv env x == Just v
At the prompt, we can have QuickCheck generate arbitrary inputs to this function, and see whether it remains true for all of them:
*Main> quickCheck prop_test
+++ OK, passed 100 tests.
*Main> quickCheckWith (stdArgs { maxSuccess = 1000 }) prop_test
+++ OK, passed 1000 tests.
QuickCheck uses some very nice (and extensible) magic to produce these arbitrary values, but it's functional programming that makes having those values useful. By making side effects the exception (sorry) rather than the rule, unit testing becomes less of a task of manually specifying test cases, and more a matter of asserting generalized properties about the behavior of your functions.
This process will surprise you frequently. Reasoning at this level gives your mind extra chances to notice flaws in your design, making it more likely that you'll catch errors before you even run your code.
I'm not really aware of any particular rule of thumb for this. But it seems like you should be asking yourself two questions:
Can you define the purpose of eval_symbol and eval_list without needing to say "part of the implementation of eval"?
If you see a test fail for eval, would it be useful to see whether any tests for eval_symbol and eval_list also fail?
If the answer to either of those is yes, I would test them separately.
A few months ago I wrote a simple "almost Lisp" interpreter in Python for an assignment. I designed it using the Interpreter design pattern and unit tested the evaluation code. Then I added the printing and parsing code and transformed the test fixtures from the abstract syntax representation (objects) to concrete syntax strings. Part of the assignment was to program simple recursive list-processing functions, so I added them as functional tests.
To answer your question in general, the rules are pretty much the same as for OO. You should have all your public functions covered. In OO, public methods are part of a class or an interface; in functional programming you most often have visibility control based around modules (similar to interfaces). Ideally, you would have full coverage for all functions, but if this isn't possible, consider a TDD approach - start by writing tests for what you know you need and implement it. Auxiliary functions will be the result of refactoring, and since you wrote tests for everything important beforehand, if the tests still pass after refactoring, you are done and can write the next test (iterate).
Good luck!
Recently, I took ownership of some C++ code. I am going to maintain this code, and add new features later on.
I know many people say that it is usually not worth adding unit tests to existing code, but I would still like to add some tests which will at least partially cover the code. In particular, I would like to add tests which reproduce bugs which I fixed.
Some of the classes are constructed with some pretty complex state, which can make it more difficult to unit-test.
I am also willing to refactor the code to make it easier to test.
Is there any good article you recommend on guidelines which help to identify classes which are easier to unit-test? Do you have any advice of your own?
While Martin Fowler's book on refactoring is a treasure trove of information, why not take a look at Michael Feathers' "Working Effectively with Legacy Code"?
Also, if you're going to be dealing with classes where there are a ton of global variables or huge amounts of state transitions, I'd put in a lot of integration checks. Separate out as much as you can of the code which interacts with the code you're refactoring, to make sure that all expected inputs, in the order they are received, continue to produce the same outputs. This is critical, as it's very easy to "fix" a subtle bug that might have been addressed somewhere else.
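The idea is language-agnostic; here is a minimal sketch in Python of such a characterization (golden-master) check, where legacy_transform stands in for the real code whose observed behaviour you want to pin down before touching it:

# Language-agnostic idea, sketched in Python: pin down the current observable
# behaviour of the legacy unit before refactoring it.
import json

def legacy_transform(record):          # placeholder for the code under change
    return {"id": record["id"], "total": record["qty"] * record["price"]}

EXPECTED = [                           # captured once from the current build
    {"id": 1, "total": 6},
    {"id": 2, "total": 0},
]

def test_refactoring_preserves_observed_behaviour():
    inputs = [
        {"id": 1, "qty": 2, "price": 3},
        {"id": 2, "qty": 0, "price": 9},
    ]
    actual = [legacy_transform(r) for r in inputs]
    # Serialize so that subtle type or ordering drift shows up in the diff.
    assert json.dumps(actual, sort_keys=True) == json.dumps(EXPECTED, sort_keys=True)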
Take notes too. If you do find that there is a bug which another function/class expects and handles properly you'll want to change both at the same time. That's difficult unless you keep thorough records.
Presumably the code was written for a purpose, and a unit test will check if the purpose is met, i.e. the pre-conditions and post-conditions hold for the methods.
If the public class methods are such that you can externally check the state it can be unit tested easily enough (black-box test). If the class state is invisible or if you have to test tricky private methods, your test class may need to be a friend (white-box test).
A class that is hard to unit test will be one that
Has enormous dependencies, i.e. tightly coupled
Is intended to work in a high-volume or multi-threaded environment. There you would use a system test rather than a unit test, and the actual output may not be fully deterministic.
I've written a fair number of blog posts about unit testing non-trivial C++ code: http://www.lenholgate.com/blog/2004/05/practical-testing.html
I've also written quite a lot about adding tests to existing code: http://www.lenholgate.com/blog/testing/
Almost everything can and should be unit tested. If not directly, then by using mock classes.
Since you decided to refactor your classes, try to use BDD or TDD approach.
To prevent breaking existing functionality, the only way is to have good integration tests, but usually it takes time to execute them all for a complex system.
Without more details on what you do, it is not that easy to give more implementation details. Some are:
use MVP or Presenter First for developing GUIs
use design patterns where appropriate
use function and member pointers, or observer design pattern to break dependencies
I think that if you're having to come up with some "measure" to test if a class is testable, you're already fscked. You should be able to tell just by looking at it: can you write an independent program that links to this class alone and makes sure it works?
If a class is too huge so that you can't be sure just by looking at it...chances are it probably isn't testable. People that don't know how to make small, distinct interfaces generally don't know how to adhere to any other principle either.
In the end though, the way to find out if a class is testable is to try to put it in a harness. If you end up having to pull in half your program to do it, try refactoring. If you find that you can't even perform the most basic refactor without having to rewrite the entire program, analyze the expense of doing so.
We at IPL published a paper, "It's testing Jim, but not as we know it", which explores the practical problems of testing C++ and suggests some techniques to address them that may well be of use given your question. These techniques are also well supported in Cantata++ - our C/C++ unit and integration testing tool.
When I write code I only write the functions I need as I need them.
Does this approach also apply to writing tests?
Should I write a test in advance for every use-case I can think of just to play it safe or should I only write tests for a use-case as I come upon it?
I think that when you write a method you should test both expected and potential error paths. This doesn't mean that you should expand your design to encompass every potential use -- leave that for when it's needed, but you should make sure that your tests have defined the expected behavior in the face of invalid parameters or other conditions.
YAGNI, as I understand it, means that you shouldn't develop features that are not yet needed. In that sense, you shouldn't write a test that drives you to develop code that's not needed. I suspect, though, that's not what you are asking about.
In this context I'd be more concerned with whether you should write tests that cover unexpected uses -- for example, errors due to passing null or out-of-range parameters -- or repeated tests that differ only with respect to the data, not the functionality. In the former case, as I indicated above, I would say yes. Your tests will document the expected behavior of your method in the face of errors. This is important information to people who use your method.
In the latter case, I'm less able to give you a definitive answer. You certainly want your tests to remain DRY -- don't write a test that simply repeats another test even if it has different data. On the other hand, you may not discover potential design issues unless you exercise the edge cases of your data. A simple example is a method that computes the sum of two integers: what happens if you pass it maxint as both parameters? If you only have one test, then you may miss this behavior. Obviously, this is related to the previous point. Only you can be sure when a test is really needed or not.
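One way to hit edge-case data without repeating near-identical tests is a table-driven (parametrized) test; a small Python sketch, where add is a trivial stand-in for the method in question:

# Sketch: data-only variations expressed as one parametrized test rather than
# many near-identical copies.
import sys
import pytest

def add(a, b):
    return a + b

@pytest.mark.parametrize("a, b, expected", [
    (1, 2, 3),                                   # ordinary case
    (0, 0, 0),                                   # identity
    (-1, 1, 0),                                  # sign boundary
    (sys.maxsize, sys.maxsize, 2 * sys.maxsize), # the "maxint" question made explicit
])
def test_add_edge_data(a, b, expected):
    assert add(a, b) == expected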
Yes YAGNI absolutely applies to writing tests.
As an example, I, for one, do not write tests to check any Properties. I assume that properties work a certain way, and until I come to one that does something different from the norm, I won't have tests for them.
You should always consider the validity of writing any test. If there is no clear benefit to you in writing the test, then I would advise that you don't. However, this is clearly very subjective, since what you might think is not worth it someone else could think is very worth the effort.
Also, would I write tests to validate input? Absolutely. However, I would do it to a point. Say you have a function with 3 parameters that are ints and it returns a double. How many tests are you going to write around that function? I would use YAGNI here to determine which tests are going to get you a good ROI, and which are useless.
Write the test as you need it. Tests are code. Writing a bunch of (initially failing) tests up front breaks the red/fix/green cycle of TDD, and makes it harder to identify valid failures vs. unwritten code.
You should write the tests for the use cases you are going to implement during this phase of development.
This gives the following benefits:
Your tests help define the functionality of this phase.
You know when you've completed this phase because all of your tests pass.
You should write tests that cover all your code, ideally. Otherwise, the rest of your tests lose value, and you will in the end debug that piece of code repeatedly.
So, no. YAGNI does not include tests :)
There is of course no point in writing tests for use cases you're not sure will get implemented at all - that much should be obvious to anyone.
For use cases you know will get implemented, test cases are subject to diminishing returns, i.e. trying to cover each and every possible obscure corner case is not a useful goal when you can cover all important and critical paths with half the work - assuming, of course, that the cost of overlooking a rarely occurring error is endurable; I would certainly not settle for anything less than 100% code and branch coverage when writing avionics software.
You'll probably get some variance here, but generally, the goal of writing tests (to me) is to ensure that all your code is functioning as it should, without side effects, in a predictable fashion and without defects. In my mind, then, the approach you discuss of only writing tests for use cases as they are come upon does you no real good, and may in fact cause harm.
What if the particular use case for the unit under test that you ignore causes a serious defect in the final software? Has the time spent developing tests bought you anything in this scenario beyond a false sense of security?
(For the record, this is one of the issues I have with using code coverage to "measure" test quality -- it's a measurement that, if low, may give an indication that you're not testing enough, but if high, should not be used to assume that you are rock-solid. Get the common cases tested, the edge cases tested, then consider all the ifs, ands and buts of the unit and test them, too.)
Mild Update
I should note that I'm coming from possibly a different perspective than many here. I often find that I'm writing library-style code, that is, code which will be reused in multiple projects, for multiple different clients. As a result, it is generally impossible for me to say with any certainty that certain use cases simply won't happen. The best I can do is either document that they're not expected (and hence may require updating the tests afterward), or -- and this is my preference :) -- just write the tests. I often find option #2 is far more livable on a day-to-day basis, simply because I have much more confidence when I'm reusing component X in new application Y. And confidence, in my mind, is what automated testing is all about.
You should certainly hold off writing test cases for functionality you're not going to implement yet. Tests should only be written for existing functionality or functionality you're about to put in.
However, use cases are not the same as functionality. You only need to test the valid use cases that you've identified, but there's going to be a lot of other things that might happen, and you want to make sure those inputs get a reasonable response (which could well be an error message).
Obviously, you aren't going to get all the possible use cases; if you could, there'd be no need to worry about computer security. You should get at least the more plausible ones, and as problems come up you should add them to the use cases to test.
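As a sketch of what "a reasonable response" can look like in a test, assuming a hypothetical parse_quantity helper that is documented to reject bad input with a clear message:

# Illustrative only: the valid use case and the "anything else" cases both get
# a test, and the invalid inputs are required to fail loudly and clearly.
import pytest

def parse_quantity(text):
    """Hypothetical input-validation helper for the sake of the example."""
    if not text.strip().isdigit():
        raise ValueError(f"quantity must be a non-negative integer, got {text!r}")
    return int(text)

def test_valid_quantity():
    assert parse_quantity(" 3 ") == 3

@pytest.mark.parametrize("bad", ["", "abc", "-1", "2.5"])
def test_invalid_quantity_reports_a_clear_error(bad):
    with pytest.raises(ValueError, match="quantity must be"):
        parse_quantity(bad)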
I think the answer here is, as it is in so many places, it depends. If the contract that a function presents states that it does X, and I see that it's got associated unit tests, etc., I'm inclined to think it's a well-tested unit and use it as such, even if I don't use it that exact way elsewhere. If that particular usage pattern is untested, then I might get confusing or hard-to-trace errors. For this reason, I think a test should cover all (or most) of the defined, documented behavior of a unit.
If you choose to test more incrementally, I might add to the doc comments that the function is "only tested for [certain kinds of input], results for other inputs are undefined".
I frequently find myself writing tests, TDD, for cases that I don't expect the normal program flow to invoke. The "fake it 'til you make it" approach has me starting, generally, with a null input - just enough to have an idea in mind of what the function call should look like, what types its parameters will have and what type it will return. To be clear, I won't just send null to the function in my test; I'll initialize a typed variable to hold the null value; that way when Eclipse's Quick Fix creates the function for me, it already has the right type. But it's not uncommon that I won't expect the program normally to send a null to the function. So, arguably, I'm writing a test that I AGN. But if I start with values, sometimes it's too big a chunk. I'm both designing the API and pushing its real implementation from the beginning. So, by starting slow and faking it 'til I make it, sometimes I write tests for cases I don't expect to see in production code.
If you're working in a TDD or XP style, you won't be writing anything "in advance" as you say, you'll be working on a very precise bit of functionality at any given moment, so you'll be writing all the necessary tests in order make sure that bit of functionality works as you intend it to.
Test code is similar to "code" itself; you won't be writing code in advance for every use case your app has, so why would you write test code in advance?
I've started to look into the whole unit testing/test-driven development idea, and the more I think about it, the more it seems to fill a similar role to static type checking. Both techniques provide a compile-time, rapid-response check for certain kinds of errors in your program. However, correct me if I'm wrong, but it seems that a unit test suite with full coverage would test everything static type checking would test, and then some. Or phrased another way, static type checks only go part of the way to "prove" that your program is correct, whereas unit tests will let you "prove" as much as you want (to a certain extent).
So, is there any reason to use a language with static type checking if you're using unit testing as well? A somewhat similar question was asked here, but I'd like to get into more detail. What specific advantages, if any, does static type checking have over unit tests? A few issues like compiler optimizations and intellisense come to mind, but are there other solutions for those problems? Are there other advantages/disadvantages I haven't thought of?
There is one immutable fact about software quality.
If it can't compile, it can't ship
In this rule, statically typed languages will win over dynamically typed languages.
Ok, yes this rule is not immutable. Web Apps can ship without compiling (I've deployed many test web apps that didn't compile). But what is fundamentally true is
The sooner you catch an error, the cheaper it is to fix
A statically typed language will prevent real errors from happening at one of the earliest possible moments in the software development cycle. A dynamic language will not. Unit testing, if you are thorough to a superhuman level, can take the place of a statically typed language.
However, why bother? There are a lot of incredibly smart people out there writing an entire error-checking system for you in the form of a compiler. If you're concerned about getting errors sooner, use a statically typed language.
Please do not take this post as a bashing of dynamic languages. I use dynamic languages daily and love them. They are incredibly expressive and flexible and allow for incredibly fascinating programs. However, in the case of early error reporting, they do lose to statically typed languages.
For any reasonably sized project, you just cannot account for all situations with unit tests only.
So my answer is "no", and even if you manage to account for all situations, you've thereby defeated the whole purpose of using a dynamic language in the first place.
If you want to program type-safe, better use a type-safe language.
I would think that automated unit testing will be important to dynamic typed languages, but that doesn't mean it would replace static type checking in the context that you apply. In fact, some of those who use dynamic typing might actually be using it because they do not want the hassles of constant type safety checks.
The advantages dynamically typed languages offer over static typed languages go far beyond testing, and type safety is merely one aspect. Programming styles and design differences over dynamic and static typed languages also vary greatly.
Besides, a need to write unit tests that vigorously enforce type safety would suggest that the software shouldn't be dynamically typed after all, or that the design should be implemented in a statically typed language rather than a dynamic one.
Having 100% code coverage doesn't mean you have fully tested your application. Consider the following code:
if (qty > 3)
{
applyShippingDiscount();
}
else
{
chargeFullAmountForShipping();
}
I can get 100% code coverage if I pump in values of qty = 1 and qty = 4.
Now imagine my business condition was "...for orders of 3 or more items I am to apply a discount to the shipping costs...". Then I would need to be writing tests that work on the boundaries. So I would design tests where qty was 2, 3 and 4. I still have 100% coverage, but more importantly I have found a bug in my logic.
And that is the problem that I have with focusing on code coverage alone. I think that at best you end up with a situation where the developer creates some initial tests based on the business rules. Then, in order to drive up the coverage number, they reference their code when designing new test cases.
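Translated into a runnable Python sketch (the buggy comparison from the fragment above is reproduced deliberately), the boundary tests at 2, 3 and 4 are exactly the ones that expose the defect which the two coverage-only values miss:

# Deliberately reproduces the bug from the fragment above: the business rule
# says "3 or more items", but the code checks qty > 3.
def shipping_discount_applies(qty):
    return qty > 3          # BUG: should be qty >= 3

def test_two_items_pay_full_shipping():
    assert shipping_discount_applies(2) is False

def test_three_items_get_the_discount():
    # This boundary test fails against the buggy comparison above,
    # even though coverage was already 100% with qty = 1 and qty = 4.
    assert shipping_discount_applies(3) is True

def test_four_items_get_the_discount():
    assert shipping_discount_applies(4) is True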
Manifest typing (which I suppose you mean) is a form of specification; unit testing is much weaker, since it only provides examples. The important difference is that a specification declares what has to hold in any case, while a test only covers examples. You can never be sure that your tests cover all boundary conditions.
People also tend to forget the value of declared types as documentation. For example if a Java method returns a List<String>, then I instantly know what I get, no need to read documentation, test cases or even the method code itself. Similarly for parameters: if the type is declared then I know what the method expects.
The value of declaring the type of local variables is much lower since in well-written code the scope of the variable's existence should be small. You can still use static typing, though: instead of declaring the type you let the compiler infer it. Languages like Scala or even C# allow you to do just this.
Some styles of testing get closer to a specification, e.g. QuickCheck or its Scala variant ScalaCheck generate tests based on specifications, trying to guess the important boundaries.
I would word it a different way--if you don't have a statically-typed language, you had better have very thorough unit tests if you plan on doing anything "real" with that code.
That said, static typing (or rather, explicit typing) has some significant benefits over unit tests that make me prefer it generally. It creates much more understandable APIs and allows for quick viewing of the "skeleton" of an application (i.e. the entry points to each module or section of code) in a way that is much more difficult with a dynamically-typed language.
To sum up: in my opinion, given solid, thorough unit tests, the choice between a dynamically-typed language and a statically-typed language is mostly one of taste. Some people prefer one; others prefer the other. Use the right tool for the job. But this doesn't mean they're identical--statically-typed languages will always have an edge in certain ways, and dynamically-typed languages will always have an edge in certain different ways. Unit tests go a long way towards minimizing the disadvantages of dynamically-typed languages, but they do not eliminate them completely.
No.
But that's not the most important question, the most important question is: does it matter that it can't?
Consider the purpose of static type checking: avoiding a class of code defects (bugs). However, this has to be weighed in the context of the larger domain of all code defects. What matters most is not a comparison along a narrow sliver but a comparison across the depth and breadth of code quality, ease of writing correct code, etc. If you can come up with a development style / process which enables your team to produce higher quality code more efficiently without static type checking, then it's worth it. This is true even in the case where you have holes in your testing that static type checking would catch.
I suppose it could if you are very thorough. But why bother? If the language is already checking to ensure static types are correct, there would be no point in testing them (since you get it for free).
Also, if you are using static typed languages with an IDE, the IDE can provide you with errors and warnings, even before compiling to test. I am not certain there are any automated unit testing applications that can do the same.
Given all the benefits of dynamic, late-binding languages, I suppose that's one of the values offered by unit tests. You'll still need to code carefully and intentionally, but that's the #1 requirement for any kind of coding IMHO. Being able to write clear and simple tests helps prove the clarity and simplicity of your design and your implementation. It also provides useful clues for those who see your code later. But I don't think I'd count on it to detect mismatched types. In practice, though, I don't find that type checking really catches many real errors anyway. It's just not a kind of error I find occurring in real code, if you have a clear and simple coding style in the first place.
For JavaScript, I would expect that JSLint will find almost all type-checking issues, primarily by suggesting alternate coding styles to decrease your exposure.
Type checking helps enforce contracts between components in a system. Unit testing (as the name implies) verifies the internal logic of components.
For a single unit of code, I think unit testing really can make static type checking unnecessary. But in a complex system, automated tests cannot verify all the multitude ways that different components of the system might interact. For this, the use of interfaces (which are, in a sense, a kind of "contract" between components) becomes a useful tool for reducing potential errors. And interfaces require compile-time type checking.
I really enjoy programming in dynamic languages, so I'm certainly not bashing dynamic typing. This is just a problem that recently occurred to me. Unfortunately I don't really have any experience in using a dynamic language for a large and complex system, so I'd be interested to hear from other people whether this problem is real, or merely theoretical.
I ruined several unit tests some time ago when I went through and refactored them to make them more DRY--the intent of each test was no longer clear. It seems there is a trade-off between tests' readability and maintainability. If I leave duplicated code in unit tests, they're more readable, but then if I change the SUT, I'll have to track down and change each copy of the duplicated code.
Do you agree that this trade-off exists? If so, do you prefer your tests to be readable, or maintainable?
Readability is more important for tests. If a test fails, you want the problem to be obvious. The developer shouldn't have to wade through a lot of heavily factored test code to determine exactly what failed. You don't want your test code to become so complex that you need to write unit-test-tests.
However, eliminating duplication is usually a good thing, as long as it doesn't obscure anything, and eliminating the duplication in your tests may lead to a better API. Just make sure you don't go past the point of diminishing returns.
Duplicated code is a smell in unit test code just as much as in other code. If you have duplicated code in tests, it makes it harder to refactor the implementation code because you have a disproportionate number of tests to update. Tests should help you refactor with confidence, rather than be a large burden that impedes your work on the code being tested.
If the duplication is in fixture set up, consider making more use of the setUp method or providing more (or more flexible) Creation Methods.
If the duplication is in the code manipulating the SUT, then ask yourself why multiple so-called “unit” tests are exercising the exact same functionality.
If the duplication is in the assertions, then perhaps you need some Custom Assertions. For example, if multiple tests have a string of assertions like:
assertEqual('Joe', person.getFirstName())
assertEqual('Bloggs', person.getLastName())
assertEqual(23, person.getAge())
Then perhaps you need a single assertPersonEqual method, so that you can write assertPersonEqual(Person('Joe', 'Bloggs', 23), person). (Or perhaps you simply need to overload the equality operator on Person.)
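A hedged sketch of such a custom assertion in Python's unittest style; Person here is a minimal stand-in with attribute access instead of the getters shown above:

# Sketch: a custom assertion that names the relation being checked, so each
# test reads as one line instead of three.
import unittest
from collections import namedtuple

Person = namedtuple("Person", "first_name last_name age")

class PersonAssertions(unittest.TestCase):
    def assertPersonEqual(self, expected, actual):
        self.assertEqual(expected.first_name, actual.first_name)
        self.assertEqual(expected.last_name, actual.last_name)
        self.assertEqual(expected.age, actual.age)

class TestRegistration(PersonAssertions):
    def test_new_person_is_stored_verbatim(self):
        person = Person("Joe", "Bloggs", 23)
        self.assertPersonEqual(Person("Joe", "Bloggs", 23), person)

if __name__ == "__main__":
    unittest.main()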
As you mention, it is important for test code to be readable. In particular, it is important that the intent of a test is clear. I find that if many tests look mostly the same, (e.g. three-quarters of the lines the same or virtually the same) it is hard to spot and recognise the significant differences without carefully reading and comparing them. So I find that refactoring to remove duplication helps readability, because every line of every test method is directly relevant to the purpose of the test. That's much more helpful for the reader than a random combination of lines that are directly relevant, and lines that are just boilerplate.
That said, sometimes tests are exercising complex situations that are similar but still significantly different, and it is hard to find a good way to reduce the duplication. Use common sense: if you feel the tests are readable and make their intent clear, and you're comfortable with perhaps needing to update more than a theoretically minimal number of tests when refactoring the code invoked by the tests, then accept the imperfection and move on to something more productive. You can always come back and refactor the tests later, when inspiration strikes!
Implementation code and tests are different animals and factoring rules apply differently to them.
Duplicated code or structure is always a smell in implementation code. When you start having boilerplate in implementation, you need to revise your abstractions.
On the other hand, testing code must maintain a level of duplication. Duplication in test code achieves two goals:
Keeping tests decoupled. Excessive test coupling can make it hard to change a single failing test that needs updating because the contract has changed.
Keeping the tests meaningful in isolation. When a single test is failing, it must be reasonably straightforward to find out exactly what it is testing.
I tend to ignore trivial duplication in test code as long as each test method stays shorter than about 20 lines. I like when the setup-run-verify rhythm is apparent in test methods.
When duplication creeps up in the "verify" part of tests, it is often beneficial to define custom assertion methods. Of course, those methods must still test a clearly identified relation that can be made apparent in the method name: assertPegFitsInHole -> good, assertPegIsGood -> bad.
When test methods grow long and repetitive I sometimes find it useful to define fill-in-the-blanks test templates that take a few parameters. Then the actual test methods are reduced to a call to the template method with the appropriate parameters.
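A rough sketch of such a template in Python; parse_config is a toy stand-in, and the template method owns the setup-run-verify rhythm while each test supplies only the parameters that make it distinct:

# Sketch: a fill-in-the-blanks template method.
import unittest

def parse_config(text):
    """Toy stand-in for the unit under test: 'key=value' lines to a dict."""
    pairs = (line.split("=", 1) for line in text.splitlines() if line.strip())
    return {key.strip(): value.strip() for key, value in pairs}

class ConfigTemplate(unittest.TestCase):
    def check_parse(self, text, expected):
        # setup-run-verify in one place; callers fill in the blanks.
        self.assertEqual(expected, parse_config(text))

class TestConfigParsing(ConfigTemplate):
    def test_single_pair(self):
        self.check_parse("host=example.org", {"host": "example.org"})

    def test_blank_lines_are_ignored(self):
        self.check_parse("\nhost=example.org\n\n", {"host": "example.org"})

    def test_values_may_contain_equals(self):
        self.check_parse("query=a=b", {"query": "a=b"})

if __name__ == "__main__":
    unittest.main()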
As for a lot of things in programming and testing, there is no clear-cut answer. You need to develop a taste, and the best way to do so is to make mistakes.
You can reduce repetition using several different flavours of test utility methods.
I'm more tolerant of repetition in test code than in production code, but I have been frustrated by it sometimes. When you change a class's design and you have to go back and tweak 10 different test methods that all do the same setup steps, it's frustrating.
I agree. The trade off exists but is different in different places.
I'm more likely to refactor duplicated code for setting up state. But less likely to refactor the part of the test that actually exercises the code. That said, if exercising the code always takes several lines of code then I might think that is a smell and refactor the actual code under test. And that will improve readability and maintainability of both the code and the tests.
Jay Fields coined the phrase that "DSLs should be DAMP, not DRY", where DAMP means descriptive and meaningful phrases. I think the same applies to tests, too. Obviously, too much duplication is bad. But removing duplication at all costs is even worse. Tests should act as intent-revealing specifications. If, for example, you specify the same feature from several different angles, then a certain amount of duplication is to be expected.
"refactored them to make them more DRY--the intent of each test was no longer clear"
It sounds like you had trouble doing the refactoring. I'm just guessing, but if it wound up less clear, doesn't that mean you still have more work to do so that you have reasonably elegant tests which are perfectly clear?
That's why tests are a subclass of UnitTest -- so you can design good test suites that are correct, easy to validate and clear.
In the olden times we had testing tools that used different programming languages. It was hard (or impossible) to design pleasant, easy-to-work with tests.
You have the full power of -- whatever language you're using -- Python, Java, C# -- so use that language well. You can achieve good-looking test code that's clear and not too redundant. There's no trade-off.
I LOVE rspec because of this:
It has two things that help:
shared example groups for testing common behaviour - you can define a set of tests, then 'include' that set in your real tests.
nested contexts - you can essentially have a 'setup' and 'teardown' method for a specific subset of your tests, not just every one in the class.
The sooner that .NET/Java/other test frameworks adopt these methods, the better (or you could use IronRuby or JRuby to write your tests, which I personally think is the better option)
I feel that test code requires a similar level of engineering that would normally be applied to production code. There can certainly be arguments made in favor of readability and I would agree that's important.
In my experience, however, I find that well-factored tests are easier to read and understand. If there are five tests that each look the same except for one variable that's changed and the assertion at the end, it can be very difficult to find what that single differing item is. Similarly, if it is factored so that only the changing variable and the assertion are visible, then it's easy to figure out what the test is doing immediately.
Finding the right level of abstraction when testing can be difficult and I feel it is worth doing.
I don't think there is a relationship between more duplication and better readability. I think your test code should be as good as your other code. Non-repeating code is more readable than duplicated code when done well.
Ideally, unit tests shouldn't change much once they are written so I would lean towards readability.
Having unit tests be as discrete as possible also helps to keep the tests focused on the specific functionality that they are targeting.
With that said, I do tend to try and reuse certain pieces of code that I wind up using over and over, such as setup code that is exactly the same across a set of tests.