Strange Unit Test Run Time Differences - unit-testing

I have unit test methods that all call exactly the same thing:
void Test()
{
    for (int i = 0; i < 100000; i++);
}
One of them always runs for a noticeably different duration.
If I remove the first one, TestMethod3 is always the different one.
If I add more test methods, TestMethod6 is always the different one.
There is always one method whose run time differs from the others.
What is the reason behind this strange difference?
I am currently studying algorithms and trying to measure run times with test methods. This difference made me wonder whether test method run times are reliable.

That has something to do with the test runner in Visual Studio. The tests usually run simultaneously, but the one you see with the greater time is usually the one that was started first. I've noticed that in Visual Studio for years now. If you run one of them on its own, you will notice that its run time is longer than when it runs as part of a "Run All".
I've always assumed that it has to do with the timer being started early, while the tests are still loading.

You can't test performance in a simple unit test. Part of the reason is that there are many different implementations and configurations of testing frameworks, with different impacts on the performance of a test.
The most notable is whether tests run in parallel (multi-threaded) or consecutively. Obviously the first option completely invalidates any benchmarking. The second option, though, still doesn't guarantee valid benchmarking.
This is because of other factors that are independent of the actual unit testing framework. These include:
Initial delays due to class loading and memory allocation
Just-in-time compilation of your byte code into machine code. This is difficult to control and can happen seemingly unpredictably.
Branch prediction, which may greatly influence your runtime behaviour, depending on the nature of the processed data and control flow
Garbage collection
Doing even remotely valid benchmarks in Java is an art form in itself. In order to get close, you should at least ensure that
you are running your code of interest in a single thread, with no other active threads
don't use garbage collection (i.e. make sure that there is enough memory for the test to run without GC, and set the GC options of your JVM appropriately)
have a warm-up phase where you run your code for a sufficient number of iterations before starting to benchmark it.
This IBM article on 'Robust Java Benchmarking' is a helpful introduction to the pitfalls of Java benchmarking.
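For what it's worth, the overall shape of such a harness (warm up first, then time many iterations) looks roughly like the sketch below. It is written in C++ for consistency with the other code examples in this document, so the JVM-specific effects (class loading, JIT, GC) are not reproduced; the workload and iteration counts are purely illustrative.
#include <chrono>
#include <cstdio>

// Illustrative stand-in for the code of interest.
volatile long long g_sink = 0;
void Workload() { for (int i = 0; i < 100000; ++i) g_sink = g_sink + i; }

int main()
{
    using Clock = std::chrono::steady_clock;

    // Warm-up phase: let caches and branch predictors (and, on a JIT'd
    // runtime, the compiler) settle before anything is measured.
    for (int i = 0; i < 1000; ++i) Workload();

    // Measurement phase: time many iterations and report the average.
    const int kIters = 10000;
    auto start = Clock::now();
    for (int i = 0; i < kIters; ++i) Workload();
    long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
        Clock::now() - start).count();
    std::printf("average: %lld ns per iteration\n", ns / kIters);
}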

Related

Are there tools/methods to objectively measure performance?

I'm writing a high performance application (a raytracer) in C++ using Visual Studio, and I just spent two days trying to root out a performance drop I witnessed after refactoring the code. The reason it took so long was because the performance drop was smaller than the normal variation in execution time I witnessed from run to run.
Not sure if this is normal, but sometimes the program runs at around 33 fps pretty consistently; then if I close and rerun it, it may run at 37 fps. This means that in order to test any new change, I had to manually run and rerun until I saw peak performance (and this could require up to 10 runs). Simply running it for some large number of frames and measuring the time doesn't fix this variability. For example, if the program runs for 40 seconds on average, it will nevertheless vary by over 1-2 seconds, which makes this test nearly useless for detecting the 1-millisecond-per-frame performance loss I was dealing with.
Visual Studio's profiling tools also didn't help find this small an issue, because they too were subject to variation, and in any case they're not necessarily going to tell me the exact offending line, so I have to test candidate solutions, and the profiler is not very effective at confirming a proposed solution's efficacy.
I realize this all may sound like premature optimization, but I don't think it is because I'm optimizing only after finishing complete features; I'm just trying to monitor changes in performance regularly so that issues like the above don't slip in and just get added to the apparent cost of the new feature.
Anyways, my question is simply whether there's a way to objectively determine the "real" speed of an application, discounting the effect of variation. Or, failing that, how do developers deal with such issues? I doubt that my current process is the ideal one.
There are lots of profilers for both C++ and OpenGL. For those who just need the links, here they are:
OpenGL debugger-profiler
C++ profilers (though I recommend Google Orbit because it has a dark theme)
My eyes stopped at "objectively measure performance".
As you mentioned, the speed varies from run to run because the system is too complex. It helps if the scope is small and you only test some key algorithms. It is worth automating the measurements and collecting reference data. As every scientist says, one test is not a test; rely on regular tests in controlled environments.
Here are some tricks that can be used to measure performance:
As others said in the comments, an average over several runs may help you. It smooths out noise from outside.
Process priority or processor affinity can help you control the environment. By giving other processes low priority, your program gets more resources.
Measure the whole execution time of a test and compare it against processor time. As several processes run at the same time, processor time may differ from wall-clock time.
Update your reference values when you update your software. One update may come with a performance boost, another with a security patch.
Give a performance range for your program instead of one specific number. Perhaps the temperature messed up your measurement and the clock speed was reduced.
If a test runs too fast to measure, execute the most critical part several times in a test case. "Too fast" depends on how accurately you can measure. On a millisecond basis it's hard to decide whether a test that took 2 ms instead of 1 ms is a failure. If you execute it 1000 times, however, 1033 ms compared to 1000 ms gives you better insight.
Only time the critical section. Set up the environment and start the stopwatch only when everything is ready; system startup could be a separate test (a sketch combining several of these tricks follows below).
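A rough C++ sketch of the above; the workload, run count, and output are illustrative placeholders, not a prescription:
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <ctime>
#include <vector>

// Illustrative stand-in for "the most critical part".
volatile long long g_sink = 0;
void CriticalSection() { for (int i = 0; i < 1000000; ++i) g_sink = g_sink + i; }

int main()
{
    const int kRuns = 30;
    std::vector<double> wall_ms;

    for (int r = 0; r < kRuns; ++r) {
        auto w0 = std::chrono::steady_clock::now();
        std::clock_t c0 = std::clock();
        CriticalSection();
        std::clock_t c1 = std::clock();
        auto w1 = std::chrono::steady_clock::now();

        double wall = std::chrono::duration<double, std::milli>(w1 - w0).count();
        double cpu  = 1000.0 * double(c1 - c0) / CLOCKS_PER_SEC;
        wall_ms.push_back(wall);
        // A large gap between wall-clock and CPU time hints at interference
        // from other processes or time spent waiting on I/O.
        std::printf("run %2d: wall %.3f ms, cpu %.3f ms\n", r, wall, cpu);
    }

    // Report a range (min .. median) rather than one specific number.
    std::sort(wall_ms.begin(), wall_ms.end());
    std::printf("min %.3f ms, median %.3f ms\n",
                wall_ms.front(), wall_ms[wall_ms.size() / 2]);
}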

Automated performance tests

At our company we have unit tests.
We are thinking of writing some automated performance tests that will also be part of the test suite, so that both developers and the automated build will run them. The tests will do something and then fail if it took more than some pre-estimated time.
The problem is, different computers have different CPU speeds, and also processes running in the background can slow down execution. So how should we go about these tests?
One strategy is to design your performance thresholds for the best machine the code will run on; as long as it runs fast enough on worse machines, you're guaranteed to have better performance in production. Basically, include a fudge factor, knowing that it will have to run on slower machines, presumably during testing/development.
Another strategy is to do some benchmarking during your test setup, and use that time as your "unit time" instead of using seconds. For example, calculate the 20th Fibonacci number using the dog-slow recursive algorithm, and then say that all the tests have to run within 10 "20-fibs": while the wall-clock time is going to be slower on slow machines, you have a machine-independent metric for how well they're running.
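A rough sketch of this calibration idea; the Fib(20) calibration, the limit of ten "20-fibs", and the dummy operation under test are all illustrative, not a prescribed implementation:
#include <chrono>
#include <cstdio>

// The "dog-slow" recursive Fibonacci, used purely as a calibration workload.
long long Fib(int n) { return n < 2 ? n : Fib(n - 1) + Fib(n - 2); }

// Measure how long one "20-fib" takes on this machine, averaged over a few
// repetitions to smooth out timer jitter.
double TwentyFibMillis()
{
    const int reps = 100;
    volatile long long sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) sink = sink + Fib(20);
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::milli>(elapsed).count() / reps;
}

int main()
{
    double unit = TwentyFibMillis();   // one machine-dependent "unit time"

    // Hypothetical operation under test; the limit is expressed in "20-fibs"
    // rather than seconds, so the same threshold works on fast and slow machines.
    auto start = std::chrono::steady_clock::now();
    volatile long long work = 0;
    for (int i = 0; i < 5000000; ++i) work = work + i;
    double elapsed = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();

    std::printf("operation took %.2f '20-fibs' (limit: 10)\n", elapsed / unit);
    return elapsed / unit <= 10.0 ? 0 : 1;
}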
Processes running in the background are harder. Obviously you usually don't want other things interfering with your tests, so one strategy is to eliminate that as much as possible: regular developers can probably kill some processes and run again if there's a failure, and your continuous integration box should be kept relatively clear.
If that doesn't work, or isn't good enough, you could try the opposite approach: run a bunch of CPU/IO-intensive processes at the same time as your tests to mimic an overloaded system; if the tests pass in that environment, the performance should be fine on a normal system.
Depending on the limiting resource of your program (I/O, CPU, memory), you can get good results by measuring the CPU time used and comparing it to the system's speed. For example, the performance tests for my current program obtain the spent CPU time with time and get the CPU speed from /proc/cpuinfo to estimate the number of cycles spent on a computation.
This approach has two caveats: firstly, it does not measure the achieved parallelism, and secondly, it does not measure external performance factors like I/O usage.
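A rough C++ sketch of the idea (Linux only); it assumes /proc/cpuinfo reports a "cpu MHz" line, and with frequency scaling the resulting cycle count is only an estimate:
#include <cstdio>
#include <ctime>
#include <fstream>
#include <sstream>
#include <string>

// Read the clock speed of the first CPU listed in /proc/cpuinfo.
double CpuMHz()
{
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.rfind("cpu MHz", 0) == 0) {
            std::istringstream iss(line.substr(line.find(':') + 1));
            double mhz = 0.0;
            iss >> mhz;
            return mhz;
        }
    }
    return 0.0;
}

int main()
{
    std::clock_t c0 = std::clock();

    // Illustrative computation standing in for the code under test.
    volatile double x = 0.0;
    for (int i = 1; i < 20000000; ++i) x = x + 1.0 / i;

    double cpu_seconds = double(std::clock() - c0) / CLOCKS_PER_SEC;
    double cycles = cpu_seconds * CpuMHz() * 1e6;
    std::printf("%.3f CPU seconds, roughly %.0f cycles\n", cpu_seconds, cycles);
}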
If the idea is to understand how code changes affect performance and ensure that the performance is greater than or equal to previous builds then you need to run the tests on a known hardware profile every time. The most accurate way to do this would be to set up a machine(s) that you use for your testing every single time the tests are executed. If many developers need to do this, sometimes simultaneously, perhaps creating a VM image that they could spin up and point to for the tests to execute on would be worthwhile.
You should not run these on the developers' boxes themselves because, as you mentioned, all kinds of factors could affect the outcome of the tests on those boxes.
You should avoid trying to measure performance while under load/strain from outside the system being tested (low disk space, network bandwidth, memory, CPU, etc.) unless those conditions are specifically set up as part of the test case. For instance, you can have three different test runs: one while the machine is under no load, another under medium load (simulating other programs running in the background), and another under high load.
You can also run tests on various hardware profiles as part of your other stress/performance tests, but you probably won't get much value out of running them against every build. Again, if you want, you could do a few different test runs against different hardware profiles; this requires more setup, though, since you would need additional machines and/or VM images and the infrastructure to kick off the tests against those machines, gather the results, and report on them.
+1 for Sam's response. I've done this a number of times in the past and it's critical to lock down your performance test environment and ensure you're minimizing any potential flux.
Running the tests on devs' systems may be a useful flag for individual devs, but having a central system to run the tests on is critical. One caveat about doing this in VMs: ensure you understand the load on the VM host system because load there can impact performance in the hosted VMs.
I've had the best, most consistent and useful results when I ran these sorts of suites during a nightly smoke check build.
It is also a question about tolerances (or acceptable capacity ranges) that will make your tests valid. Ideally, as has been stated, you need a predictable, stable and consistent set-up for any useful comparison. That said, if you understand the basic operational ranges of the SUT (CPU available, memory available, etc.), then early developer testing can be done on a mix of systems and conditions that are within the known resource tolerances.

Unit testing concurrent software - what do you do?

As software gets more and more concurrent, how do you handle testing the core behaviour of the type with your unit tests (not the parallel behaviour, just the core behaviour)?
In the good old days, you had a type, you called it, and you checked either what it returned and/or what other things it called.
Nowadays, you call a method and the actual work gets scheduled to run on the next available thread; you don't know when it'll actually start and call the other things - and what's more, those other things could be concurrent too.
How do you deal with this? Do you abstract/inject the concurrent scheduler (e.g. abstract the Task Parallel Library and provide a fake/mock in the unit tests)?
What resources have you come across that helped you?
Edit
I've edited the question to emphasise testing the normal behaviour of the type (ignoring whatever parallel mechanism is used to take advantage of multi-core, e.g. the TPL)
Disclaimer: I work for Corensic, a small startup in Seattle. We've got a tool called Jinx that is designed to detect concurrency errors in your code. It's free for now while we're in Beta, so you might want to check it out. ( http://www.corensic.com/ )
In a nutshell, Jinx is a very thin hypervisor that, when activated, slips in between the processor and operating system. Jinx then intelligently takes slices of execution and runs simulations of various thread timings to look for bugs. When we find a particular thread timing that will cause a bug to happen, we make that timing "reality" on your machine (e.g., if you're using Visual Studio, the debugger will stop at that point). We then point out the area in your code where the bug was caused. There are no false positives with Jinx. When it detects a bug, it's definitely a bug.
Jinx works on Linux and Windows, and in both native and managed code. It is language and application platform agnostic and can work with all your existing tools.
If you check it out, please send us feedback on what works and doesn't work. We've been running Jinx on some big open source projects and already are seeing situations where Jinx can find bugs 50-100 times faster than simply stress testing code.
I recommend picking up a copy of Growing Object Oriented Software by Freeman and Pryce. The last couple of chapters are very enlightening and deal with this specific topic. It also introduces some terminology which helps in pinning down the notation for discussion.
To summarize:
Their core idea is to split the functionality and the concurrency/synchronization aspects.
First, test-drive the functional part in a single synchronous thread, like a normal object.
Once you have the functional part pinned down, you can move on to the concurrent aspect. To do that, you have to come up with "observable invariants w.r.t. concurrency" for your object, e.g. a count should equal the number of times the method was called. Once you have identified the invariants, you can write stress tests that run multiple threads to try to break them. The stress tests assert your invariants (a minimal sketch appears at the end of this answer).
Finally, as an added defence, run race-detection tools or static analysis to find bugs.
For passive objects, i.e. code that is called by clients on different threads, your test needs to mimic clients by starting its own threads. You then need to choose between a notification-based or a sampling/polling approach to synchronize your tests with the SUT:
either block till you receive an expected notification, or
poll certain observable side effects with a reasonable timeout.
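For example, a minimal stress test for the "count equals the number of calls" invariant could look like the following; the Counter class, thread count, and iteration count are invented for illustration:
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Illustrative object under test; the invariant is the one from the text:
// the final count must equal the number of Increment() calls.
class Counter {
public:
    void Increment() { ++value_; }
    int Value() const { return value_.load(); }
private:
    std::atomic<int> value_{0};
};

int main()
{
    const int kThreads = 8;
    const int kCallsPerThread = 100000;

    Counter counter;
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < kCallsPerThread; ++i) counter.Increment();
        });
    for (auto& th : threads) th.join();

    // The stress test asserts the invariant; with a plain (non-atomic) int
    // this assertion would fail intermittently.
    assert(counter.Value() == kThreads * kCallsPerThread);
}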
The field of unit testing for race conditions and deadlocks is relatively new and lacks good tools.
I know of two such tools, both in early alpha/beta stages:
Microsoft's Chess
Typemock Racer
Another option is to write a "stress test" that causes deadlocks/race conditions to surface: create multiple instances/threads and run them side by side. The downside of this approach is that if the test fails, it can be very hard to reproduce. I suggest using logs in both the test and production code so that you'll be able to understand what happened.
A technique I've found useful is to run tests within a tool that detects race conditions like Intel Parallel Inspector. The test runs much slower than normal, because dependencies on timing have to be checked, but a single run can find bugs that otherwise would require millions of repeated ordinary runs.
I've found this very useful when converting existing systems for fine-grained parallelism via multi-core.
Unit tests really should not test concurrency/asynchronous behaviour; you should use mocks there and verify that the mocks receive the expected input.
For integration tests I just explicitly call the background task, then check the expectations after that.
In Cucumber it looks like this:
When I press "Register"
And the email sending script is run
Then I should have an email
Given that your TPL will have its own separate unit tests, you don't need to verify that.
Given that, I write two tests for each module:
1) A single-threaded unit test that uses an environment variable or #define to turn off the TPL, so that I can test the module for functional correctness.
2) A stress test that runs the module in its threaded deployable mode. This test attempts to find concurrency issues and should use lots of random data.
The second test often includes many modules and so is probably more of an integration/system test.
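For illustration, here is roughly what the compile-time switch from test 1 can look like. The TPL itself is .NET-specific, so this sketch uses C++ with std::async as a stand-in; the module, the DISABLE_PARALLELISM flag, and the data are all invented for the example:
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical module under test. With DISABLE_PARALLELISM defined, the work
// runs inline on the calling thread, so the functional test is deterministic.
int SumChunks(const std::vector<int>& data)
{
#ifdef DISABLE_PARALLELISM
    return std::accumulate(data.begin(), data.end(), 0);
#else
    auto mid = data.begin() + data.size() / 2;
    auto lower = std::async(std::launch::async, [&] {
        return std::accumulate(data.begin(), mid, 0); });
    int upper = std::accumulate(mid, data.end(), 0);
    return lower.get() + upper;
#endif
}

// Test 1: functional correctness, built with -DDISABLE_PARALLELISM.
// Test 2 (not shown): the same call built without the define, hammered with
// large random inputs as a stress/integration test.
int main()
{
    std::vector<int> data{1, 2, 3, 4, 5};
    assert(SumChunks(data) == 15);
}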

Testing approach for multi-threaded software

I have a piece of mature geospatial software that has recently had areas rewritten to take better advantage of the multiple processors available in modern PCs. Specifically, display, GUI, spatial searching, and main processing have all been hived off to separate threads. The software has a pretty sizeable GUI automation suite for functional regression, and another smaller one for performance regression. While all automated tests are passing, I'm not convinced that they provide nearly enough coverage in terms of finding bugs relating to race conditions, deadlocks, and other nasties associated with multi-threading. What techniques would you use to see if such bugs exist? What techniques would you advocate for rooting them out, assuming there are some in there to root out?
What I'm doing so far is running the GUI functional automation on the app under a debugger, so that I can break out of deadlocks and catch crashes, and I plan to make a BoundsChecker build and repeat the tests against that version. I've also carried out a static analysis of the source via PC-Lint in the hope of locating potential deadlocks, but haven't had any worthwhile results.
The application is C++, MFC, multiple document/view, with a number of threads per document. The locking mechanism I'm using is based on an object that holds a pointer to a CMutex, which is locked in the ctor and released in the dtor. I use local variables of this object to lock various bits of code as required, and my mutex has a timeout that fires a warning if the timeout is reached. I avoid locking where possible, using copies of resources instead.
What other tests would you carry out?
Edit: I have cross posted this question on a number of different testing and programming forums, as I'm keen to see how the different mind-sets and schools of thought would approach this issue. So apologies if you see it cross-posted elsewhere. I'll provide a summary links to responses after a week or so
Some suggestions:
Utilize the law of large numbers and perform the operation under test not only once, but many times.
Stress-test your code by exaggerating the scenarios. E.g., to test your mutex-holding class, use scenarios where the mutex-protected code (see the sketch after this list):
is very short and fast (a single instruction)
is time-consuming (Sleep with a large value)
contains explicit context switches (Sleep(0))
Run your test on various different architectures. (Even if your software is Windows-only, test it on single- and multi-core processors with and without hyperthreading, and across a wide range of clock speeds.)
Try to design your code such that most of it is not exposed to multithreading issues. E.g., instead of accessing shared data (which requires locking or very carefully designed lock-avoidance techniques), let your worker threads operate on copies of the data and communicate with them using queues. Then you only have to test your queue class for thread-safety.
Run your tests when the system is idle as well as when it is under load from other tasks (e.g. our build server frequently runs multiple builds in parallel. This alone revealed many multithreading bugs that happened when the system was under load.)
Avoid asserting on timeouts. If such an assert fails, you don't know whether the code is broken or whether the timeout was too short. Instead, use a very generous timeout (just to ensure that the test eventually fails). If you want to test that an operation doesn't take longer than a certain time, measure the duration, but don't use a timeout for this.
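The sketch referenced above: a stress harness for the exaggerated scenarios, with std::mutex standing in for the poster's CMutex wrapper; the thread and iteration counts are arbitrary:
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Stand-ins for the mutex-holding lock object and the data it protects.
std::mutex g_mutex;
long long g_shared = 0;

// Run `threads` threads that each take the lock `iters` times and execute
// `body` inside the critical section.
template <typename Body>
void Hammer(int threads, int iters, Body body)
{
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> lock(g_mutex);  // ctor locks, dtor unlocks
                body();
            }
        });
    for (auto& w : workers) w.join();
}

int main()
{
    Hammer(8, 100000, [] { ++g_shared; });                     // very short and fast
    Hammer(8, 50, [] {                                         // time-consuming
        std::this_thread::sleep_for(std::chrono::milliseconds(5)); });
    Hammer(8, 100000, [] { std::this_thread::yield(); });      // explicit context switch
}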
Whilst I agree with rstevens' answer that there's currently no way to unit test threading issues with 100% certainty, there are some things that I've found useful.
Firstly, whatever tests you have, make sure you run them on lots of boxes with different specs. I have several build machines, all different: multi-core, single-core, fast, slow, etc. The good thing about how diverse they are is that different ones will throw up different threading issues. I've regularly been surprised to add a new build machine to my farm and suddenly have a new threading bug exposed; and I'm talking about a new bug being exposed in code that has run tens of thousands of times on the other build machines and which shows up one run in ten on the new one...
Secondly, most of the unit testing that you do on your code needn't involve threading at all. The threading is, generally, orthogonal. So step one is to tease the code apart so that you can test the actual code that does the work without worrying too much about the threaded nature. This usually means creating an interface that the threading code uses to drive the real code. You can then test the real code in isolation.
Thirdly, you can test where the threaded code interacts with the main body of code. This means writing a mock for the interface that you developed to separate the two blocks of code. By now the threading code is likely much simpler, and you can often place synchronisation objects in the mock that you've made so that you can control the code under test. So you'd spin up your thread and wait for it to set an event by calling into your mock, and then have it block on another event which your test code controls. The test code can then step the threaded code from one point in your interface to the next.
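A minimal sketch of that technique with C++11 primitives; the IWorkSink interface, the mock, and the value passed are invented for the example:
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal manual-reset event, standing in for whatever synchronisation
// object you place inside the mock.
class Event {
public:
    void Set() { { std::lock_guard<std::mutex> l(m_); set_ = true; } cv_.notify_all(); }
    void Wait() { std::unique_lock<std::mutex> l(m_); cv_.wait(l, [&] { return set_; }); }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool set_ = false;
};

// The interface the threaded code drives.
struct IWorkSink { virtual void OnWork(int value) = 0; virtual ~IWorkSink() = default; };

// Mock that records the call, signals the test, then blocks until the test
// lets it continue, so the test can step the thread from one point to the next.
struct MockSink : IWorkSink {
    Event called, resume;
    int last_value = 0;
    void OnWork(int value) override {
        last_value = value;
        called.Set();
        resume.Wait();
    }
};

int main()
{
    MockSink mock;
    std::thread worker([&] { mock.OnWork(42); });  // the "threaded code" under test

    mock.called.Wait();             // wait until the thread reaches the mock
    assert(mock.last_value == 42);  // inspect state while the thread is parked
    mock.resume.Set();              // let the thread continue
    worker.join();
}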
Finally (if you've decoupled things enough that you can do the earlier stuff then this is easy) you can then run larger pieces of the multi-threaded parts of the app under test and make sure you get the results that you expect; you can play with the priority of the threads and maybe even add a couple of test threads that simply eat CPU to stir things up a bit.
Now you run all of these tests many many times on different hardware...
I've also found that running the tests (or the app) under something like DevPartner BoundsChecker can help a lot, as it messes with the thread scheduling in a way that sometimes shakes out hard-to-find bugs. I also wrote a deadlock-detection tool which checks for lock inversions during program execution, but I only use that rarely.
You can see an example of how I test multi-threaded C++ code here: http://www.lenholgate.com/blog/2004/05/practical-testing.html
Not really an answer:
Testing for multithreaded bugs is very difficult. Most bugs only show up if two (or more) threads reach specific places in the code in a specific order.
Whether and when this condition is met may depend on the timing of the running process. This timing may change due to any of the following pre-conditions:
Type of processor
Processor speed
Number of processors/cores
Optimization level
Running inside or outside the debugger
Operating system
There are surely more pre-conditions that I have forgotten.
Because MT bugs depend so strongly on the exact timing of the running code, Heisenberg's "uncertainty principle" comes in here: if you want to test for MT bugs, you change the timing by your "measures", which may prevent the bug from occurring...
The timing is what makes MT bugs so highly non-deterministic.
In other words: you may have software that runs for months and then crashes one day, and after that runs for years. If you don't have debug logs/core dumps etc., you may never know why it crashed.
So my conclusion is: there is no really good way to unit test for thread-safety. You always have to keep your eyes open when programming.
To make this clear, I will give you a (simplified) example from real life (I encountered this when changing employers and looking at the existing code there):
Imagine you have a class. You want that class to be deleted automatically when no one uses it anymore, so you build a reference counter into the class:
(I know it is bad style to delete an instance of a class in one of its methods. This is a simplification of the real code, which uses a Ref class to handle counted references.)
class A {
private:
    int refcount;
public:
    A() : refcount(0) {
    }
    void Ref() {
        refcount++;
    }
    void Release() {
        refcount--;
        if (refcount == 0) {
            delete this;
        }
    }
};
This seems pretty simple and nothing to worry about. But it is not thread-safe!
That's because "refcount++" and "refcount--" are not atomic operations; each one is actually three operations:
read refcount from memory to register
increment/decrement register
write refcount from register to memory
Each of those operations can be interrupted, and another thread may manipulate the same refcount at the same time. So if, for example, two threads want to increment refcount, the following COULD happen:
Thread A: read refcount from memory to register (refcount: 8)
Thread A: increment register
-- CONTEXT CHANGE --
Thread B: read refcount from memory to register (refcount: 8)
Thread B: increment register
Thread B: write refcount from register to memory (refcount: 9)
-- CONTEXT CHANGE --
Thread A: write refcount from register to memory (refcount: 9)
So the result is: refcount = 9 but it should have been 10!
This can only be solved by using atomic operations (i.e. InterlockedIncrement() & InterlockedDecrement() on Windows).
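For illustration, here is the same class with the increment and decrement made atomic; this sketch uses std::atomic as a portable equivalent of the Interlocked* calls mentioned above:
#include <atomic>

class A {
private:
    std::atomic<int> refcount;
public:
    A() : refcount(0) {
    }
    void Ref() {
        refcount.fetch_add(1, std::memory_order_relaxed);
    }
    void Release() {
        // fetch_sub returns the value *before* the decrement, so exactly one
        // thread observes the transition from 1 to 0 and performs the delete.
        if (refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
        }
    }
};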
This bug is simply untestable! The reason is that it is so unlikely that two threads try to modify the refcount of the same instance at the same time, with a context switch in between exactly that code.
But it can happen! (The probability increases on a multi-processor or multi-core system, because no context switch is needed to make it happen.)
It will happen, some day, after weeks or months!
Looks like you are using Microsoft tools. There's a group at Microsoft Research that has been working on a tool specifically designed to shake out concurrency bugz. Check out CHESS. Other research projects, in their early stages, are Cuzz and Featherlite.
VS2010 includes a very good looking concurrency profiler, video is available here.
As Len Holgate mentions, I would suggest refactoring (if needed) and creating interfaces for the parts of the code where different threads interact with objects carrying state. These parts of the code can then be tested separately from the code containing the actual functionality. To verify such a unit test, I would consider using a code coverage tool (I use gcov and lcov for this) to verify that everything in the thread-safe interface is covered.
I think this is a pretty convenient way of verifying that new code is covered in the tests.
The next step is then to follow the advice of the other answers regarding how to run the tests.
Firstly, many thanks for the responses. For the responses posted across the different forums, see:
http://www.sqaforums.com/showflat.php?Cat=0&Number=617621&an=0&page=0#Post617621
Testing approach for multi-threaded software
http://www.softwaretestingclub.com/forum/topics/testing-approach-for?xg_source=activity
and the following mailing list: software-testing#yahoogroups.com
The testing took significantly longer than expected, hence this late reply, and led me to the conclusion that adding multi-threading to existing apps is liable to be very expensive in terms of testing, even if the coding is quite straightforward. This could prove interesting for the SQA community, as there is increasingly more multi-threaded development going on out there.
As per Joe Strazzere's advice, I found the most effective way of hitting bugs was via automation with varied input. I ended up doing this on three PCs, which ran a bank of tests repeatedly with varied input over about six weeks. Initially, I was seeing crashes once or twice per PC per day. As I tracked these down, it dropped to one or two per week across the three PCs, and we haven't had any further problems for the last two weeks. For the last two weeks we have also had a version out with beta-testing users, and are using the software in-house.
In addition to varying the input under automation, I also got good results from the following:
Adding a test option that allowed mutex time-outs to be read from a configuration file, which in turn could be controlled by my automation.
Extending mutex time-outs beyond the typical time expected to execute a section of thread code, and firing a debug exception on time-out.
Running the automation in conjunction with a debugger (VS2008) such that when a problem occurred there was a better chance of tracking it down.
Running without a debugger to ensure that the debugger was not hiding other timing related bugs.
Running the automation against normal release, debug, and fully optimised builds. FWIW, the optimised build threw up errors not reproducible in the other builds.
The bugs uncovered tended to be serious in nature, e.g. dereferencing invalid pointers, and even under the debugger took quite a bit of tracking down. As has been discussed elsewhere, the SuspendThread and ResumeThread functions ended up being major culprits, and all uses of these functions were replaced by mutexes. Similarly, all critical sections were removed due to their lack of time-outs. Closing documents and exiting the program were also a source of bugs, where in one instance a document was destroyed while a worker thread was still active. To overcome this, a single mutex was added per thread to control the life of the thread, and acquired by the document destructor to ensure the thread had terminated as expected.
Once again, many thanks for the all the detailed and varied responses. Next time I take on this type of activity, I'll be better prepared.

Unit test execution speed (how many tests per second?)

What kind of execution rate do you aim for with your unit tests (# tests per second)? How long is too long for an individual unit test?
I'd be interested in knowing if people have any specific thresholds for determining whether their tests are too slow, or is it just when the friction of a long running test suite gets the better of you?
Finally, when you do decide the tests need to run faster, what techniques do you use to speed up your tests?
Note: integration tests are obviously a different matter again. We are strictly talking unit tests that need to be run as frequently as possible.
Response roundup: thanks for the great responses so far. Most advice seems to be: don't worry about the speed, concentrate on quality, and just run tests selectively if they are too slow. Answers with specific numbers have included aiming for anything from under 10 ms up to 0.5-1 second per test, or just keeping the entire suite of commonly run tests under 10 seconds.
Not sure whether it's right to mark one as an "accepted answer" when they're all helpful :)
All unit tests should run in under a second (that is, all unit tests combined should run in about 1 second). Now I'm sure this has practical limits, but I've had a project with 1,000 tests that ran this fast on a laptop. You'll really want this speed so your developers don't dread refactoring some core part of the model (i.e., "Lemme go get some coffee while I run these tests"... ten minutes later he comes back).
This requirement also forces you to design your application correctly. It means that your domain model is pure and contains zero references to any type of persistence (file I/O, database, etc.). Unit tests are all about testing those business relationships.
Now that doesn't mean you ignore testing your database or persistence. But these concerns are now isolated behind repositories that can be tested separately with integration tests located in a separate project. You run your unit tests constantly while writing domain code and then run your integration tests once, on check-in.
The goal is 100s of tests per second. The way you get there is by following Michael Feathers' rules of unit tests.
An important point that came up in a past CITCON discussion is that if your tests aren't this fast it is quite likely that you aren't getting the design benefits of unit testing.
If we're talking strictly unit tests, I'd aim more for completeness than speed. If the run time starts to cause friction, separate the tests into different projects/classes etc., and only run the tests related to what you're working on. Let the integration server run all the tests on check-in.
I tend to focus more on the readability of my tests than their speed. However, I still try to make them reasonably fast. I think if they run on the order of milliseconds, you are fine. If they take a second or more per test, then you might be doing something that should be optimized.
Slow tests only become a problem as the system matures and the build starts to take hours, at which point you are more likely dealing with a lot of somewhat-slow tests rather than one or two tests that you can optimize easily. Thus you should probably pay attention RIGHT AWAY if you see lots of tests taking hundreds of milliseconds each (or worse, seconds each), rather than waiting until you have hundreds of tests taking that long (at which point it is going to be really hard to solve the problem).
Even so, slow tests will only increase the time between when your automated build reports errors... which is OK if that is an hour later (or even a few hours later), I think. The problem is running them before you check in, but this can be avoided by selecting a small subset of tests to run that are related to what you are working on. Just make sure to fix the build if you check in code that breaks tests you didn't run!
We're currently at 270 tests in around 3.something seconds. There are probably around 8 tests that perform file I/O.
These are run automatically upon a successful build of our libraries on every engineer's machine. We have more extensive (and time-consuming) smoke tests that are run by the build machine every night, or can be started manually on an engineer's machine.
As you can see, we haven't yet reached the point where the tests are too time-consuming. Ten seconds for me is the point where it starts to become intrusive; when we start to approach that, it'll be something we'll take a look at. We'll likely move the lower-level libraries, which are more robust since they change infrequently and have few dependencies, into the nightly builds or a configuration where they're only executed by the build machine.
If you find it's taking more than a few seconds to run a hundred or so tests, you may need to examine what you are classifying as a unit test and whether it would be better treated as a smoke test.
Your mileage will obviously be highly variable depending on your area of development.
Data Point -- Python Regression Tests
Here are the numbers on my laptop for running "make test" for Python 2.5.2:
number of tests: 3851 (approx)
execution time: 9 min, 6 sec
execution rate: 7 tests / sec
One of the most important rules about unit tests is they should run fast.
How long is too long for an individual unit test?
Developers should be able to run the whole suite of unit tests in seconds, and definitely not in minutes. Developers should be able to run them quickly after changing the code in any way. If it takes too long, they won't bother running them and you lose one of the main benefits of the tests.
What kind of execution rate do you aim for with your unit tests (# test per second)?
You should aim for each test to run in an order of milliseconds, anything over 1 second is probably testing too much.
We currently have about 800 tests that run in under 30 seconds, about 27 tests per second. This includes the time to launch the mobile emulator needed to run them. Most of them take 0-5 ms each (if I remember correctly).
We have one or two that take about 3 seconds, which are probably candidates for checking, but the important thing is the whole test suite doesn't take so long that it puts off developers running it, and doesn't significantly slow down our continuous integration build.
We also have a configurable timeout limit set to 5 seconds -- anything taking longer will fail.
I judge my unit tests on a per-test basis, not by the number of tests per second. The rate I aim for is 500 ms or less. If a test is above that, I will look into it to find out why it is taking so long.
When I think a test is too slow, it usually means that it is doing too much. Therefore, just refactoring the test by splitting it into more tests usually does the trick. The other time I have noticed my tests running slow is when a test reveals a bottleneck in my code, in which case a refactoring of the code is in order.
How long is too long for an individual unit test?
I'd say it depends on the compile speed. One usually executes the tests at every compile. The objective of unit testing is not to slow you down, but to bring a message "nothing broken, go on" (or "something broke, STOP").
I do not bother about test execution speed until this is something that starts to get annoying.
The danger is to stop running the tests because they're too slow.
Finally, when you do decide the tests need to run faster, what techniques do you use to speed up your tests?
The first thing to do is to find out why they are too slow, and whether the issue is in the unit tests or in the code under test.
I'd try to break the test suite into several logical parts, running only the part that is supposedly affected by the code I changed at every compile. I'd run the other suites less often, perhaps once a day, or whenever I'm in doubt whether I could have broken something, and at least before integrating.
Some frameworks provide automatic execution of specific unit tests based on heuristics such as last-modified time. For Ruby and Rails, AutoTest provides much faster and responsive execution of the tests -- when I save a Rails model app/models/foo.rb, the corresponding unit tests in test/unit/foo_test.rb get run.
I don't know if anything similar exists for other platforms, but it would make sense.