When to Use Gated Check-In?

I am using TFS 2010. Currently I use Gated Check-in build on the trunk (MAIN) branch only. And, I use CI on DEV and RELEASE branches.
Why not use Gated Check-in builds on all the branches?
In what scenarios should you not use a Gated Check-in build on the DEV and RELEASE branches?
Is it always better to use Gated Check-in builds on every branch?

In our very large team, we also do gated in the main branch and CI in the dev/feature branches (many of them).
Gated offers more protection for the branch but with a very large team and large code base, it can back up the queue if the whole dev team is doing changes in that branch.
CI provides protection with a little more trust in the developers, knowing that any issues will get caught quickly. It's a bit more optimistic and lets the team move much faster, which is appropriate for a dev branch.
In both cases, devs run unit tests and test the code they are changing. Neither CI (where a break affects the team) nor gated builds (which consume time in the queue) should replace testing; a failure should have a more plausible explanation than "I didn't try it."
The whole team works in feature/dev branches using CI for the majority of the cycle, and many more folks pile into the higher branches during end-game stabilization; both of those conditions (more people, more churn) support the case for gated builds there.
In a large team, we also need the CI builds and the rolling tests to run in parallel to find issues quicker, since neither build times nor full test suites are trivial. In that scenario, folks are checking in, CI picks up the last batch of check-ins and runs a build, and when a build drops, another machine picks it up and runs the test suites.

There is not really a reason that I know of why not to do a Gated Check-in on every change you make. There is, however, (in general) a prerequisite for doing Gated Check-in: your overall build time should not be longer than a few minutes, including any (unit) tests you would like to have run before the check-in is accepted. Otherwise it takes too much time for a check-in to get accepted, or worse for the developer, to get rejected. For a dev team it is also a bit more complex, or at least something to get used to.
Continuous Integration (in my opinion optimized in the form of rolling builds) lets the developer check in code without the uncertainty of whether it will be accepted or not. What is important is that the dev is confronted as soon as possible with any negative end results of a check-in. If you can achieve that, I like it better than Gated Check-ins.

I prefer Gated Check-Ins everywhere because it limits the pain to the developer checking in rather than sharing that pain with the entire team when somebody (inevitably) makes a mistake.
As mentioned above, it's important to keep the Gated Check-Ins quick. I will sometimes have a Gated Check-In that runs the most important checks, then a CI build that kicks off after the Gated Check-In succeeds and runs more time-consuming checks.

Related

Are there tools/methods to objectively measure performance?

I'm writing a high performance application (a raytracer) in C++ using Visual Studio, and I just spent two days trying to root out a performance drop I witnessed after refactoring the code. The reason it took so long was because the performance drop was smaller than the normal variation in execution time I witnessed from run to run.
Not sure if this is normal, but sometimes the program may run at around 33fps pretty consistently, then if you close and rerun, it may run at 37fps. This means that in order to test any new change, I had to manually run and rerun until I witnessed peak performance (and this could require up to 10 runs). Simply running it for some large number of frames and measuring the time doesn't fix this variability. For example, if the program runs for 40 seconds on average, it will nevertheless vary by 1-2 seconds or more, which makes this test nearly useless for detecting the 1 millisecond per frame performance loss I was dealing with.
Visual Studio's profiling tools also didn't help find an issue this small, because they too were subject to variation, and in any case they're not necessarily going to tell me the exact offending line, so I have to test solutions, and the profiler is not very effective at confirming a proposed solution's efficacy.
I realize this all may sound like premature optimization, but I don't think it is because I'm optimizing only after finishing complete features; I'm just trying to monitor changes in performance regularly so that issues like the above don't slip in and just get added to the apparent cost of the new feature.
Anyway, my question is simply whether there's a way to objectively determine the "real" speed of an application, discounting the effect of variation. Or, failing that, how do developers deal with such issues? I doubt that my current process is the ideal one.
There are lots of profilers for both C++ and OpenGL. For those who just need the links, here they are.
OpenGL debugger-profiler
C++ profilers, though I recommend Google's Orbit because it has a dark theme.
My eyes stopped at
Objectively measure performance
As you mentioned, the speed varies from run to run because the system is too complex. It helps if the scope is small and you only test some key algorithms. It's worth automating the measurement and collecting reference data. As every scientist says, one test is not a test; you should rely on regular tests in controlled environments.
Here are some tricks that can be used to measure performance.
As others said in the comments, an average based on several runs may help you. It softens the noise from outside.
Process priority or processor affinity can help you control the environment. By giving low priority to other processes, your program gains more resources.
Measure the whole execution time of a test and compare it against processor time. Since several processes run at the same time, processor time may differ from wall-clock execution time.
Update your reference values when you do a software update; one update may come with a performance boost while another comes with a security patch.
Give a performance range for your program instead of one specific number. Perhaps temperature throttling lowered the clock speed and skewed a measurement.
If a test runs too fast to measure, execute the most critical part several times in a test case (see the sketch after this list). "Too fast" depends on how accurately you can measure: on a millisecond basis it's really hard to decide whether a test that executed in 2 ms instead of 1 ms is a failure or not, but executed 1000 times, 1033 ms compared to 1000 ms gives you better insight.
Only time the critical section. Set up the environment and start the stopwatch only when everything is ready; the system startup could be another test.
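A minimal Python sketch combining a few of those tricks (repetition, a range over several runs, and wall-clock versus CPU time); work() is a hypothetical stand-in for the critical section being measured:

    import statistics
    import time

    def work():
        # Hypothetical stand-in for the critical section being measured.
        sum(i * i for i in range(10_000))

    def measure(repetitions=1000, runs=10):
        wall, cpu = [], []
        for _ in range(runs):
            w0, c0 = time.perf_counter(), time.process_time()
            for _ in range(repetitions):
                work()
            wall.append(time.perf_counter() - w0)
            cpu.append(time.process_time() - c0)
        # Report a range rather than one specific number.
        print(f"wall: min {min(wall):.3f}s, median {statistics.median(wall):.3f}s")
        print(f"cpu:  min {min(cpu):.3f}s, median {statistics.median(cpu):.3f}s")

    if __name__ == "__main__":
        measure()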

Automated performance tests

At our company we have unit tests.
We are thinking of writing some automated performance tests that will also be part of the test suite, so that both developers and the automated build will run them. The tests will do something and then fail if it takes more than some pre-estimated time.
The problem is, different computers have different CPU speeds, and also processes running in the background can slow down execution. So how should we go about these tests?
One strategy is to design your performance metrics for the best machine that code will run on; as long as it runs fast enough on worse machines, you're guaranteed to have better performance in production. Basically, include a fudge factor knowing that it will have to run on slower machines, presumably during testing/development.
Another strategy is to do some benchmarking during your test setup, and use that time amount as your "unit time" instead of using seconds. For example, calculate the 20th Fibonacci number using the dog-slow recursive algorithm, and then say that all the tests have to run within 10 "20-fibs". While the wall-clock time is going to be slower on slow machines, you have a machine-independent metric for how well it's running.
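A minimal sketch of that calibration idea in Python; fib(20) as the yardstick and the budget of 10 "20-fibs" are the illustrative values from the paragraph above, not a recommendation:

    import time

    def fib(n):
        # Deliberately slow recursive Fibonacci, used only as a yardstick.
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    def one_twenty_fib():
        # How long this particular machine takes to compute fib(20).
        start = time.perf_counter()
        fib(20)
        return time.perf_counter() - start

    def assert_within_budget(test_fn, budget_in_20_fibs=10):
        unit = one_twenty_fib()
        start = time.perf_counter()
        test_fn()
        elapsed = time.perf_counter() - start
        assert elapsed <= budget_in_20_fibs * unit, (
            f"took {elapsed / unit:.1f} 20-fibs, budget {budget_in_20_fibs}")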
Processes running in the background are harder to deal with. Obviously you usually don't want other things interfering with your test, so one strategy is to try and eliminate that as much as possible: regular developers can probably kill some processes and run again if there's a failure, and your continuous integration box should be kept relatively clear.
If that doesn't work, or isn't good enough, you could try the opposite approach: run a bunch of CPU/IO-intensive processes at the same time as your tests to mimic an overloaded system. If the tests pass in that environment, the performance should be fine on a normal system.
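A rough sketch of that opposite approach in Python, assuming a CPU-bound system under test; the spin() busy-loop and the process count are arbitrary choices:

    import multiprocessing
    import time

    def spin(seconds=30):
        # Busy-loop that keeps one CPU core loaded.
        end = time.monotonic() + seconds
        while time.monotonic() < end:
            pass

    def run_under_load(test_fn, load_processes=4):
        workers = [multiprocessing.Process(target=spin)
                   for _ in range(load_processes)]
        for w in workers:
            w.start()
        try:
            test_fn()  # must still meet its time budget under load
        finally:
            for w in workers:
                w.terminate()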
Depending on the limiting resource of your program (I/O, CPU, memory), you can get good results by measuring the CPU time used and comparing it to the system speed. For example, the performance tests for my current program obtain the spent CPU time with time and get the CPU speed from /proc/cpuinfo to measure the number of cycles spent on a computation.
This approach has two caveats: firstly, it does not measure the achieved parallelism, and secondly, it does not measure external performance factors like I/O usage.
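A minimal Python version of that idea (Linux-only, since it reads /proc/cpuinfo; the cycle count is an approximation and inherits both caveats above):

    import time

    def cpu_mhz():
        # First clock speed reported in /proc/cpuinfo (Linux only).
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("cpu MHz"):
                    return float(line.split(":")[1])
        raise RuntimeError("cpu MHz not found")

    def cycles_spent(fn):
        start = time.process_time()           # CPU time, not wall-clock
        fn()
        cpu_seconds = time.process_time() - start
        return cpu_seconds * cpu_mhz() * 1e6  # approximate cycles used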
If the idea is to understand how code changes affect performance and ensure that the performance is greater than or equal to previous builds then you need to run the tests on a known hardware profile every time. The most accurate way to do this would be to set up a machine(s) that you use for your testing every single time the tests are executed. If many developers need to do this, sometimes simultaneously, perhaps creating a VM image that they could spin up and point to for the tests to execute on would be worthwhile.
You should not run these on the developers' boxes themselves because, as you mentioned, all kinds of factors could affect the outcome of the tests on those boxes.
You should avoid trying to measure performance while under load/strain from outside of the system being tested (low disk space, network bandwidth, memory, CPU, etc.) unless those conditions are specifically set up as part of the test case. For instance, you can have 3 different test runs: one while the machine is under no load, another under medium load (simulating other programs running in the background), and another under high load.
You can also run tests on various hardware profiles as part of your other stress/performance tests, but you probably won't get much value out of running them against every build. Again, if you want, you could do a few different test runs against different hardware profiles; this requires more setup, though, since you would need additional machines and/or VM images set up and the infrastructure to kick off the tests against these machines, gather the results, and report on them.
+1 for Sam's response. I've done this a number of times in the past and it's critical to lock down your performance test environment and ensure you're minimizing any potential flux.
Running the tests on devs' systems may be a useful flag for individual devs, but having a central system to run the tests on is critical. One caveat about doing this in VMs: ensure you understand the load on the VM host system because load there can impact performance in the hosted VMs.
I've had the best, most consistent and useful results when I ran these sorts of suites during a nightly smoke check build.
It is also a question of tolerances (or acceptable capacity ranges) that will make your tests valid. Ideally, as has been stated, you need a predictable, stable, and consistent setup for any useful comparison. That said, if you understand the basic operational ranges of the SUT (CPU available, memory available, etc.), then early developer testing can be done on a mix and match of systems and conditions that are within the known resource tolerances.

Best practices for the maximum running time of unit tests in CI

We are doing continuous integration at our company with TeamCity and we have unit tests running at every commit (1 min window).
Lately, we have been debating how long a batch of unit tests should take; the shorter the better, of course.
However, I would like to know what is the best practice for the length of a batch of unit tests?
You could build priorities into your unit tests, and only use a subset as a check-in gate (Build Verification Test, or BVT). Run lower priority tests less often (e.g. per daily build, per test pass, or per product release). Then place separate execution time limits on each (or each suite) that satisfies your dev team.
I base priorities on how fast we'd jump on fixing the bug signaled by a test failure. P0 means "must-fix, even if we have to slip the schedule", P3 means "may never fix".
One of the teams I worked on said no more than 2 minutes per feature for BVTs, and placed no time restriction on lower priority tests. The devs had to run about 5 test suites, and it was reasonable with our volume of check-ins to queue up 10 minute buddy builds. But our "unit tests" were big-huge, special-environment-required integration tests, so YMMV.
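A sketch of that priority split using pytest markers; the p0/p2 marker names, the add() function, and the test bodies are illustrative assumptions, not from the answer above:

    import pytest

    def add(a, b):
        # Hypothetical code under test.
        return a + b

    @pytest.mark.p0   # must-fix priority: part of the check-in gate (BVT)
    def test_add_small_numbers():
        assert add(1, 2) == 3

    @pytest.mark.p2   # lower priority: run in the nightly pass only
    def test_add_large_numbers():
        assert add(10**9, 10**9) == 2 * 10**9

The gated build would then run only pytest -m p0, while the nightly build drops the filter; registering the custom markers in pytest.ini keeps pytest from warning about unknown marks.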
Unit tests should be run until they are all complete; don't restrict the set of unit tests based upon the runtime. If you have lots of unit tests, and they're taking a long time to run, investigate getting faster hardware to run the CI system on; buying more expensive hardware is far cheaper than not putting in place the unit tests that would detect a problem before it becomes a major bug.
"As short as possible."
But really, it depends on exactly what you're asking. Should you remove tests to shorten the build? Probably not. Might you limit the scope of the tests run on a per-commit build? Maybe so. Should you limit the tests run on a nightly build? Probably not. Exactly how long it's okay for a build to take really depends on your team, your process, and how you integrate CI into them.
Very simple: make the (unit) tests faster or parallelize them. Unit tests should work; you shouldn't limit them by runtime.

Unit testing concurrent software - what do you do?

As software gets more and more concurrent, how do you handle testing the core behaviour of the type with your unit tests (not the parallel behaviour, just the core behaviour)?
In the good old days, you had a type, you called it, and you checked either what it returned and/or what other things it called.
Nowadays, you call a method and the actual work gets scheduled to run on the next available thread; you don't know when it'll actually start and call the other things - and what's more, those other things could be concurrent too.
How do you deal with this? Do you abstract/inject the concurrent scheduler (e.g. abstract the Task Parallel Library and provide a fake/mock in the unit tests)?
What resources have you come across that helped you?
Edit
I've edited the question to emphasise testing the normal behaviour of the type (ignoring whatever parallel mechanism is used to take advantage of multi-core, e.g. the TPL)
Disclaimer: I work for Corensic, a small startup in Seattle. We've got a tool called Jinx that is designed to detect concurrency errors in your code. It's free for now while we're in Beta, so you might want to check it out. ( http://www.corensic.com/ )
In a nutshell, Jinx is a very thin hypervisor that, when activated, slips in between the processor and operating system. Jinx then intelligently takes slices of execution and runs simulations of various thread timings to look for bugs. When we find a particular thread timing that will cause a bug to happen, we make that timing "reality" on your machine (e.g., if you're using Visual Studio, the debugger will stop at that point). We then point out the area in your code where the bug was caused. There are no false positives with Jinx. When it detects a bug, it's definitely a bug.
Jinx works on Linux and Windows, and in both native and managed code. It is language and application platform agnostic and can work with all your existing tools.
If you check it out, please send us feedback on what works and doesn't work. We've been running Jinx on some big open source projects and already are seeing situations where Jinx can find bugs 50-100 times faster than simply stress testing code.
I recommend picking up a copy of Growing Object-Oriented Software by Freeman and Pryce. The last couple of chapters are very enlightening and deal with this specific topic. It also introduces some terminology which helps in pinning down the notation for discussion.
To summarize ....
Their core idea is to split the functionality and concurrent/synchronization aspects.
First test-drive the functional part in a single synchronous thread like a normal object.
Once you have the functional part pinned down, you can move on to the concurrent aspect. To do that, you'd have to think and come up with "observable invariants w.r.t. concurrency" for your object, e.g. the count should be equal to the number of times the method was called. Once you have identified the invariants, you can write stress tests that run multiple threads and so on to try and break your invariants (see the sketch after this summary). The stress tests assert your invariants.
Finally, as an added defence, run tools or static analysis to find bugs.
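A minimal sketch of such an invariant-asserting stress test in Python; the Counter class is a hypothetical system under test, and the thread and iteration counts are arbitrary:

    import threading

    class Counter:
        # Hypothetical object under test.
        def __init__(self):
            self._lock = threading.Lock()
            self.value = 0

        def increment(self):
            with self._lock:
                self.value += 1

    def test_count_equals_number_of_calls():
        counter = Counter()
        threads_n, calls_per_thread = 8, 10_000

        def hammer():
            for _ in range(calls_per_thread):
                counter.increment()

        threads = [threading.Thread(target=hammer) for _ in range(threads_n)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # Observable invariant: the count equals the number of increment calls.
        assert counter.value == threads_n * calls_per_thread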
For passive objects, i.e. code that'd be called from clients on different threads: your test needs to mimic clients by starting its own threads. You would then need to choose between a notification-listening or sampling/polling approach to synchronize your tests with the SUT.
You could either block until you receive an expected notification,
or poll certain observable side-effects with a reasonable timeout.
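Both synchronization styles in a minimal Python sketch (threading.Event for the notification approach, a polling loop for sampling; the timeouts are arbitrary):

    import threading
    import time

    def wait_for_notification(event, timeout=2.0):
        # Notification-listening: block until the SUT signals completion.
        assert event.wait(timeout), "SUT never signaled completion"

    def poll_for(condition, timeout=2.0, interval=0.01):
        # Sampling/polling: re-check an observable side-effect until the deadline.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if condition():
                return
            time.sleep(interval)
        raise AssertionError("condition not observed within timeout")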
The field of unit testing for race conditions and deadlocks is relatively new and lacks good tools.
I know of two such tools both in early alpha/beta stages:
Microsoft's Chess
Typemock Racer
Another option is to try and write a "stress test" that causes deadlocks/race conditions to surface: create multiple instances/threads and run them side by side. The downside of this approach is that if the test fails, it can be very hard to reproduce. I suggest using logs in both the test and production code so that you'll be able to understand what happened.
A technique I've found useful is to run tests within a tool that detects race conditions like Intel Parallel Inspector. The test runs much slower than normal, because dependencies on timing have to be checked, but a single run can find bugs that otherwise would require millions of repeated ordinary runs.
I've found this very useful when converting existing systems for fine-grained parallelism via multi-core.
Unit tests really should not test concurrency/asynchronous behaviour, you should use mocks there and verify that the mocks receive the expected input.
For integration tests I just explicitly call the background task, then check the expectations after that.
In Cucumber it looks like this:
When I press "Register"
And the email sending script is run
Then I should have an email
Given that your TPL will have its own separate unit tests, you don't need to verify that. Given that, I write two tests for each module:
1) A single-threaded unit test that uses some environment variable or #define to turn off the TPL, so that I can test my module for functional correctness.
2) A stress test that runs the module in its threaded deployable mode. This test attempts to find concurrency issues and should use lots of random data.
The second test often includes many modules and so is probably more of an integration/system test.
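A sketch of that environment-variable switch in Python, using a thread pool as a stand-in for the TPL; the FORCE_SYNC variable name and the InlineExecutor are made up for illustration:

    import os
    from concurrent.futures import Executor, Future, ThreadPoolExecutor

    class InlineExecutor(Executor):
        # Runs submitted work synchronously, so functional tests are deterministic.
        def submit(self, fn, *args, **kwargs):
            future = Future()
            future.set_result(fn(*args, **kwargs))
            return future

    def make_executor():
        # FORCE_SYNC is a hypothetical switch set only in the unit-test run.
        if os.environ.get("FORCE_SYNC") == "1":
            return InlineExecutor()
        return ThreadPoolExecutor()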

Unit test execution speed (how many tests per second?)

What kind of execution rate do you aim for with your unit tests (# test per second)? How long is too long for an individual unit test?
I'd be interested in knowing if people have any specific thresholds for determining whether their tests are too slow, or is it just when the friction of a long running test suite gets the better of you?
Finally, when you do decide the tests need to run faster, what techniques do you use to speed up your tests?
Note: integration tests are obviously a different matter again. We are strictly talking unit tests that need to be run as frequently as possible.
Response roundup: Thanks for the great responses so far. Most advice seems to be: don't worry about the speed; concentrate on quality and just selectively run tests if they are too slow. Answers with specific numbers have included aiming for <10ms up to 0.5 or 1 second per test, or just keeping the entire suite of commonly run tests under 10 seconds.
Not sure whether it's right to mark one as an "accepted answer" when they're all helpful :)
All unit tests should run in under a second (that is, all unit tests combined should run in 1 second). Now I'm sure this has practical limits, but I've had a project with 1000 tests that ran this fast on a laptop. You'll really want this speed so your developers don't dread refactoring some core part of the model (i.e., "Lemme go get some coffee while I run these tests"...10 minutes later he comes back).
This requirement also forces you to design your application correctly. It means that your domain model is pure and contains zero references to any type of persistence (file I/O, database, etc.). Unit tests are all about testing those business relationships.
Now that doesn't mean you ignore testing your database or persistence. But those concerns are now isolated behind repositories that can be separately tested with integration tests located in a separate project. You run your unit tests constantly while writing domain code, and then run your integration tests once on check-in.
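A minimal sketch of that isolation in Python; the order/repository names and the in-memory test double are illustrative, not from the answer above:

    class InMemoryOrderRepository:
        # Test double: keeps the domain tests free of file/database I/O.
        def __init__(self):
            self._orders = {}

        def save(self, order_id, order):
            self._orders[order_id] = order

        def load(self, order_id):
            return self._orders[order_id]

    def test_order_total_is_sum_of_line_items():
        repo = InMemoryOrderRepository()
        repo.save(1, {"lines": [10, 20, 5]})
        order = repo.load(1)
        assert sum(order["lines"]) == 35   # pure domain logic, no persistence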
The goal is 100s of tests per second. The way you get there is by following Michael Feathers' rules of unit tests.
An important point that came up in a past CITCON discussion is that if your tests aren't this fast it is quite likely that you aren't getting the design benefits of unit testing.
If we're talking strictly unit tests, I'd aim more for completeness than speed. If the run time starts to cause friction, separate the test into different project/classes etc., and only run the tests related to what you're working on. Let the Integration server run all the tests on checkin.
I tend to focus more on readability of my tests than speed. However, I still try to make them reasonably fast. I think if they run on the order of milliseconds, you are fine. If they run a second or more per test... then you might be doing something that should be optimized.
Slow tests only become a problem as the system matures and causes the build to take hours. At that point you are more likely running into an issue of a lot of kind-of-slow tests rather than one or two tests you can optimize easily, so you should probably pay attention RIGHT AWAY if you see lots of tests running for hundreds of milliseconds each (or worse, seconds each), rather than wait until you reach the point of hundreds of tests taking that long (at which point it is going to be really hard to solve the problem).
Even so, it will only reduce the time between when your automated build issues errors... which is ok if it is an hour later (or even a few hours later), I think. The problem is running them before you check in, but this can be avoided by selecting a small subset of tests to run that are related to what you are working on. Just make sure to fix the build if you check in code that breaks tests you didn't run!
We're currently at 270 tests in around 3.something seconds. There are probably around 8 tests that perform file IO.
These are run automatically upon a successful build of our libraries on every engineer's machine. We have more extensive (and time-consuming) smoke testing that is done by the build machine every night, or can be started manually on an engineer's machine.
As you can see we haven't yet reached the problem of tests being too time consuming. 10 seconds for me is the point where it starts to become intrusive, when we start to approach that it'll be something we'll take a look at. We'll likely move the lower level libraries, which are more robust since they change infrequently and have few dependencies, into the nightly builds, or a configuration where they're only executed by the build machine.
If you find it's taking more than a few seconds to run a hundred or so tests you may need to examine what you are classifying as a unit test and whether it would be better treated as a smoke test.
Your mileage will obviously be highly variable depending on your area of development.
Data Point -- Python Regression Tests
Here are the numbers on my laptop for running "make test" for Python 2.5.2:
number of tests: 3851 (approx)
execution time: 9 min, 6 sec
execution rate: 7 tests / sec
One of the most important rules about unit tests is they should run fast.
How long is too long for an individual unit test?
Developers should be able to run the whole suite of unit tests in seconds, and definitely not in minutes and minutes. Developers should be able to run them quickly after changing the code in any way. If it takes too long, they won't bother running them and you lose one of the main benefits of the tests.
What kind of execution rate do you aim for with your unit tests (# test per second)?
You should aim for each test to run in an order of milliseconds, anything over 1 second is probably testing too much.
We currently have about 800 tests that run in under 30 seconds, about 27 tests per second. This includes the time to launch the mobile emulator needed to run them. Most of them each take 0-5ms (if I remember correctly).
We have one or two that take about 3 seconds, which are probably candidates for checking, but the important thing is the whole test suite doesn't take so long that it puts off developers running it, and doesn't significantly slow down our continuous integration build.
We also have a configurable timeout limit set to 5 seconds -- anything taking longer will fail.
I judge my unit tests on a per-test basis, not by the number of tests per second. The rate I aim for is 500ms or less. If a test is above that, I will look into it to find out why it is taking so long.
When I think a test is too slow, it usually means that it is doing too much. Therefore, just refactoring the test by splitting it into more tests usually does the trick. The other time I have noticed my tests running slow is when a test reveals a bottleneck in my code, in which case a refactoring of the code is in order.
How long is too long for an individual unit test?
I'd say it depends on the compile speed. One usually executes the tests at every compile. The objective of unit testing is not to slow you down, but to bring the message "nothing broken, go on" (or "something broke, STOP").
I do not bother about test execution speed until this is something that starts to get annoying.
The danger is to stop running the tests because they're too slow.
Finally, when you do decide the tests need to run faster, what techniques do you use to speed up your tests?
The first thing to do is to find out why they are too slow, and whether the issue is in the unit tests or in the code under test.
I'd try to break the test suite into several logical parts, running only the part that is supposedly affected by the code I changed at every compile. I'd run the other suites less often, perhaps once a day, or whenever I suspect I could have broken something, and at least before integrating.
Some frameworks provide automatic execution of specific unit tests based on heuristics such as last-modified time. For Ruby and Rails, AutoTest provides much faster and more responsive execution of the tests: when I save a Rails model app/models/foo.rb, the corresponding unit tests in test/unit/foo_test.rb get run.
I don't know if anything similar exists for other platforms, but it would make sense.
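A toy version of that heuristic in Python; the src/foo.py-to-tests/test_foo.py layout and the use of pytest are assumptions for illustration:

    import os
    import subprocess
    import sys

    def run_tests_for_changed_sources(last_run_timestamp):
        # Map each recently modified source file to its test file by naming
        # convention, and run just those tests.
        for name in os.listdir("src"):
            src = os.path.join("src", name)
            if name.endswith(".py") and os.path.getmtime(src) > last_run_timestamp:
                test = os.path.join("tests", f"test_{name}")
                if os.path.exists(test):
                    subprocess.run([sys.executable, "-m", "pytest", test],
                                   check=False)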