Help with TDD approach to a real world problem: linker - unit-testing

I'm trying to learn TDD. I've seen examples and discussions about how easy it is to TDD a coffee vending machine's firmware from the smallest possible functionality up. These examples are either primitive or very well thought out; it's hard to tell right away. But here's a real world problem.
Linker.
A linker, at its simplest, reads one object file, does magic, and writes one executable file. I don't think I can simplify it further. I do believe the linker design may be evolved, but I have absolutely no idea where to start. Any ideas on how to approach this?
Well, probably the whole linker is too big a problem for the first unit test. I can envision some rough structure beforehand. What a linker does is:
1. Represents an object file as a collection of segments. Segments contain code, data, symbol definitions and references, debug information, etc.
2. Builds a reference graph and decides which segments to keep.
3. Packs the remaining segments into a contiguous address space according to some rules.
4. Relocates references.
My main problem is with bullet 1. Steps 2, 3, and 4 basically take a regular data structure and convert it into a platform-dependent mess based on some configuration. I can design that, and the design looks feasible. But step 1 has to take a platform-dependent mess, in one of several supported formats, and convert it into a regular structure.
The task looks generic enough. It comes up everywhere you need to support multiple input formats, be it image processing, document processing, you name it. Is it possible to TDD? It seems like either the test is too simple and I can easily hack it to green, or it's a bit more complex and I need to implement a whole object/image/document format reader, which is a lot of code. And there is no middle ground.

First, have a look at "Growing Object Oriented Software Guided By Tests" by Freeman & Pryce.
Now, my attempt to answer a difficult question in a few lines.
TDD does require you to think (i.e. design) about what you're going to do. You have to:
Think in small steps. Very small steps.
1. Write a short test, to prove that the next small piece of behaviour works.
2. Run the test to show that it fails.
3. Do the simplest thing possible to get the test to pass.
4. Refactor ruthlessly to remove duplication and improve the structure of the code.
5. Run the test(s) again to make sure it all still works.
6. Go back to 1.
An initial idea (design) of how your linker might be structured will guide your initial tests. The tests will enforce a modular design (because each test is only testing a single behaviour, and there should be minimal dependencies on other code you've written).
As you proceed you may find your ideas change. The tests you've already written will allow you to refactor with confidence.
The tests should be simple. It is easy to 'hack' a single test to green. But after each 'hack' you refactor. If you see the need for a new class or algorithm during the refactoring, then write tests to drive out its interface. Make sure that the tests only ever test a single behaviour by keeping your modules loosely coupled (dependency injection, abstract base classes, interfaces, function pointers etc.) and use fakes, stubs and mocks to isolate the code under test from the rest of your system.
Finally use 'customer' tests to ensure that you have delivered functional features.
It's a difficult change in mind-set, but a lot of fun and very rewarding. Honest.

You're right, a linker seems a bit bigger than a 'unit' to me, and TDD does not excuse you from sitting down and thinking about how you're going to break down your problem into units. The Sudoku saga is a good illustration of what goes wrong if you don't think first!
Concentrating on your point 1, you have already described a good collection of units (of functionality) by listing the kinds of things that can appear in segments, and hinting that you need to support multiple formats. Why not start by dealing with a simple case like, say, a file containing just a data segment in the binary format of your development platform? You could simply hard-code the file as a binary array in your test, and then check that it interprets just that correctly. Then pick another simple case, and test for that. Keep going.
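To make that first step concrete, here is a minimal sketch of such a test. Everything in it is invented for illustration: the `read_segments` function, the toy "OBJ1" format (4-byte magic, then a 1-byte segment type, a 2-byte length, and the payload) - it is not any real object file format.

```python
import struct

def read_segments(blob: bytes):
    """Parse a toy object format: a 4-byte magic, then one segment
    stored as (1-byte type, 2-byte little-endian length, payload)."""
    if blob[:4] != b"OBJ1":
        raise ValueError("not an object file")
    seg_type = blob[4]
    (length,) = struct.unpack_from("<H", blob, 5)
    payload = blob[7:7 + length]
    return [{"type": seg_type, "data": payload}]

# The test: a hand-built file containing just one data segment,
# hard-coded as a binary array, exactly as suggested above.
blob = b"OBJ1" + bytes([2]) + struct.pack("<H", 4) + b"\xde\xad\xbe\xef"
segments = read_segments(blob)
assert segments == [{"type": 2, "data": b"\xde\xad\xbe\xef"}]
```

The second and third simple cases would add a second hard-coded blob each; the repetition that appears is exactly the structure you then refactor out.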
Now the magic bit is that pretty soon you'll see repeated structures in your code and in your tests, and because you've got tests you can be quite aggressive about refactoring them away. I suspect this is the bit that you haven't experienced yet, because you say "It seems like either the test is too simple and I can easily hack it to green, or it's a bit more complex and I need to implement a whole object/image/document format reader, which is a lot of code. And there is no middle ground." The point is that you should hack them all to green, but as you're doing that you are also searching out the patterns in your hacks.
I wrote a (very simple) compiler in this fashion, and it mostly worked quite well. For each syntactic construction, I wrote the smallest program that I could think of which used it in some observable way, and had the test compile the program and check that it worked as expected. I used a proper parser generator as you can't plausibly TDD your way into one of them (you need to use a little forethought!) After about three cycles, it became obvious that I was repeating the code to walk the syntax tree, so that was refactored into something like a Visitor.
I also had larger-scale acceptance tests, but in the end I don't think these caught much that the unit tests didn't.

This is all very possible.
One example off the top of my head is NHAML.
It's an ASP.NET view engine that converts plain text into native .NET code.
You can have a look at its source code and see how it is tested.

I guess what I do is come up with layers and blocks and sub-divide to the point where I might be thinking about code and then start writing tests.
I think your tests should be quite simple: it's not the individual tests that are the power of TDD but the sum of the tests.
One of the principles I follow is that a method should fit on a screen - when that's the case, the tests are usually simple enough.
Your design should allow you to mock out lower layers so that you're only testing one layer.

TDD is about specification, not test.
From your simplest spec of a linker, your first TDD test just has to check whether an executable file has been created during the linker magic when you feed it an object file.
Then you write a linker that makes your test succeed, e.g.:
check whether the input file is an object file
if so, generate a "Hello World!" executable (note that your spec didn't specify that different object files would produce different executables)
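A sketch of that first spec-level test and the deliberately dumb linker that passes it. The `link` function, the "OBJ1" magic, and the canned output are all hypothetical stand-ins, not a real format or API:

```python
import os
import tempfile

def link(object_path: str, output_path: str) -> None:
    """Simplest thing that could possibly pass: check the input is an
    object file, then emit a canned "Hello World" executable."""
    with open(object_path, "rb") as f:
        if f.read(4) != b"OBJ1":              # is it an object file?
            raise ValueError("not an object file")
    with open(output_path, "wb") as f:
        f.write(b"HELLO WORLD EXECUTABLE")    # hard-coded on purpose

# The test: feeding the linker an object file must produce an output file.
with tempfile.TemporaryDirectory() as d:
    obj = os.path.join(d, "main.o")
    exe = os.path.join(d, "a.out")
    with open(obj, "wb") as f:
        f.write(b"OBJ1" + b"\x00" * 16)
    link(obj, exe)
    assert os.path.exists(exe)
    with open(exe, "rb") as f:
        exe_bytes = f.read()
```

Note how the hard-coded output is honest: nothing in the spec yet says different inputs must produce different executables.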
Then you refine your spec and your tests (these are your four bullets).
As long as you can write a specification you can write TDD test cases.


Is my code really not unit-testable?

A lot of code in a current project is directly related to displaying things using a 3rd-party 3D rendering engine. As such, it's easy to say "this is a special case, you can't unit test it". But I wonder if this is a valid excuse... it's easy to think "I am special", but that's rarely actually the case.
Are there types of code which are genuinely not suited for unit-testing? By suitable, I mean "without it taking longer to figure out how to write the test than is worth the effort"... when dealing with a ton of 3D math/rendering, it could take a lot of work to prove the output of a function is correct, compared with just looking at the rendered graphics.
Code that directly relates to displaying information, generating images and even general UI stuff, is sometimes hard to unit-test.
However that mostly applies only to the very top level of that code. Usually 1-2 method calls below the "surface" is code that's easily unit tested.
For example, it may be nontrivial to test that some information is correctly animated into the dialog box when a validation fails. However, it's very easy to check if the validation would fail for any given input.
Make sure to structure your code in a way that the "non-testable" surface area is well-separated from the rest, and write extensive tests for the non-surface code.
The point of unit-testing your rendering code is not to demonstrate that the third-party-code does the right thing (that is for integration and regression testing). The point is to demonstrate that your code gives the right instructions to the third-party code. In other words, you only have to control the input of your code layer and verify the output (which would become the input of the renderer).
Of course, you can create a mock version of the renderer which does cheap ASCII graphics or something, and then verify the pseudo-graphics if you want and this makes the test clearer if you want, but it is not strictly necessary for a unit test of your code.
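A minimal sketch of that idea. The `FakeRenderer` and the `draw_unit_quad` code under test are invented names; the only point is that the fake records what instructions it received, so the test verifies your layer's output without any real engine:

```python
class FakeRenderer:
    """Stands in for the third-party engine; it just records calls."""
    def __init__(self):
        self.calls = []

    def draw_triangle(self, a, b, c):
        self.calls.append(("draw_triangle", a, b, c))

def draw_unit_quad(renderer):
    """Code under test: our layer should split a quad into two triangles."""
    renderer.draw_triangle((0, 0), (1, 0), (1, 1))
    renderer.draw_triangle((0, 0), (1, 1), (0, 1))

# Verify the instructions our code sends to the engine, not the pixels.
fake = FakeRenderer()
draw_unit_quad(fake)
assert len(fake.calls) == 2
assert fake.calls[0] == ("draw_triangle", (0, 0), (1, 0), (1, 1))
```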
If you cannot break your code into units, it is very hard to unit test.
My guess would be that if you have 3D atomic functions (say translate, rotate, and project a point) they should be easily testable - create a set of test points and test whether the transformation takes each point to where it should go.
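For instance, such point tests can be a few lines each; the function names here are just one plausible shape for them:

```python
import math

def rotate_z(p, angle):
    """Rotate point (x, y, z) about the z-axis by angle radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(p, d):
    """Translate a point by a displacement vector."""
    return tuple(pi + di for pi, di in zip(p, d))

# Known points, known answers: a quarter turn sends (1,0,0) to (0,1,0).
rx, ry, rz = rotate_z((1.0, 0.0, 0.0), math.pi / 2)
assert abs(rx) < 1e-9 and abs(ry - 1.0) < 1e-9 and rz == 0.0
assert translate((1, 2, 3), (10, 20, 30)) == (11, 22, 33)
```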
If you can only reach the 3D code through a limited API, then it would be hard to test.
Please see Misko Hevery's Testability posts and his testability guide.
I think this is a good question. I wrestle with this all the time, and it seems like there are certain types of code that fit into the unit testing paradigm and other types that do not.
What I consider clearly unit-testable is code that obviously has room for being wrong. Examples:
Code to compute hairy math or linear algebra functions. I always write an auxiliary function to check the answers, and run it once in a while.
Hairy data structure code, with cross-references, guids, back-pointers, and methods for incrementally keeping it consistent. These are really easy to break, so unit tests are good for seeing if they are broken.
On the other hand, in code with low redundancy, if the code compiles it may not be clear what being wrong even means. For example, I do pretty complicated UIs using dynamic dialogs, and it's not clear what to test. All the kinds of things like event handling, layout, and showing / hiding / updating of controls that might make this code error-prone are simply dealt with in a well-verified layer underneath.
The kind of testing I find myself needing more than unit-testing is coverage testing. Have I tried all the possible features and combinations of features? Since this is a very large space and it is prohibitive to write automated tests to cover it, I often find myself doing Monte Carlo testing instead, where feature selections are chosen at random and submitted to the system. Then the result is examined in an automatic and/or manual way.
If you can grab the rendered image, you can unit test it.
Simply render some images with the current codebase, see if they "look right" (examining them down to the pixel if you have to), and store them for comparison. Your unit tests could then compare to those stored images and see if the result is the same.
Whether or not this is worth the trouble, that's for you to decide.
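The skeleton of such a golden-image test, assuming you can grab the rendered output as raw bytes; `render_scene` is a stand-in for whatever produces your frame buffer:

```python
def render_scene():
    """Stand-in for grabbing the real rendered frame as raw bytes."""
    return bytes([255, 0, 0] * 4)    # a 2x2 all-red "image"

# Stored once, after rendering with the current codebase and
# eyeballing the result down to the pixel.
GOLDEN = bytes([255, 0, 0] * 4)

def test_render_matches_golden():
    image = render_scene()
    assert image == GOLDEN, "rendered output drifted from stored image"

test_render_matches_golden()
```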
Break down the rendering into steps and test by comparing the frame buffer for each step to a known good images.
No matter what you have, it can be broken down to numbers which can be compared. The real trick is when you have some random number generator in the algorithm, or some other nondeterministic part.
With things like floating point, you might need to subtract the generated data from the expected data and check that the difference is less than some error threshold.
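That threshold comparison might be sketched like this (the helper name and tolerance are arbitrary choices):

```python
def buffers_close(actual, expected, tolerance=1e-6):
    """Compare two float buffers element-wise against an error threshold."""
    return len(actual) == len(expected) and all(
        abs(a - e) <= tolerance for a, e in zip(actual, expected)
    )

expected = [0.1, 0.2, 0.3]
actual = [0.1 + 1e-9, 0.2, 0.3 - 1e-9]   # tiny floating-point drift
assert buffers_close(actual, expected)           # within threshold: pass
assert not buffers_close([0.5, 0.2, 0.3], expected)  # real difference: fail
```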
Well, you can't unit test certain kinds of exception code, but other than that...
I've got true unit tests for some code that looks impossible to even attach a test harness to and code that looks like it should be unit testable but isn't.
One of the ways you know your code is not unit testable is when it depends on the physical characteristics of the device it runs on. Another kind of not unit-testable code is direct UI code (and I find a lot of breaks in direct UI code).
I've also got a huge chunk of non unit-testable code that has appropriate integration tests.

TDD with unclear requirements

I know that TDD helps a lot and I like this method of development, where you first create a test and then implement the functionality. It is a very clear and correct way of working.
But due to the nature of my projects, it often happens that when I start to develop some module I know very little about what I want and how it will look in the end. The requirements appear as I develop; there may be 2 or 3 iterations where I delete all or part of the old code and write new code.
I see two problems:
1. I want to see the result as soon as possible, to understand whether my ideas are right or wrong. Unit tests slow down this process. So it often happens that I write unit tests after the code is finished, which is known to be a bad pattern.
2. If I write the tests first, I need to rewrite not only the code two or more times but also the tests. It takes a lot of time.
Could someone please tell me how can TDD be applied in such situation?
Thanks in advance!
I want to see the result as soon as possible, to understand whether my ideas are right or wrong. Unit tests slow down this process.
I disagree. Unit tests and TDD can often speed up getting results because they force you to concentrate on the results rather than implementing tons of code that you might never need. It also allows you to run the different parts of your code as you write them so you can constantly see what results you are getting, rather than having to wait until your entire program is finished.
I find that TDD works particularly well in this kind of situation; in fact, I would say that having unclear and/or changing requirements is actually very common.
I find that the best use of TDD is ensuring that your code is doing what you expect it to do. When you're writing any code, you should know what you want it to do, whether the requirements are clear or not. The strength of TDD here is that if there is a change in the requirements, you can simply change one or more of your unit tests to reflect the changed requirements, and then update your code while being sure that you're not breaking other (unchanged) functionality.
I think that one thing that trips up a lot of people with TDD is the assumption that all tests need to be written ahead of time. I think it's more effective to use the rule of thumb that you never write any implementation code while all of your tests are passing; this simply ensures that all code is covered, while also ensuring that you're checking that all code does what you want it to do without worrying about writing all your tests up front.
IMHO, your main problem is that you have to delete some code. This is waste, and this is what should be addressed first.
Perhaps you could prototype, or utilize "spike solutions" to validate the requirements and your ideas then apply TDD on the real code, once the requirements are stable.
The risk is to apply this and to have to ship the prototype.
Also, you could test-drive the "sunny path" first, and only implement the rest, such as error handling, after the requirements have been pinned down. However, the second phase of the implementation will be less motivating.
What development process are you using? It sounds agile, since you're having iterations, but not in an environment that fully supports it.
TDD will, for just about anybody, slow down initial development. So, if initial development speed is 10 on a 1-10 scale, with TDD you might get around an 8 if you're proficient.
It's the development after that point that gets interesting. As projects get larger, development efficiency typically drops - often to 3 on the same scale. With TDD, it's very possible to still stay in the 7-8 range.
Look up "technical debt" for a good read. As far as I'm concerned, any code without unit tests is effectively technical debt.
TDD helps you to express the intent of your code. This means that when writing the test, you have to say what you expect from your code. How your expectations are fulfilled is then secondary (this is the implementation). Ask yourself the question: "What is more important, the implementation, or the functionality provided?" If it is the implementation, then you don't have to write the tests. If it is the functionality provided, then writing the tests first will help you with this.
Another valuable thing is that with TDD, you will not implement functionality that will not be needed. You only write the code needed to satisfy the intent. This is also called YAGNI (You ain't gonna need it).
There's no getting away from it - if you're measuring how long it takes to code just by how long it takes you to write classes, etc., then it'll take longer with TDD. If you're experienced it'll add about 15%; if you're new it'll take at least 60% longer, if not more.
BUT, overall you'll be quicker. Why?
by writing a test first you're specifying what you want and delivering just that and nothing more - hence saving time writing unused code
without tests, you might think that the results are so obvious that what you've done is correct - when it isn't. Tests demonstrate that what you've done is correct.
you will get faster feedback from automated tests than by doing manual testing
with manual testing the time taken to test everything as your application grows increases rapidly - which means you'll stop doing it
with manual tests it's easy to make mistakes and 'see' something passing when it isn't, this is especially true if you're running them again and again and again
(good) unit tests give you a second client to your code which often highlights design problems that you might miss otherwise
Add all this up, and if you measure from inception to delivery, TDD is much, much faster - you get fewer defects, you're taking fewer risks, you progress at a steady rate (which makes estimation easier), and the list goes on.
TDD will make you faster, no question, but it isn't easy and you should allow yourself some space to learn and not get disheartened if initially it seems slower.
Finally you should look at some techniques from BDD to enhance what you're doing with TDD. Begin with the feature you want to implement and drive down into the code from there by pulling out stories and then scenarios. Concentrate on implementing your solution scenario by scenario in thin vertical slices. Doing this will help clarify the requirements.
Using TDD could actually make you write code faster - not being able to write a test for a specific scenario could mean that there is an issue in the requirements.
When you TDD, you should find these problematic places faster, instead of after writing 80% of your code.
There are a few things you can do to make your tests more resistant to change:
Try to reuse code inside your tests, in the form of factory methods that create your test objects along with verify methods that check the test results. This way, if some major behaviour change occurs in your code, you have less code to change in your tests.
Use an IoC container instead of passing arguments to your main classes - again, if a method signature changes you do not need to change all of your tests.
Make your unit tests short and Isolated - each test should check only one aspect of your code and use Mocking/Isolation framework to make the test independent of external objects.
Test and write code for only the required feature (YAGNI). Try to ask yourself what value my customer will receive from the code I'm writing. Don't create overcomplicated architecture instead create the needed functionality piece by piece while refactoring your code as you go.
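The factory-method idea from the first point might look like this; the `Order` class and its fields are invented purely for illustration:

```python
class Order:
    def __init__(self, customer, items, discount):
        self.customer = customer
        self.items = items
        self.discount = discount

    def total(self):
        return sum(self.items) * (1 - self.discount)

def make_order(customer="alice", items=(10.0, 5.0), discount=0.0):
    """The single place where tests build Orders; if the constructor
    changes, only this factory changes, not every test."""
    return Order(customer, list(items), discount)

# Tests only override what they care about.
assert make_order().total() == 15.0
assert make_order(discount=0.5).total() == 7.5
```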
Here's a blog post I found potent in explaining the use of TDD on a very iterative design process scale: http://blog.extracheese.org/2009/11/how_i_started_tdd.html.
Joshua Bloch commented on something similar in the book "Coders at Work". His advice was to write examples of how the API would be used (about a page in length). Then think about the examples and the API a lot and refactor the API. Then write the specification and the unit tests. Be prepared, however, to refactor the API and rewrite the spec as you implement the API.
When I deal with unclear requirements, I know that my code will need to change. Having solid tests helps me feel more comfortable changing my code. Practising TDD helps me write solid tests, and so that's why I do it.
Although TDD is primarily a design technique, it has one great benefit in your situation: it encourages the programmer to consider details and concrete scenarios. When I do this, I notice that I find gaps or misunderstandings or lack of clarity in requirements quite quickly. The act of trying to write tests forces me to deal with the lack of clarity in the requirements, rather than trying to sweep those difficulties under the rug.
So when I have unclear requirements, I practise TDD both because it helps me identify the specific requirements issues that I need to address, but also because it encourages me to write code that I find easier to change as I understand more about what I need to build.
In this early prototype-phase I find it to be enough to write testable code. That is, when you write your code, think of how to make it possible to test, but for now, focus on the code itself and not the tests.
You should have the tests in place when you commit something though.

Refactoring and Test Driven Development

I'm Currently reading two excellent books "Working Effectively with Legacy Code" and "Clean Code".
They are making me think about the way I write and work with code in completely new ways but one theme that is common among them is test driven development and the idea of smothering everything with tests and having tests in place before you make a change or implement a new piece of functionality.
This has led to two questions:
Question 1:
If I am working with legacy code, according to the books I should put tests in place to ensure I'm not breaking anything. Say I have a method 500 lines long; I would assume I'll have a set of equivalent testing methods to test that method. When I split this function up, do I create new tests for each new method/class that results?
According to "Clean Code", any test that takes longer than 1/10th of a second is a test that takes too long. Trying to test a 500-line legacy method that goes to databases and does god knows what else could well take longer than 1/10th of a second. While I understand you need to break dependencies, what I'm having trouble with is the initial test creation.
Question 2:
What happens when the code is refactored so much that structurally it no longer resembles the original code (new parameters added to or removed from methods, etc.)? It would follow that the tests will need refactoring also. In that case you could potentially be altering the functionality of the system while still allowing the tests to keep passing. Is refactoring tests an appropriate thing to do in this circumstance?
While it's OK to plod on with assumptions, I was wondering whether there are any thoughts/suggestions on such matters from collective experience.
That's the deal when working with legacy code. Legacy meaning a system with no tests and which is tightly coupled. When adding tests for that code, you are effectively adding integration tests. When you refactor and add the more specific test methods that avoid the network calls, etc., those become your unit tests. You want to keep both, just keep them separate; that way most of your unit tests will run fast.
You do that in really small steps. You actually switch continually between tests and code, and you are correct, if you change a signature (small step) related tests need to be updated.
Also check my "update 2" on How can I improve my junit tests. It isn't about legacy code and dealing with the coupling it already has, but on how you go about writing logic + tests where external systems are involved i.e. databases, emails, etc.
The 0.1s unit test run time is fairly silly. There's no reason unit tests shouldn't use a network socket, read a large file or other hefty operations if they have to. Yes it's nice if the tests run quickly so you can get on with the main job of writing the application but it's much nicer to end up with the best result at the end and if that means running a unit test that takes 10s then that's what I'd do.
If you're going to refactor the key is to spend as much time as you need to understand the code you are refactoring. One good way of doing that would be to write a few unit tests for it. As you grasp what certain blocks of code are doing you could refactor it and then it's good practice to write tests for each of your new methods as you go.
Yes, create new tests for new methods.
I'd see the 1/10 of a second as a goal you should strive for. A slower test is still much better than no test.
Try not to change the code and the test at the same time. Always take small steps.
When you've got a lengthy legacy method that does X (and maybe Y and Z because of its size), the real trick is not breaking the app by 'fixing' it. The tests on the legacy app have preconditions and postconditions and so you've got to really know those before you go breaking it up. The tests help to facilitate that. As soon as you break that method into two or more new methods, obviously you need to know the pre/post states for each of those and so tests for those 'keep you honest' and let you sleep better at night.
I don't tend to worry too much about the 1/10th of a second assertion. Rather, the goal when I'm writing unit tests is to cover all my bases. Obviously, if a test takes a long time, it might be because what is being tested is simply way too much code doing way too much.
The bottom line is that you definitely don't want to take what is presumably a working system and 'fix' it to the point that it works sometimes and fails under certain conditions. That's where the tests can help. Each of them expects the world to be in one state at the beginning of the test and a new state at the end. Only you can know if those two states are correct. All the tests can 'pass' and the app can still be wrong.
Anytime the code gets changed, the tests will possibly change and new ones will likely need to be added to address changes made to the production code. Those tests work with the current code - doesn't matter if the parameters needed to change, there are still pre/post conditions that have to be met. It isn't enough, obviously, to just break up the code into smaller chunks. The 'analyst' in you has to be able to understand the system you are building - that's job one.
Working with legacy code can be a real chore depending on the 'mess' you start with. I really find that knowing what you've got and what it is supposed to do (and whether it actually does it at step 0 before you start refactoring it) is key to a successful refactoring of the code. One goal, I think, is that I ought to be able to toss out the old stuff, stick my new code in its place and have it work as advertised (or better). Depending on the language it was written in, the assumptions made by the original author(s) and the ability to encapsulate functionality into containable chunks, it can be a real trick.
Best of luck!
Here's my take on it:
No and yes. The first thing is to have a unit test that checks the output of that 500-line method. Only then do you begin thinking about splitting it up. Ideally the process goes like this:
Write a test for the original legacy 500-line behemoth
Figure out, marking first with comments, what blocks of code you could extract from that method
Write a test for each block of code. All will fail.
Extract the blocks one by one. Concentrate on getting all the methods to go green one at a time.
Rinse and repeat until you've finished the whole thing
After this long process you will realize that it might make sense for some methods to be moved elsewhere, or that several repetitive ones can be reduced to a single function; this is how you know that you succeeded. Edit the tests accordingly.
Go ahead and refactor, but as soon as you need to change signatures make the changes in your test first before you make the change in your actual code. That way you make sure that you're still making the correct assertions given the change in method signature.
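The extract-and-test loop above, in miniature. The "behemoth" here is obviously a tiny stand-in for the real 500-line method:

```python
def report(prices):
    """Imagine this is the 500-line behemoth. First, one test pins
    down its current observable behaviour."""
    total = sum(prices)
    return f"items={len(prices)} total={total}"

assert report([3, 4]) == "items=2 total=7"   # characterization test

# Next step: extract one block, give it its own test, and keep the
# original characterization test green throughout.
def total_of(prices):
    return sum(prices)

def report_v2(prices):
    return f"items={len(prices)} total={total_of(prices)}"

assert total_of([3, 4]) == 7                  # test for the new method
assert report_v2([3, 4]) == report([3, 4])    # behaviour unchanged
```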
Question 1: "When I split this function up, do I create new tests for each new method/class that results?"
As always, the real answer is "it depends". If it is appropriate, it may be simpler, when refactoring some gigantic monolithic method into smaller methods that handle different component parts, to make your new methods private/protected and leave your existing API intact, in order to continue to use your existing unit tests. If you need to test your newly split-off methods, sometimes it is advantageous to just mark them as package private so that your unit testing classes can get at them but other classes cannot.
Question 2: "What happens when the code is re-factored so much that structurally it no longer resembles the original code?"
My first piece of advice here is that you need a good IDE and a good knowledge of regular expressions - try to do as much of your refactoring with automated tools as possible. This can help save time, if you are cautious enough not to introduce new problems. As you said, you have to change your unit tests - but if you used good OOP principles (you did, right?), then it shouldn't be so painful.
Overall, it is important to ask yourself with regards to the refactor do the benefits outweigh the costs? Am I just fiddling around with architectures and designs? Am I doing a refactor in order to understand the code and is it really needed? I would consult a coworker who is familiar with the code base for their opinion on the cost/benefits of your current task.
Also remember that the theoretical ideal you read in books needs to be balanced with real world business needs and time schedules.

Testing When Correctness is Poorly Defined?

I generally try to use unit tests for any code that has easily defined correct behavior given some reasonably small, well-defined set of inputs. This works quite well for catching bugs, and I do it all the time in my personal library of generic functions.
However, a lot of the code I write is data mining code that basically looks for significant patterns in large datasets. Correct behavior in this case is often not well defined and depends on a lot of different inputs in ways that are not easy for a human to predict (i.e. the math can't reasonably be done by hand, which is why I'm using a computer to solve the problem in the first place). These inputs can be very complex, to the point where coming up with a reasonable test case is near impossible. Identifying the edge cases that are worth testing is extremely difficult. Sometimes the algorithm isn't even deterministic.
Usually, I do the best I can by using asserts for sanity checks and creating a small toy test case with a known pattern and informally seeing if the answer at least "looks reasonable", without it necessarily being objectively correct. Is there any better way to test these kinds of cases?
I think you just need to write unit tests based on small sets of data that will make sure your code is doing exactly what you want it to do. Whether this gives you a reasonable data-mining algorithm is a separate issue, and I don't think it is possible to solve it with unit tests. There are two "levels" of correctness of your code:
Your code is correctly implementing the given data mining algorithm (this thing you should unit-test)
The data mining algorithm you implement is "correct" - it solves the business problem. This is quite an open question; it probably depends both on some parameters of your algorithm and on the actual data (different algorithms work for different types of data).
When facing cases like this I tend to build one or more stub data sets that reflect the proper underlying complexities of the real-life data. I often do this together with the customer, to make sure I capture the essence of the complexities.
Then I can just codify these into one or more datasets that can be used as the basis for very specific unit tests (sometimes they're more like integration tests with stub data, but I don't think that's an important distinction). So while your algorithm may have "fuzzy" results for a "generic" dataset, these algorithms almost always have a single correct answer for a specific dataset.
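For instance (everything below is hypothetical - a toy miner and a hand-built stub dataset), a dataset with a deliberately planted pattern gives the algorithm a single correct answer to assert against:

```ruby
# Hypothetical miner: returns items appearing in at least min_support
# fraction of the transactions (a trivial stand-in for a real algorithm).
def frequent_items(transactions, min_support)
  counts = Hash.new(0)
  transactions.each { |t| t.uniq.each { |item| counts[item] += 1 } }
  threshold = transactions.size * min_support
  counts.select { |_, c| c >= threshold }.keys.sort
end

# Stub dataset built to contain one known pattern: "milk" is in 4 of 5 baskets.
baskets = [
  %w[milk bread],
  %w[milk eggs],
  %w[milk bread eggs],
  %w[milk],
  %w[cheese]
]

# For this specific dataset there is exactly one correct answer.
raise "expected [milk]" unless frequent_items(baskets, 0.8) == ["milk"]
```

The fuzziness lives in choosing `min_support` for real data; the unit test only pins down that the implementation counts correctly on a dataset small enough to verify by hand.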
Well, there are a few answers.
First of all, as you mentioned, take a small case study, and do the math by hand. Since you wrote the algorithm, you know what it's supposed to do, so you can do it in a limited case.
The other one is to break down every component of your program into testable parts.
If A calls B, B calls C, and C calls D, and you know that A, B, C, and D all give the right answer on their own, then by testing A->B, B->C, and C->D you can be reasonably sure that A->D is giving the correct response.
Also, if there are other programs out there that do what you are looking to do, try to acquire their datasets, or find an open-source project whose test data you could run against your application, and see if it gives similar results.
Another way to test data-mining code is to take a test set, introduce a pattern of the type you're looking for, and then test again to see if the code separates out the new pattern from the old ones.
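The plant-a-pattern technique might be sketched like this (the outlier detector here is a made-up stand-in for whatever pattern your real algorithm mines for):

```ruby
# Hypothetical detector: flags values more than 3 standard deviations
# from the mean (a stand-in for a real pattern-mining algorithm).
def outliers(data)
  mean = data.sum.to_f / data.size
  variance = data.sum { |x| (x - mean)**2 } / data.size
  sd = Math.sqrt(variance)
  data.select { |x| (x - mean).abs > 3 * sd }
end

# Start from an unremarkable baseline...
baseline = Array.new(100) { |i| 50 + (i % 7) }   # values 50..56, no outliers
raise "baseline should be clean" unless outliers(baseline).empty?

# ...then deliberately introduce the kind of pattern you're looking for,
# and test again to see if the code separates it from the old data.
seeded = baseline + [500]
raise "planted outlier missed" unless outliers(seeded) == [500]
```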
And, the tried and true, walk through your own code by hand and see if the code is doing what you meant it to do.
Really, the challenge here is this: because your application is meant to do a fuzzy, non-deterministic kind of task in a smart way, the very goal you hope to achieve is that the application becomes better than human beings at finding these patterns. That's great, powerful, and cool ... but if you pull it off, then it becomes very hard for any human beings to say, "In this case, the answer should be X."
In fact, ideally the computer would say, "Not really. I see why you think that, but consider these 4.2 terabytes of information over here. Have you read them yet? Based on those, I would argue that the answer should be Z."
And if you really succeeded in your original goal, the end user might sometimes say, "Zowie, you're right. That is a better answer. You found a pattern that is going to make us money! (or save us money, or whatever)."
If such a thing could never happen, then why are you asking the computer to detect these kinds of patterns in the first place?
So, the best thing I can think of is to let real life help you build up a list of test scenarios. If there ever was a pattern discovered in the past that did turn out to be valuable, then make a "unit test" that sees if your system discovers it when given similar data. I say "unit test" in quotes because it may be more like an integration test, but you may still choose to use NUnit or VS.Net or RSpec or whatever unit test tools you're using.
For some of these tests, you might somehow try to "mock" the 4.2 terabytes of data (you won't really mock the data, but at some higher level you'd mock some of the conclusions reached from that data). For others, maybe you have a "test database" with some data in it, from which you expect a set of patterns to be detected.
Also, if you can do it, it would be great if the system could "describe its reasoning" behind the patterns it detects. This would let the business user deliberate over the question of whether the application was right or not.
This is tricky; it sounds similar to writing tests around our text search engine. If you keep struggling, you'll figure something out:
Start with a small, simplified but reasonably representative data sample, and test basic behavior doing this
Rather than asserting that the output is exactly some answer, sometimes it's better to figure out what is important about it. For example, for our search engine, I didn't care so much about the exact order the documents were listed, as long as the three key ones were on the first page of results.
As you make a small, incremental change, figure out what the essence of it is and write a test for that. Even though the overall calculations take many inputs, individual changes to the codebase should be isolatable. For example, we found certain documents weren't being surfaced because of the presence of hyphens in some of the key words. We created tests asserting that this behaved how we expected.
Look at tools like FitNesse, which allow you to throw a large number of datasets at a piece of code and assert things about the results. This may be easier to understand than more traditional unit tests.
I've gone back to the product owner, saying "I can't understand how this will work. How will we know if it's right?" Maybe s/he can articulate the essence of the vaguely defined problem. This has worked really well for me many times, and I've talked people out of features because they couldn't be explained.
Be creative!
Ultimately, you have to decide what your program should be doing, and then test for that.
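For example, a test in the "assert what's important, not the exact output" spirit checks that the key documents land on the first page, without pinning down their exact order (the search function and documents below are made up for illustration):

```ruby
require 'set'

# Hypothetical scoring search: ranks documents by how many query terms
# they contain. The exact ranking formula is NOT what we assert on.
def search(docs, query)
  terms = query.downcase.split
  docs.sort_by { |name, text| -terms.count { |t| text.downcase.include?(t) } }
      .map(&:first)
end

docs = {
  "intro"   => "getting started with the linker",
  "formats" => "object file formats elf coff",
  "reloc"   => "relocation of references in segments",
  "faq"     => "miscellaneous questions",
  "changes" => "release notes"
}

results = search(docs.to_a, "object file relocation linker")

# Assert what matters: the three key documents appear in the top four
# results -- their relative order is left unspecified.
first_page = results.first(4).to_set
raise "key docs missing" unless Set["intro", "formats", "reloc"].subset?(first_page)
```

If the ranking formula later changes, this test keeps passing as long as the property that actually matters still holds.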

Should unit tests be written before the code is written?

I know that one of the defining principles of Test driven development is that you write your Unit tests first and then write code to pass those unit tests, but is it necessary to do it this way?
I've found that I often don't know what I am testing until I've written it, mainly because the past couple of projects I've worked on have more evolved from a proof of concept rather than been designed.
I've tried to write my unit tests before and it can be useful, but it doesn't seem natural to me.
Some good comments here, but I think that one thing is getting ignored.
Writing tests first drives your design. This is an important step. If you write the tests "at the same time" or "soon after", you might be missing some of the design benefits of doing TDD in micro steps.
It feels really cheesy at first, but it's amazing to watch things unfold before your eyes into a design that you didn't think of originally. I've seen it happen.
TDD is hard, and it's not for everybody. But if you already embrace unit testing, then try it out for a month and see what it does to your design and productivity.
You spend less time in the debugger and more time thinking about outside-in design. Those are two gigantic pluses in my book.
There have been studies that show that unit tests written after the code has been written are better tests. The caveat though is that people don't tend to write them after the event. So TDD is a good compromise as at least the tests get written.
So if you write tests after you have written code, good for you, I'd suggest you stick at it.
I tend to find that I do a mixture. The more I understand the requirements, the more tests I can write up front. When the requirements - or my understanding of the problem - are weak, I tend to write tests afterwards.
TDD is not about the tests, but how the tests drive your code.
So basically you are writing tests to let an architecture evolve naturally (and don't forget to refactor! Otherwise you won't get much benefit out of it).
That you have an arsenal of regression tests and executable documentation afterwards is a nice side effect, but not the main reason behind TDD.
So my vote is:
Test first
PS: And no, that doesn't mean you don't have to plan your architecture beforehand, but that you might rethink it if the tests tell you to do so!
I've led development teams for the past 6-7 years. What I can say for sure is that, for me and the developers I have worked with, it makes a phenomenal difference in the quality of the code if we know where our code fits into the big picture.
Test Driven Development (TDD) helps us answer "What?" before we answer "How?" and it makes a big difference.
I understand why there may be apprehensions about not following it in PoC-type development/architecture work, and you are right that it may not make complete sense to follow this process there. At the same time, I would like to emphasize that TDD is a process that falls in the Development Phase (I know it sounds obsolete, but you get the point :) when the low-level specifications are clear.
I think writing the test first helps define what the code should actually do. Too many times people don't have a good definition of what the code is supposed to do or how it should work. They simply start writing and make it up as they go along. Creating the test first makes you focus on what the code will do.
Not always, but I find that it really does help when I do.
I tend to write them as I write my code. At most I will write the tests for if the class/module exists before I write it.
I don't plan far enough ahead in that much detail to write a test earlier than the code it is going to test.
I don't know if this is a flaw in my thinking or method's or just TIMTOWTDI.
I start with how I would like to call my "unit" and make it compile.
like:

    picker = Pick.new
    item = picker.pick('a')
    assert item

then I create

    class Pick
      def pick(something)
        return nil
      end
    end
Then I keep on using Pick in my "test" case so I can see how I would like it to be called and how I would treat different kinds of behavior. Whenever I realize I could have trouble at some boundary or with some kind of error/exception, I try to get it to fire and get a new test case.
So, in short. Yes.
For me, the ratio of writing tests before versus after leans heavily toward before.
Directives are suggestions on how you could do things to improve the overall quality or productivity, or even both, of the end product. They are in no way laws to be obeyed, lest you be smitten in a flash by the god of proper coding practice.
Here's my compromise on the take and I found it quite useful and productive.
Usually the hardest part to get right is the requirements, and right behind them the usability of your class, API, package... Then comes the actual implementation.
Write your interfaces (they will change, but will go a long way in knowing WHAT has to be done)
Write a simple program to use the interfaces (them stupid main). This goes a long way in determining the HOW it is going to be used (go back to 1 as often as needed)
Write tests on the interface (The bit I integrated from TDD, again go back to 1 as often as needed)
Write the actual code behind the interfaces
Write tests on the classes and the actual implementation; use a coverage tool to make sure you do not forget weird execution paths
So, yes I write tests before coding but never before I figured out what needs to be done with a certain level of details. These are usually high level tests and only treat the whole as a black box. Usually will remain as integration tests and will not change much once the interfaces have stabilized.
Then I write a bunch of tests (unit tests) on the implementation behind it; these will be much more detailed and will change often as the implementation evolves, as it gets optimized and expanded.
Is this, strictly speaking, TDD? Extreme? Agile? Whatever? I don't know, and frankly I don't care. It works for me. I adjust it as needs change and as my understanding of software development practice evolves.
My 2 cents.
I've been programming for 20 years, and I've virtually never written a line of code that I didn't run some kind of unit test on. Honestly, I know people do it all the time, but how someone can ship a line of code that hasn't had some kind of test run on it is beyond me.
Often if there is no test framework in place I just write a main() into each class I write. It adds a little cruft to your app, but someone can always delete it (or comment it out) if they want I guess. I really wish there was just a test() method in your class that would automatically compile out for release builds--I love my test method being in the same file as my code...
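In Ruby, for instance, a rough equivalent of that per-class main() is a guard that runs only when the file is executed directly (the Adder class here is just a made-up stand-in):

```ruby
# A class with its smoke test in the same file, in the spirit of the
# per-class main() described above.
class Adder
  def add(a, b)
    a + b
  end
end

# This block runs only when the file is executed directly (ruby adder.rb),
# not when it is required by the rest of the application -- so the test
# stays next to the code without polluting release behavior.
if __FILE__ == $PROGRAM_NAME
  adder = Adder.new
  raise "add failed" unless adder.add(2, 3) == 5
  puts "all tests passed"
end
```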
So I've done both Test Driven Development and Tested development. I can tell you that TDD can really help when you are a starting programmer. It helps you learn to view your code "From outside" which is one of the most important lessons a programmer can learn.
TDD also helps you get going when you are stuck. You can just write some very small piece that you know your code has to do, then run it and fix it--it gets addictive.
On the other hand, when you are adding to existing code and know pretty much exactly what you want, it's a toss-up. Your "Other code" often tests your new code in place. You still need to be sure you test each path, but you get a good coverage just by running the tests from the front-end (except for dynamic languages--for those you really should have unit tests for everything no matter what).
By the way, when I was on a fairly large Ruby/Rails project we had a very high % of test coverage. We refactored a major, central model class into two classes. It would have taken us two days, but with all the tests we had to refactor it ended up closer to two weeks. Tests are NOT completely free.
I'm not sure, but from your description I sense that there might be a misunderstanding on what test-first actually means. It does not mean that you write all your tests first. It does mean that you have a very tight cycle of
write a single, minimal test
make the test pass by writing the minimal production code necessary
write the next test that will fail
make all the existing tests pass by changing the existing production code in the simplest possible way
refactor the code (both test and production!) so that it doesn't contain duplication and is expressive
continue with 3. until you can't think of another sensible test
One cycle (3-5) typically just takes a couple of minutes. Using this technique, you actually evolve the design while you write your tests and production code in parallel. There is not much up front design involved at all.
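A minimal sketch of a couple of these cycles, using a hypothetical fizzbuzz function as the unit under test:

```ruby
# Cycle 1: a single, minimal failing test (fizzbuzz doesn't exist yet):
#   raise unless fizzbuzz(3) == "Fizz"    # NameError: red
#
# ...then the minimal production code to make it pass:
def fizzbuzz(n)
  return "Fizz" if n % 3 == 0
  n.to_s
end

# Cycle 2: the next failing test:
#   raise unless fizzbuzz(5) == "Buzz"    # returns "5": red
#
# ...then the simplest change that makes ALL tests pass (redefining the
# method stands in for editing it in place):
def fizzbuzz(n)
  return "FizzBuzz" if n % 15 == 0
  return "Fizz"     if n % 3 == 0
  return "Buzz"     if n % 5 == 0
  n.to_s
end

# After each green bar: refactor tests and code, then write the next test.
raise unless fizzbuzz(3)  == "Fizz"
raise unless fizzbuzz(5)  == "Buzz"
raise unless fizzbuzz(15) == "FizzBuzz"
raise unless fizzbuzz(2)  == "2"
```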
On the question of it being "necessary" - no, it obviously isn't. There have been countless projects successful without doing TDD. But there is some strong evidence out there that using TDD typically leads to significantly higher quality, often without negative impact on productivity. And it's fun, too!
Oh, and regarding it not feeling "natural", it's just a matter of what you are used to. I know people who are quite addicted to getting a green bar (the typical xUnit sign for "all tests passing") every couple of minutes.
There are so many answers now and they are all different. This perfectly reflects the reality out there. Everyone is doing it differently. I think there is a huge misunderstanding about unit testing. It seems to me as if people heard about TDD and said it's good, then started to write unit tests without really understanding what TDD really is. They just got the part "oh yeah, we have to write tests" and they agree with it. They also heard about this "you should write your tests first", but they do not take it seriously.
I think it's because they do not understand the benefits of test-first, which in turn you can only understand once you've done it this way for some time. And they always seem to find a million excuses why they don't like writing the tests first - because it's too difficult when figuring out how everything will fit together, etc. In my opinion, these are all excuses to hide from their inability to discipline themselves, try the test-first approach, and start to see the benefits.
The most ridiculous thing is when they argue "I'm not convinced about this test-first thing, but I've never done it this way" ... great ...
I wonder where unit testing originally comes from, because if the concept really originates from TDD, then it's just ridiculous how people get it wrong.
Writing the tests first defines what your code will look like - that is, it tends to make your code more modular and testable, so you do not create "bloated" methods with very complex and overlapping functionality. It also helps to isolate all core functionality in separate methods for easier testing.
Personally, I believe unit tests lose a lot of their effectiveness if not done before writing the code.
The age-old problem with testing is that no matter how hard we think about it, we will never come up with every possible scenario to write a test to cover.
Obviously unit testing itself doesn't prevent this completely, as it is restrictive testing - looking at only one unit of code, not covering the interactions between that code and everything else - but it provides a good basis for writing clean code in the first place, which should at least restrict the chances of interaction issues between modules. I've always worked to the principle of keeping code as simple as it possibly can be; in fact, I believe this is one of the key principles of TDD.
So you start off with a test that basically says you can create a class of this type, and build up from there, in theory writing a test for every line of code, or at least covering every route through a particular piece of code. Designing as you go! Obviously this is based on a rough up-front design produced initially, to give you a framework to work to.
As you say, it is very unnatural to start with and can seem like a waste of time, but I've seen first hand that it pays off in the long run, when defect stats come through and show that the modules written fully using TDD have far fewer defects over time than others.
Before, during and after.
Before is part of the spec, the contract, the definition of the work
During is when special cases, bad data, exceptions are uncovered while implementing.
After is maintenance, evolution, change, new requirements.
I don't write the actual unit tests first, but I do make a test matrix before I start coding listing all the possible scenarios that will have to be tested. I also make a list of cases that will have to be tested when a change is made to any part of the program as part of regression testing that will cover most of the basic scenarios in the application in addition to fully testing the bit of code that changed.
Remember, with Extreme Programming your tests effectively are your documentation. So if you don't know what you're testing, then you don't know what you want your application to do.
You can start off with "Stories" which might be something like
"Users can Get list of Questions"
Then you start writing code to satisfy the unit tests. To solve the above you'll need at least a User and a Question class. So then you can start thinking about the fields:
"User class has Name, DOB, Address, TelNo, Locked fields"
etc.
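A sketch of how a first test for that story might look (the User and Question classes here are invented to satisfy the story, not taken from any real codebase):

```ruby
# Written to satisfy the story "Users can get list of questions".
class Question
  attr_reader :title
  def initialize(title)
    @title = title
  end
end

class User
  def initialize
    @questions = []
  end

  def ask(title)
    @questions << Question.new(title)
  end

  # The behavior the story demands: users can get a list of questions.
  def questions
    @questions
  end
end

# The test, written from the story before the classes existed.
user = User.new
user.ask("How do I TDD a linker?")
raise unless user.questions.size == 1
raise unless user.questions.first.title == "How do I TDD a linker?"
```

Fields like DOB or Address only get added when a story actually needs them, which is the point of letting the tests drive the class.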
Hope it helps.
Crafty
Yes, if you are using true TDD principles. Otherwise, as long as you're writing the unit-tests, you're doing better than most.
In my experience, it is usually easier to write the tests before the code, because by doing it that way you give yourself a simple debugging tool to use as you write the code.
I write them at the same time. I create the skeleton code for the new class and the test class, and then I write a test for some functionality (which then helps me to see how I want the new object to be called), and implement it in the code.
Usually, I don't end up with elegant code the first time around; it's normally quite hacky. But once all the tests are working, you can refactor away until you end up with something pretty neat, tidy, and provably rock solid.
When you are writing something you are used to writing, it helps to first write tests for all the things you would regularly check, and then write those features. More often than not, those features are the most important ones for the piece of software you are writing. On the other hand, there are no silver bullets, and things should never be followed to the letter. Developer judgment plays a big role in the decision between test-driven development and test-later development.