Unit testing a compiler

What is considered the best approach to unit test a complex unit such as a compiler?
I've written a few compilers and interpreters over the years, and I do find this kind of code quite hard to test in a good way.
If we take something like abstract syntax tree generation, how would you test it using TDD?
Small constructs might be easy to test, e.g. something along the lines of:
string code = @"public class Foo {}";
AST ast = compiler.Parse(code);
since that won't generate a lot of AST nodes.
But if I actually want to test that the compiler can generate an AST for something like a method:
[TestMethod]
public void Can_parse_integer_instance_method_in_class()
{
    string code = @"public class Foo { public int method(){ return 0;}}";
    AST ast = compiler.Parse(code);
}
What would you assert on?
Manually defining an AST that represents the given code, and asserting that the generated AST conforms to it, seems horribly cumbersome and might even be error prone.
So what are the best tactics for TDD'ing complex scenarios like this?

Firstly, if you test a compiler, you cannot have enough tests! Users rely on compiler-generated output as if it were an always-correct gold standard, so be acutely aware of quality. If you can, run every test you can come up with!
Secondly, use all testing methods available, but use each where appropriate. Indeed, you may be able to mathematically prove that a certain transformation is correct. If you are able to do so, you should.
But every compiler that I've seen the internals of involves heuristics and a lot of optimized, hand-crafted code; thus assisted proving methods typically are no longer applicable. Here testing comes into play, and I mean a lot of it!
When collecting tests, please consider different cases:
Positive Standard-Conformance: your frontend should accept certain code patterns, and the compiler must produce a correctly running program from them. Tests in this category need either a golden-reference compiler or generator that produces the correct output of the test program, or hand-written programs that include a check against values furnished by human reasoning.
Negative Tests: every compiler has to reject faulty code, like syntax errors, type mismatches, et cetera. It must produce certain types of error and warning messages. I don't know of any method to auto-generate such tests, so these need to be human-written, too (see the sketch after this list).
Transformation tests: whenever you come up with a fancy optimization within your compiler (middle-end), you probably have some example code in mind that demonstrates the optimization. Be aware of transformations before and after such a module; they might need special options to your compiler, or a bare-bones compiler with just that module plugged in. Test a reasonably big set of surrounding module combinations, too. I usually did regression testing on the intermediate representation before and after the specific transformation, defining a reference by intensive reasoning with colleagues. Try to write code on both sides of the transformation, i.e. code snippets that you want to have transformed and slightly different ones that must not be.
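To illustrate the negative category mentioned above, a test might look like the following minimal sketch; the Compiler entry point, the CompilationResult type, and the TYPE_MISMATCH error code are hypothetical stand-ins for whatever diagnostics API your frontend exposes:

using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class NegativeTests
{
    [TestMethod]
    public void Rejects_assignment_of_string_to_int()
    {
        string code = @"public class Foo { int x = ""oops""; }";

        // Hypothetical API: Compile returns diagnostics instead of throwing,
        // so the test can assert on them directly.
        CompilationResult result = Compiler.Compile(code);

        Assert.IsFalse(result.Succeeded);
        // Assert on a stable error code rather than the message text,
        // so that rewording a diagnostic does not break the test.
        Assert.IsTrue(result.Errors.Any(e => e.Code == "TYPE_MISMATCH"));
    }
}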
Now, this sounds like an awful lot of work! Yes, it does, but there is help: there are several commercial test suites for (C) compilers out in the world, and experts who might help you apply them. Here is a small list of those known to me:
ACE: SuperTest Compiler Test and Validation Suite
Perennial C Compiler Validation Suite
Bugseng: ECLAIR
Tests in the GCC and LLVM environment
you name it! Or google "Compiler Test Suite"

First of all, parsing is usually a trivial part of a compiler project. In my experience it never takes more than 10% of the time (unless we are talking about C++, but you wouldn't be asking questions here if you were designing that), so you'd rather not invest much of your time in parser tests.
Still, TDD (or however you call it) has its share in developing the middle-end, where you often want to verify that, e.g., the optimization you've just added actually resulted in the expected code transformation. In my experience, tests like this are usually implemented by feeding the compiler specially crafted test programs and grepping the output assembly for expected patterns (was this loop unrolled four times? did we manage to avoid memory writes in this function? etc.), as sketched below. Grepping assembly isn't as good as analyzing a structured representation (S-exprs or XML), but it's cheap and works fine in most cases. It gets awfully hard to maintain as your compiler grows, though.
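For illustration, such a test might be driven from a unit-test harness like this. This is a sketch only: TestCompiler.CompileToAssembly is a hypothetical helper that runs your compiler and returns the generated assembly as a string, and the branch pattern assumes x86-style local labels:

using System.Text.RegularExpressions;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class MiddleEndTests
{
    [TestMethod]
    public void Constant_trip_count_loop_is_fully_unrolled()
    {
        // Compile a crafted program with optimizations on and capture the assembly.
        string asm = TestCompiler.CompileToAssembly(
            "int sum() { int s = 0; for (int i = 0; i < 4; i++) s += i; return s; }",
            "-O2");

        // If the loop was fully unrolled, no backward branch to a local label remains.
        Assert.IsFalse(Regex.IsMatch(asm, @"\bjmp\s+\.L\d+"),
            "expected the loop to be unrolled, but found a loop branch");
    }
}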

Related

Program written in generated code based on unit tests

As I was doing test-driven development, I pondered whether a hypothetical program could be developed entirely from code generated to pass tests, i.e. could there be a generator that creates code specifically to pass the tests? Would the future of programming languages then just be writing tests?
I think this would be a tough one as, at least for the initial generations of such technology, developers would be very skeptical of generated code's correctness. So human review would have to be involved as well.
As a simple illustration of what I mean, suppose you write 10 tests for a function, with sample inputs and expected outputs covering every scenario you can think of. A program could trivially generate code which passed all of these tests with nothing more than a rudimentary switch statement (your ten inputs matched to their expected outputs). This code would obviously not be correct, but it would take a human to see that.
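To make that concrete, here is the kind of code such a generator could emit for a square function specified only by example-based tests (an illustrative sketch):

// Suppose the tests were: Square(0) == 0, Square(2) == 4, Square(3) == 9.
// A naive "make the tests pass" generator can simply memorize the samples:
public static int Square(int x)
{
    switch (x)
    {
        case 0: return 0;
        case 2: return 4;
        case 3: return 9;
        default: throw new System.ArgumentException("input not covered by any test");
    }
}
// Every test passes, yet Square(4) blows up: passing tests is not correctness.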
That's just a simple example. It isn't hard to imagine more sophisticated programs which might not generate a switch statement but still produce solutions that aren't actually correct, and which could be wrong in much more subtle ways. Hence my suggestion that any technology along these lines would be met with a deep level of skepticism, at least at first.
If code can be generated completely, then the basis of the generator would have to be a specification that exactly describes the code. The generator would then be something like a compiler that cross-compiles one language into another.
Tests are not such a language. They only assert that a specific aspect of the code functionality is valid and unchanged. By doing so they scaffold the code so that it does not break, even when it is refactored.
But how would I compare these two ways of development?
1) If the generator works correctly, then the specification is always transferred into correct code. I postulate that this code is tested by design and needs no additional tests. Better to TDD the generator than the generated code.
2) Having a specification that leads to generated code, and having specifications expressed as tests that ensure the code works, are quite equivalent in my eyes.
3) You can combine both ways of development. Generate a program framework with a tested generator from a specification, and then enrich the generated code using TDD. Attention: you then have two different development cycles running in one project. That means you have to ensure that you can always regenerate the generated code when specifications change, and that your additional code still fits correctly into the generated code.
Just one small example: imagine a tool that can generate code from a UML class diagram. This could be done in a way that lets you develop the methods with TDD, while the structure of the classes is defined in UML and would not need to be tested again.
While it's possible that, sometime in the future, simple tests could be used to generate code:
assertEquals(someclass.get_value(), true)
getting the correct output from a black-box integration test is, I would guess, an NP-complete problem:
assertEquals(someclass.do_something(1), file_content("/some/file"))
assertEquals(someclass.do_something(2), file_content("/some/file"))
assertEquals(someclass.do_something(2), file_content("/some/file2"))
assertEquals(someclass.do_something(3), file_content("/some/file2"))
Does this mean that the resulting code will always write to /some/file? Does it mean that the resulting code should always write to /some/file2? Either could be true. What if it only needs to do the minimal amount of work to get the tests to pass? Without knowing the context, and without very exact and bounding tests, no code generator could figure out (at this point in time) what the test author intended.

How to write good Unit Tests in Functional Programming

I'm using functions instead of classes, and I find that I can't tell whether a function that another function relies on is a dependency that should be individually unit-tested, or an internal implementation detail that should not be. How can you tell which it is?
A little context: I'm writing a very simple Lisp interpreter which has an eval() function. It's going to have a lot of responsibilities (too many, actually), such as evaluating symbols differently than lists (everything else evaluates to itself). When evaluating symbols, it has its own complex workflow (environment lookup), and when evaluating lists it's even more complicated, since the list can be a macro, function, or special form, each of which has its own complex workflow and set of responsibilities.
I can't tell if my eval_symbol() and eval_list() functions should be considered internal implementation details of eval() which should be tested through eval()'s own unit tests, or genuine dependencies in their own right which should be unit-tested independently of eval()'s unit tests.
A significant motivation for the "unit test" concept is to control the combinatorial explosion of required test cases. Let's look at the examples of eval, eval_symbol and eval_list.
In the case of eval_symbol, we will want to test contingencies where the symbol's binding is:
missing (i.e. the symbol is unbound)
in the global environment
directly within the current environment
inherited from a containing environment
shadowing another binding
... and so on
In the case of eval_list, we will want to test (among other things) what happens when the list's function position contains a symbol with:
no function or macro binding
a function binding
a macro binding
eval_list will invoke eval_symbol whenever it needs a symbol's binding (assuming a LISP-1, that is). Let's say that there are S test cases for eval_symbol and L symbol-related test cases for eval_list. If we test each of these functions separately, we could get away with roughly S + L symbol-related test cases. However, if we wish to treat eval_list as a black box and to test it exhaustively without any knowledge that it uses eval_symbol internally, then we are faced with S x L symbol-related test cases (e.g. global function binding, global macro binding, local function binding, local macro binding, inherited function binding, inherited macro binding, and so on); with S = 5 and L = 3, that is 15 combined cases instead of 8 separate ones. eval is even worse: as a black box the number of combinations can become incredibly large -- hence the term combinatorial explosion.
So, we are faced with a choice of theoretical purity versus actual practicality. There is no doubt that a comprehensive set of test cases that exercises only the "public API" (in this case, eval) gives the greatest confidence that there are no bugs. After all, by exercising every possible combination we may turn up subtle integration bugs. However, the number of such combinations may be so prohibitively large as to preclude such testing. Not to mention that the programmer will probably make mistakes (or go insane) reviewing vast numbers of test cases that only differ in subtle ways. By unit-testing the smaller internal components, one can vastly reduce the number of required test cases while still retaining a high level of confidence in the results -- a practical solution.
So, I think the guideline for identifying the granularity of unit testing is this: if the number of test cases is uncomfortably large, start looking for smaller units to test.
In the case at hand, I would absolutely advocate testing eval, eval-list and eval-symbol as separate units precisely because of the combinatorial explosion. When writing the tests for eval-list, you can rely upon eval-symbol being rock solid and confine your attention to the functionality that eval-list adds in its own right. There are likely other testable units within eval-list as well, such as eval-function, eval-macro, eval-lambda, eval-arglist and so on.
My advice is quite simple: "Start somewhere!"
If you see the name of some def (or defun) that looks like it might be fragile, well, you probably want to test it, don't you?
If you're having some trouble trying to figure out how your client code can interface with some other code unit, well, you probably want to write some tests somewhere that let you create examples of how to properly use that function.
If some function seems sensitive to data values, well, you might want to write some tests that not only verify it can handle any reasonable inputs properly, but also specifically exercise boundary conditions and odd or unusual data inputs.
Whatever seems bug-prone should have tests.
Whatever seems unclear should have tests.
Whatever seems complicated should have tests.
Whatever seems important should have tests.
Later, you can go about increasing your coverage to 100%. But you'll find that you will probably get 80% of your real results from the first 20% of your unit test coding (Inverted "Law of the Critical Few").
So, to review the main point of my humble approach, "Start somewhere!"
Regarding the last part of your question, I would recommend you think about any possible recursion or any additional possible reuse by "client" functions that you or subsequent developers might create in the future that would also call eval_symbol() or eval_list().
Regarding recursion, the functional programming style uses it a lot, and it can be difficult to get right, especially for those of us who come from procedural or object-oriented programming, where recursion is rarely encountered. The best way to get recursion right is to target any recursive features precisely with unit tests, to make certain all possible recursive use cases are validated.
Regarding reuse, if your functions are likely to be invoked by anything other than a single use by your eval() function, they should probably be treated as genuine dependencies that deserve independent unit tests.
As a final hint, the term "unit" has a technical definition in the domain of unit testing: "the smallest piece of software that can be tested in isolation". That is a very old, fundamental definition that may quickly clarify your situation for you.
This is somewhat orthogonal to the content of your question, but directly addresses the question posed in the title.
Idiomatic functional programming involves mostly side effect-free pieces of code, which makes unit testing easier in general. Defining a unit test typically involves asserting a logical property about the function under test, rather than building large amounts of fragile scaffolding just to establish a suitable test environment.
As an example, let's say we're testing extendEnv and lookupEnv functions as part of an interpreter. A good unit test for these functions would check that if we extend an environment twice with the same variable bound to different values, only the most recent value is returned by lookupEnv.
In Haskell, a test for this property might look like:
test =
  let env = extendEnv "x" 5 (extendEnv "x" 6 emptyEnv)
  in  lookupEnv env "x" == Just 5
This test gives us some assurance, and doesn't require any setup or teardown other than creating the env value that we're interested in testing. However, the values under test are very specific. This only tests one particular environment, so a subtle bug could easily slip by. We'd rather make a more general statement: for all variables x and values v and w, an environment env extended with x bound to v after x is bound to w satisfies lookupEnv env x == Just v.
In general, we need a formal proof (perhaps mechanized with a proof assistant like Coq, Agda, or Isabelle) in order to show that a property like this holds. However, we can get much closer than specifying test values by using QuickCheck, a library available for most functional languages that generates large amounts of arbitrary test input for properties we define as boolean functions:
prop_test x v w env' =
  let env = extendEnv x v (extendEnv x w env')
  in  lookupEnv env x == Just v
At the prompt, we can have QuickCheck generate arbitrary inputs to this function, and see whether it remains true for all of them:
*Main> quickCheck prop_test
+++ OK, passed 100 tests.
*Main> quickCheckWith (stdArgs { maxSuccess = 1000 }) prop_test
+++ OK, passed 1000 tests.
QuickCheck uses some very nice (and extensible) magic to produce these arbitrary values, but it's functional programming that makes having those values useful. By making side effects the exception (sorry) rather than the rule, unit testing becomes less of a task of manually specifying test cases, and more a matter of asserting generalized properties about the behavior of your functions.
This process will surprise you frequently. Reasoning at this level gives your mind extra chances to notice flaws in your design, making it more likely that you'll catch errors before you even run your code.
I'm not really aware of any particular rule of thumb for this. But it seems like you should be asking yourself two questions:
Can you define the purpose of eval_symbol and eval_list without needing to say "part of the implementation of eval"?
If you see a test fail for eval, would it be useful to see whether any tests for eval_symbol and eval_list also fail?
If the answer to either of those is yes, I would test them separately.
A few months ago I wrote a simple "almost Lisp" interpreter in Python for an assignment. I designed it using the Interpreter design pattern and unit tested the evaluation code. Then I added the printing and parsing code, and transformed the test fixtures from an abstract syntax representation (objects) to concrete syntax strings. Part of the assignment was to program simple recursive list-processing functions, so I added them as functional tests.
To answer your question in general, the rules are pretty much the same as for OO. You should have all your public functions covered. In OO, public methods are part of a class or an interface; in functional programming you most often have visibility control based around modules (similar to interfaces). Ideally, you would have full coverage for all functions, but if this isn't possible, consider the TDD approach: start by writing tests for what you know you need, and implement it. Auxiliary functions will be the result of refactoring, and since you wrote tests for everything important beforehand, if the tests still pass after refactoring, you are done and can write the next test (iterate).
Good luck!

Effective testing of multiple students programming solutions

As part of my lecture in C++, the students have to solve assignments. Each solution shall implement the same functions with the same functionality and the same signature (function name, return value, parameters). Only the code inside differs.
So I'm thinking about a way to test all solutions (around 30) in an effective way. Maybe the best way is to write a unit test, plus a shell script (or something similar) that compiles each solution against the unit test and runs it.
But maybe there is a different and much better solution to this problem.
The reason why unit tests are one of the most efficient types of automated testing is that the investment is relatively small compared to the return (and compared to other types of testing), so it makes perfect sense to me to write a verification suite of tests.
You might even go so far as to give the students the test suite instead of a specification written in prose. This could introduce them to the concept of Test-Driven Development (although we normally tend to write the tests iteratively, and not in batches).
Yes, unit tests are the obvious solution for most cases.
Compiler warnings and static analysis are also useful.
Timing the program's execution given a set of parameters is another fairly automated option; it depends on what you are interested in evaluating.
Creating base classes with good diagnostics (whose implementation you can swap out for your evaluation, if you prefer) is another option. You can also provide interfaces the students must use, and hold two implementations; then exercise the programs as usual using the diagnostic implementation, as the sketch below shows. It depends on what you are looking for.
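As a sketch of that last idea, in C# for brevity (the same shape works in C++): students program against an interface you hand out, and for grading you wrap their objects in a diagnostic implementation that records usage. All names here are illustrative:

public interface IStack<T>
{
    void Push(T item);
    T Pop();
}

// Diagnostic decorator: forwards to the student's implementation
// while counting calls, so the evaluation can inspect usage afterwards.
public class CountingStack<T> : IStack<T>
{
    private readonly IStack<T> inner;
    public int PushCount { get; private set; }
    public int PopCount { get; private set; }

    public CountingStack(IStack<T> inner) { this.inner = inner; }

    public void Push(T item) { PushCount++; inner.Push(item); }
    public T Pop() { PopCount++; return inner.Pop(); }
}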
Maybe I'm missing something obvious, but wouldn't it be sufficient just to run the code several times with boundary-testing parameter values?

Unit testing for a compiler output

As part of a university project, we have to write a compiler for a toy language. In order to do some testing for this, I was considering how best to go about writing something like unit tests. As the compiler is being written in Haskell, HUnit and QuickCheck are both available, but perhaps not quite appropriate.
How can we do any kind of non-manual testing?
The only idea I've had is to effectively compile to Haskell too, see what the output is, and use a shell script to compare this to the output of the compiled program. This is quite a bit of work, and isn't very elegant either.
The unit testing is to help us, and isn't part of the assessed work itself.
This really depends on what parts of the compiler you are writing. It is nice if you can keep phases distinct to help isolate problems, but, in any phase, and even at the integration level, it is perfectly reasonable to have unit tests that consist of pairs of source code and hand-compiled code. You can start with the simplest legal programs possible, and ensure that your compiler outputs the same thing that you would if compiling by hand.
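In practice that can be as simple as a table of (source, hand-compiled output) pairs driven by one test loop. A minimal sketch, where Compiler.Compile and the toy stack-machine output are invented for illustration:

using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class GoldenPairTests
{
    // Pairs of source program and hand-compiled expected output.
    // Start from the simplest legal programs and grow the table.
    private static readonly (string Source, string Expected)[] Cases =
    {
        ("main = 0",     "push 0\nhalt\n"),
        ("main = 1 + 2", "push 1\npush 2\nadd\nhalt\n"),
    };

    [TestMethod]
    public void Output_matches_hand_compiled_code()
    {
        foreach (var (source, expected) in Cases)
            Assert.AreEqual(expected, Compiler.Compile(source), "failed for: " + source);
    }
}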
As complexity increases, and hand-compiling becomes unwieldy, it is helpful for the compiler to keep some kind of log of what it has done. Then you can consult this log to determine whether or not specific transformations or optimizations fired for a given source program.
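A test against such a log could then be another test method in the same vein; the log format and the CompileWithLog entry point are assumptions, not a real API:

[TestMethod]
public void Loop_unrolling_fires_for_constant_trip_count()
{
    // Ask the compiler to record which transformations it applied.
    var result = Compiler.CompileWithLog("for i in 0 .. 3 { f(i); }");

    // The test consults the log instead of decoding generated machine code.
    Assert.IsTrue(result.TransformationLog.Contains("loop-unroll"),
        "expected the loop-unroll pass to fire for this program");
}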
Depending on your language, you might consider a generator of random programs from a collection of program fragments (in the QuickCheck vein). This generator can test your compiler's stability, and ability to deal with potentially unforeseen inputs.
Unit tests should test small pieces of code, typically one class or one function. The lexical and semantic analyses should each have their own unit tests, and the intermediate representation generator should also have its own tests.
A unit test covers a simple test case: it invokes the function to be tested in a controlled environment and verifies (asserts) the result of the function's execution. A unit test usually tests one behavior only and has the following structure, called AAA (sketched in the example after this list):
Arrange: create the environment the function will be called in
Act: invoke the function
Assert: verify the result
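For example, an AAA-structured test for a compiler's lexer might look like this sketch; Lexer, Token, and TokenKind are illustrative names rather than a given API:

[TestMethod]
public void Lexer_recognizes_an_integer_literal()
{
    // Arrange: create the environment the function will be called in.
    var lexer = new Lexer("42");

    // Act: invoke the function.
    Token token = lexer.NextToken();

    // Assert: verify the result.
    Assert.AreEqual(TokenKind.IntegerLiteral, token.Kind);
    Assert.AreEqual("42", token.Text);
}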
Have a look at shelltestrunner. Here are some example tests. It is also being used in this compiler project.
One option is the approach this guy used to test real compilers: get together with as many people as you can talk into it, each of you compiles and runs the same set of programs, and then you compare the outputs. Be sure to add every test case you use, as more inputs make the process more effective. A little fun with automation and source control and you can make it fairly easy to maintain.
Be sure to get it OKed by the prof first, but as you will only be sharing test cases and outputs, I don't see where he would have much room to object.
Testing becomes more difficult once the output of your program goes to the console (such as standard output). Then you have to resort to some external tool, like grep or expect, to check the output.
Keep the return values from your functions in data structures for as long as possible. If the output of your compiler is, say, assembly code, build a string in memory (or a list of strings) and output it at the last possible moment. That way you can test the contents of the strings more directly and quickly.
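A sketch of the idea (illustrative names, not a prescribed design):

using System.Collections.Generic;
using System.IO;

// Emit assembly into a list of strings instead of printing it,
// so tests can assert on individual instructions directly.
public class CodeEmitter
{
    public List<string> Instructions { get; } = new List<string>();

    public void Emit(string instruction)
    {
        Instructions.Add(instruction);
    }

    // Only at the last possible moment does anything hit the console or a file.
    public void WriteTo(TextWriter output)
    {
        foreach (string line in Instructions)
            output.WriteLine(line);
    }
}

// In a test:
//   var emitter = new CodeEmitter();
//   compiler.EmitProgram(ast, emitter);
//   CollectionAssert.Contains(emitter.Instructions, "ret");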

Help with TDD approach to a real world problem: linker

I'm trying to learn TDD. I've seen examples and discussions about how it's easy to TDD a coffee vending machine's firmware from the smallest possible functionality up. These examples are either primitive or very well thought out; it's hard to tell right away. But here's a real world problem.
Linker.
A linker, at its simplest, reads one object file, does magic, and writes one executable file. I don't think I can simplify it further. I do believe the linker design may be evolved, but I have absolutely no idea where to start. Any ideas on how to approach this?
Well, probably the whole linker is too big a problem for the first unit test. I can envision some rough structure beforehand. What a linker does is:
Represents an object file as a collection of segments. Segments contain code, data, symbol definitions and references, debug information etc.
Builds a reference graph and decides which segments to keep.
Packs remaining segments into a contiguous address space according to some rules.
Relocates references.
My main problem is with bullet 1. Bullets 2, 3, and 4 basically take a regular data structure and convert it into a platform-dependent mess based on some configuration. I can design that, and the design looks feasible. But bullet 1 has to take a platform-dependent mess, in one of several supported formats, and convert it into a regular structure.
The task looks generic enough. It comes up everywhere you need to support multiple input formats, be it image processing, document processing, you name it. Is it possible to TDD it? It seems like either the test is too simple and I can easily hack it to green, or it's a bit more complex and I need to implement a whole object/image/document format reader, which is a lot of code. And there is no middle ground.
First, have a look at "Growing Object-Oriented Software, Guided by Tests" by Freeman & Pryce.
Now, my attempt to answer a difficult question in a few lines.
TDD does require you to think (i.e. design) about what you're going to do. You have to think in small steps. Very small steps. Then:
1. Write a short test to prove that the next small piece of behaviour works.
2. Run the test to show that it fails.
3. Do the simplest thing possible to get the test to pass.
4. Refactor ruthlessly to remove duplication and improve the structure of the code.
5. Run the test(s) again to make sure it all still works.
6. Go back to 1.
An initial idea (design) of how your linker might be structured will guide your initial tests. The tests will enforce a modular design (because each test is only testing a single behaviour, and there should be minimal dependencies on other code you've written).
As you proceed you may find your ideas change. The tests you've already written will allow you to refactor with confidence.
The tests should be simple. It is easy to 'hack' a single test to green. But after each 'hack' you refactor. If you see the need for a new class or algorithm during the refactoring, then write tests to drive out its interface. Make sure that the tests only ever test a single behaviour by keeping your modules loosely coupled (dependency injection, abstract base classes, interfaces, function pointers etc.) and use fakes, stubs and mocks to isolate the code under test from the rest of your system.
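As a small sketch of that isolation in the linker context (all names are illustrative): the layer that packs segments depends on an abstraction, so its tests can hand it a fake reader instead of a real object-file parser:

using System.Collections.Generic;

public record Segment(string Name, byte[] Data);

// The packing layer depends on this abstraction, not on a concrete
// ELF/COFF parser, so it can be tested in isolation.
public interface IObjectFileReader
{
    IReadOnlyList<Segment> ReadSegments(byte[] objectFile);
}

// A hand-rolled fake for tests: returns whatever segments the test supplies.
public class FakeObjectFileReader : IObjectFileReader
{
    private readonly IReadOnlyList<Segment> segments;

    public FakeObjectFileReader(params Segment[] segments)
    {
        this.segments = segments;
    }

    public IReadOnlyList<Segment> ReadSegments(byte[] objectFile) => segments;
}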
Finally use 'customer' tests to ensure that you have delivered functional features.
It's a difficult change in mind-set, but a lot of fun and very rewarding. Honest.
You're right, a linker seems a bit bigger than a 'unit' to me, and TDD does not excuse you from sitting down and thinking about how you're going to break down your problem into units. The Sudoku saga is a good illustration of what goes wrong if you don't think first!
Concentrating on your point 1, you have already described a good collection of units (of functionality) by listing the kinds of things that can appear in segments, and hinting that you need to support multiple formats. Why not start by dealing with a simple case like, say, a file containing just a data segment in the binary format of your development platform? You could simply hard-code the file as a binary array in your test, and then check that it interprets just that correctly. Then pick another simple case, and test for that. Keep going.
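A sketch of such a test, with an invented object-file layout (a 4-byte magic number, then one segment as a tag byte, a length byte, and the payload) and a hypothetical ObjectFileReader entry point:

[TestMethod]
public void Reads_a_single_data_segment()
{
    // A hand-crafted object file hard-coded as a binary array.
    byte[] objectFile =
    {
        0x7F, 0x4F, 0x42, 0x4A,   // magic number
        0x01,                     // segment tag: data
        0x03,                     // payload length
        0xDE, 0xAD, 0xBF          // payload bytes
    };

    var segments = ObjectFileReader.Read(objectFile);

    Assert.AreEqual(1, segments.Count);
    Assert.AreEqual("data", segments[0].Name);
    CollectionAssert.AreEqual(new byte[] { 0xDE, 0xAD, 0xBF }, segments[0].Data);
}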
Now the magic bit is that pretty soon you'll see repeated structures in your code and in your tests, and because you've got tests you can be quite aggressive about refactoring it away. I suspect this is the bit that you haven't experienced yet, because you say "It seems like either test is too simple and I easily hack it to green, or it's a bit more complex and I need to implement the whole object/image/document format reader which is a lot of code. And there is no middle ground." The point is that you should hack them all to green, but as you're doing that you are also searching out the patterns in your hacks.
I wrote a (very simple) compiler in this fashion, and it mostly worked quite well. For each syntactic construction, I wrote the smallest program that I could think of which used it in some observable way, and had the test compile the program and check that it worked as expected. I used a proper parser generator as you can't plausibly TDD your way into one of them (you need to use a little forethought!) After about three cycles, it became obvious that I was repeating the code to walk the syntax tree, so that was refactored into something like a Visitor.
I also had larger-scale acceptance tests, but in the end I don't think these caught much that the unit tests didn't.
This is all very possible.
A sample off the top of my head is NHaml.
It is an ASP.NET view engine that compiles plain-text templates into .NET code.
You can have a look at the source code and see how it is tested.
I guess what I do is come up with layers and blocks, subdivide to the point where I might be thinking about code, and then start writing tests.
I think your tests should be quite simple: it's not the individual tests that are the power of TDD but the sum of the tests.
One of the principles I follow is that a method should fit on a screen - when that's the case, the tests are usually simple enough.
Your design should allow you to mock out lower layers so that you're only testing one layer.
TDD is about specification, not testing.
From your simplest spec of a linker, your first TDD test just has to check whether an executable file has been created by the linker magic when you feed it an object file.
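In code, that first test could be as small as this sketch (Linker.Link and the file names are assumptions):

using System.IO;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class LinkerSpecTests
{
    [TestMethod]
    public void Linking_an_object_file_produces_an_executable()
    {
        var linker = new Linker();

        linker.Link("input.o", "a.out");

        Assert.IsTrue(File.Exists("a.out"));
    }
}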
Then you write a linker that makes your test succeed, e.g.:
check whether the input file is an object file
if so, generate a "Hello World!" executable (note that your spec didn't specify that different object files would produce different executables)
Then you refine your spec and your tests (these are your four bullets).
As long as you can write a specification, you can write TDD test cases.