As I was doing test-driven development, I pondered whether a hypothetical program could be developed entirely from code generated to satisfy tests, i.e. could there be a generator that creates code specifically to pass the tests? Would the future of programming languages then just be writing tests?
I think this would be a tough one as, at least for the initial generations of such technology, developers would be very skeptical of generated code's correctness. So human review would have to be involved as well.
As a simple illustration of what I mean, suppose you write 10 tests for a function, with sample inputs and expected outputs covering every scenario you can think of. A program could trivially generate code which passed all of these tests with nothing more than a rudimentary switch statement (your ten inputs matched to their expected outputs). This code would obviously not be correct, but it would take a human to see that.
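To make that illustration concrete, here is a hedged sketch (the Add function and its tests are invented for this example): a naive generator could make all three tests pass with nothing but a lookup over the sampled inputs.
// Hypothetical tests written first for a function we want generated:
[TestMethod]
public void Add_returns_expected_sums()
{
    Assert.AreEqual(3, Generated.Add(1, 2));
    Assert.AreEqual(0, Generated.Add(0, 0));
    Assert.AreEqual(10, Generated.Add(7, 3));
}

// What a naive generator could emit to turn those tests green:
// a lookup over the sampled inputs, not a general addition.
public static class Generated
{
    public static int Add(int a, int b)
    {
        switch ((a, b))
        {
            case (1, 2): return 3;
            case (0, 0): return 0;
            case (7, 3): return 10;
            default: throw new System.NotImplementedException("input not covered by a test");
        }
    }
}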
That's just a simple example. It isn't hard to imagine more sophisticated programs which might not generate a switch statement but still produce solutions that aren't actually correct, and which could be wrong in much more subtle ways. Hence my suggestion that any technology along these lines would be met with a deep level of skepticism, at least at first.
If code can be generated completely, then the basis of the generator would have to be a specification that exactly describes the code. The generator would then be something like a compiler that cross-compiles one language into another.
Tests are not such a language. They only assert that a specific aspect of the code functionality is valid and unchanged. By doing so they scaffold the code so that it does not break, even when it is refactored.
But how would I compare these two ways of development?
1) If the generator works correctly, then the specification is always transformed into correct code. I postulate that this code is tested by design and needs no additional tests. It is better to TDD the generator than the generated code.
2) A specification that leads to generated code and a specification expressed as tests that ensure the code works are, in my eyes, essentially equivalent.
3) You can combine both ways of development. Generate a program framework with a tested generator from a specification and then enrich the generated code by using TDD. Attention: you then have two different development cycles running in one project. That means you have to ensure that you can always regenerate the generated code when specifications change and that your additional code still fits correctly into the generated code.
Just one small example: imagine a tool that can generate code from a UML class diagram. This could be done in a way that lets you develop the methods with TDD, while the structure of the classes is defined in the UML and would not need to be tested again.
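As a hedged sketch of that split (Account, Deposit and the test are invented; assume a UML-to-code generator produced the class skeleton): the structure comes from the generator, the method body is then filled in with TDD.
// Generated from the UML class diagram: class name, property and method signature.
// The structure needs no further tests if the generator itself is tested.
public class Account
{
    public decimal Balance { get; private set; }

    // Signature defined in UML; the body below is what you develop with TDD.
    public void Deposit(decimal amount)
    {
        if (amount <= 0)
            throw new System.ArgumentOutOfRangeException(nameof(amount));
        Balance += amount;
    }
}

// Hand-written test that drives the method body, not the generated structure.
[TestMethod]
public void Deposit_increases_balance()
{
    var account = new Account();
    account.Deposit(50m);
    Assert.AreEqual(50m, account.Balance);
}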
While it is possible that, sometime in the future, simple tests could be used to generate code:
assertEquals(someclass.get_value(), true)
but getting the correct output from a black-box integration test is, I would guess, an NP-complete problem:
assertEquals(someclass.do_something(1), file_content(/some/file))
assertEquals(someclass.do_something(2), file_content(/some/file))
assertEquals(someclass.do_something(2), file_content(/some/file2))
assertEquals(someclass.do_something(3), file_content(/some/file2))
Does this mean that the resulting code will always write to /some/file? Does it mean that the resulting code should always write to /some/file2? Either could be true. What if it needs to only do the minimal set to get the tests to pass? Without knowing the context and writing very exact and bounding tests, no code could figure out (at this point in time) what the test author intended.
The question might seem a bit weird, but I'll explain it.
Consider the following:
We have a service FirstNameValidator, which I created so that other developers have a consistent way to validate a person's first name. I want to test it, but because the full set of possible inputs is infinite (or very, very big), I only test a few cases:
Assert.IsTrue(FirstNameValidator.Validate("John"))
Assert.IsFalse(FirstNameValidator.Validate("$$123"))
I also have LastNameValidator, which is 99% identical, and I wrote a test for it too:
Assert.IsTrue(LastNameValidator.Validate("Doe"))
Assert.IsFalse(LastNameValidator.Validate("__%%"))
But later a new structure appeared: PersonName, which consists of a first name and a last name. We want to validate it too, so I create a PersonNameValidator. Obviously, for reusability I just call FirstNameValidator and LastNameValidator. Everything is fine until I want to write a test for it.
What should i test?
The fact that FirstNameValidator.Validate was actually called with the correct argument?
Or do I need to create a few cases and test them?
That is actually the question: should we test what a service is expected to do? It is expected to validate PersonName; how it does it we don't actually care. So we pass a few valid and invalid inputs and expect the corresponding return values.
Or, maybe, what it actually does? It actually just calls the other validators, so test that (a .NET mocking framework allows it).
Unit tests should be acceptance criteria for a properly functioning unit of code...
they should test what the code should and shouldn't do; you will often find corner cases while you are writing tests.
If you refactor code, you often will have to refactor tests... This should be viewed as part of the original effort, and should bring glee to your soul as you have made the product and process an improvement of such magnitude.
Of course, if this is a library with outside (or internal, depending on company culture) consumers, you have documentation to consider before you are completely done.
Edit: also, those tests are pretty weak; you should have a definition of what is legal in each, and actually test inclusion and exclusion of at least all of the classes of glyphs... They can still use related code for testing, e.g. isValidUsername(name, allowsSpace) could work for both a first name and a whole name, depending on whether spaces are allowed.
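A sketch of that idea, with the helper and its rules invented for illustration: one shared validator parameterised by whether spaces are allowed, exercised for inclusion and exclusion of the glyph classes.
public static class NameRules
{
    // Shared by a first-name validator and a whole-name validator; allowSpace differs.
    public static bool IsValidName(string name, bool allowSpace)
    {
        if (string.IsNullOrEmpty(name)) return false;
        foreach (var ch in name)
        {
            if (char.IsLetter(ch)) continue;
            if (ch == '-' || ch == '\'') continue;        // assumed legal punctuation
            if (allowSpace && ch == ' ') continue;
            return false;                                  // digits, symbols, everything else
        }
        return true;
    }
}

[TestMethod] public void Accepts_plain_letters()        { Assert.IsTrue(NameRules.IsValidName("John", allowSpace: false)); }
[TestMethod] public void Accepts_space_in_whole_name()  { Assert.IsTrue(NameRules.IsValidName("John Doe", allowSpace: true)); }
[TestMethod] public void Rejects_space_in_first_name()  { Assert.IsFalse(NameRules.IsValidName("John Doe", allowSpace: false)); }
[TestMethod] public void Rejects_digits_and_symbols()   { Assert.IsFalse(NameRules.IsValidName("$$123", allowSpace: false)); }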
You have formulated your question a bit strangely: both options that you describe would test that the function behaves as it should, but each on a different level of granularity. In one case you would test the behaviour based on the API that is available to a user of the function; whether and how the function implements its functionality with the help of other functions/components is not relevant. In the second case you test the behaviour in isolation, including the way the function interacts with its depended-on components.
On a general level it is not possible to say which is better; depending on the circumstances, each option may be the best. In general, isolating a piece of software usually requires more effort to implement the tests and makes the tests more fragile against implementation changes. That means going for isolation should only be done in situations where there are good reasons for it. Before getting to your specific case, I will describe some situations where isolation is recommendable.
With the original depended-on component (DOC), you may not be able to test everything you want. Assume your code does error handling for the case where the DOC returns an error code. But if the DOC cannot easily be made to return an error code, it is difficult to test your error handling. In this case, if you replace the DOC with a double, you can make the double return an error code and thus also test your error handling (a sketch of this appears after this list of cases).
The DOC may have non-deterministic or system-specific behaviour. Some examples are random number generators or date and time functions. If this makes testing your functions difficult, it would be an argument to replace the DOC with some double, so you have control over what is provided to your functions.
The DOC may require a very complex setup. Imagine a complex database or some complex XML document that needs to be provided. For one thing, this can make your setup quite complicated; for another, your tests get fragile and will likely break if the DOC changes (think of the XML schema changing...).
The setup of the DOC or the calls to the DOC are very time-consuming (imagine reading data from a network connection, computing the next chess move, solving the TSP, ...). Or the use of the DOC prolongs compilation or linking significantly. With a double you can possibly shorten the execution or build time significantly, which becomes more valuable the more often you execute or build the tests.
You may not have a working version of the DOC - possibly the DOC is still under development and is not yet available. Then, with doubles you can start testing nevertheless.
The DOC may be immature, such that with the version you have, your tests are unstable. In such a case it is likely that you lose trust in your test results and start ignoring failing tests.
The DOC itself may have other dependencies which have some of the problems described above.
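A minimal sketch of the first case above, with made-up names (IStorage, Exporter): the DOC sits behind an interface, and a hand-written double forces the error path that the real DOC rarely takes.
public interface IStorage                 // the depended-on component (DOC)
{
    int Write(string data);               // returns an error code, 0 = success
}

public class Exporter                     // the code under test
{
    private readonly IStorage _storage;
    public Exporter(IStorage storage) { _storage = storage; }

    public bool Export(string data)
    {
        // the error handling we want to exercise: report failure instead of throwing
        return _storage.Write(data) == 0;
    }
}

public class FailingStorage : IStorage    // the test double
{
    public int Write(string data) { return -1; }   // always report an error
}

[TestMethod]
public void Export_reports_failure_when_storage_returns_an_error_code()
{
    var exporter = new Exporter(new FailingStorage());
    Assert.IsFalse(exporter.Export("anything"));
}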
These criteria can help to come to an informed decision about whether isolation is necessary. Considering your specific example: The way you have described the situation I get the impression that none of the above criteria is fulfilled. Which for me would lead to the conclusion that I would not isolate the function PersonNameValidator from its DOCs FirstNameValidator and LastNameValidator.
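Following that conclusion, a sketch of the non-isolated test (the shape of PersonName and PersonNameValidator is assumed from the question):
[TestMethod]
public void Accepts_a_valid_first_and_last_name()
{
    Assert.IsTrue(PersonNameValidator.Validate(new PersonName("John", "Doe")));
}

[TestMethod]
public void Rejects_a_name_with_an_invalid_part()
{
    Assert.IsFalse(PersonNameValidator.Validate(new PersonName("$$123", "Doe")));
    Assert.IsFalse(PersonNameValidator.Validate(new PersonName("John", "__%%")));
}
// No stubbing of FirstNameValidator or LastNameValidator: the tests only state what
// PersonNameValidator is expected to do, not how it delegates.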
What is considered the best approach to unit test a complex unit such as a compiler?
I've written a few compilers and interpreters over the years, and I do find this kind of code quite hard to test in a good way.
If we take something like Abstract Syntax Tree generation, how would you test this using TDD?
Small constructs might be easy to test.
e.g. something along the lines of:
string code = @"public class Foo {}";
AST ast = compiler.Parse(code);
Since that won't generate a lot of AST nodes.
But if I actually want to test that the compiler can generate an AST for something like a method:
[TestMethod]
public void Can_parse_integer_instance_method_in_class()
{
    string code = @"public class Foo { public int method(){ return 0;}}";
    AST ast = compiler.Parse(code);
What would you assert on?
Manually defining an AST that represents the given code and asserting that the generated AST conforms to it seems horribly cumbersome and might even be error-prone.
So what are the best tactics for TDD'ing complex scenarios like this?
Firstly, if you test a compiler, you cannot have enough tests! Users rely on compiler-generated output as if it were a golden standard, so be very aware of quality. So if you can, test with every test you can come up with!
Secondly, use all testing methods available, but use them where appropriate. Indeed, you may be able to mathematically prove that a certain transformation is correct. If you are able to do so, you should.
But every compiler that I've seen some internals of involves heuristics and a lot of optimized, hand-crafted code; thus assisted proving methods typically are not applicable any more. Here, testing comes into play, and I mean a lot of it!
When collecting tests, please consider different cases:
Positive Standard-Conformance: your frontend should accept certain code patterns, and the compiler must produce a correctly running program from them. Tests in this category either need a golden-reference compiler or generator that produces the correct output of the test program, or they involve hand-written programs that check against values furnished by human reasoning.
Negative Tests: every compiler has to reject faulty code, like syntax errors, type mismatches et cetera. It must produce certain types of error and warning messages. I don't know of any method to auto-generate such tests. So these need to be human-written, too.
Transformation tests: whenever you come up with a fancy optimization within your compiler (middle-end), you probably have some example code in mind that demonstrates the optimization. Be aware of transformations before and after such a module; they might need special options to your compiler, or a bare-bones compiler with just that module plugged in. Test a reasonably big set of surrounding module combinations, too. I usually did regression testing on the intermediate representation before and after the specific transformation, defining a reference by intensive reasoning with colleagues. Try to write code on both sides of the transformation, i.e. code snippets that you want to have transformed and slightly different ones that must not be.
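As a hedged sketch of such a transformation test (the Ir parser, its textual form and the LoopInvariantCodeMotion pass are all invented): one snippet that must be transformed and a near-miss that must not be.
[TestMethod]
public void Invariant_expression_is_hoisted_out_of_the_loop()
{
    var before = Ir.Parse("for i in 0..n { x = a * b; sum = sum + x }");   // a*b does not depend on i
    var after  = new LoopInvariantCodeMotion().Run(before).ToText();

    StringAssert.Contains(after, "x = a * b");                             // still computed...
    Assert.IsTrue(after.IndexOf("x = a * b") < after.IndexOf("for"),       // ...but before the loop
        "expected the invariant assignment to be hoisted above the loop");
}

[TestMethod]
public void Loop_dependent_expression_is_left_alone()
{
    var before = Ir.Parse("for i in 0..n { x = a * i; sum = sum + x }");   // a*i depends on i
    var after  = new LoopInvariantCodeMotion().Run(before).ToText();

    Assert.IsTrue(after.IndexOf("for") < after.IndexOf("x = a * i"),
        "the loop-dependent assignment must stay inside the loop");
}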
Now, this sounds like an awful lot of work! Yes, it does, but there is help: there are several commercial test suites for (C) compilers out in the world, and experts who might help you apply them. Here is a small list of those known to me:
ACE: SuperTest Compiler Test and Validation Suite
Perennial C Compiler Validation Suite
Bugseng: ECLAIR
Tests in the GCC and LLVM environment
you name it! Or google "Compiler Test Suite"
First of all, parsing is usually a trivial part of a compiler project. From my experience it never takes more than 10% of the time (unless we are talking about C++, but you wouldn't be asking questions here if you were designing that), so you'd rather not invest much of your time in parser tests.
Still, TDD (or whatever you call it) has its share in developing the middle-end, where you often want to verify that, e.g., optimizations you've just added actually result in the expected code transformation. From my experience, tests like this are usually implemented by giving the compiler specially crafted test programs and grepping the output assembly for expected patterns (was this loop unrolled four times? did we manage to avoid memory writes in this function? etc.). Grepping assembly isn't as good as analyzing a structured representation (S-exprs or XML), but it's cheap and works fine in most cases. It's awfully hard to maintain as your compiler grows, though.
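A sketch of that grep-the-assembly style of test; the compiler name, flags, input file and the instruction pattern are all made up for illustration.
[TestMethod]
public void Small_copy_loop_is_unrolled_four_times()
{
    var startInfo = new System.Diagnostics.ProcessStartInfo
    {
        FileName = "mycc",                                  // hypothetical compiler under test
        Arguments = "-O2 -S tests/copy_loop.c -o -",        // emit assembly to stdout
        RedirectStandardOutput = true,
        UseShellExecute = false
    };
    using (var compiler = System.Diagnostics.Process.Start(startInfo))
    {
        string asm = compiler.StandardOutput.ReadToEnd();
        compiler.WaitForExit();

        // Crude but cheap: count the copy instruction inside the emitted loop body.
        int copies = System.Text.RegularExpressions.Regex.Matches(asm, @"\bmovq\b").Count;
        Assert.IsTrue(copies >= 4, "expected the loop body to be unrolled four times");
    }
}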
As part of a university project, we have to write a compiler for a toy language. In order to do some testing for this, I was considering how best to go about writing something like unit tests. As the compiler is being written in Haskell, HUnit and QuickCheck are both available, but perhaps not quite appropriate.
How can we do any kind of non-manual testing?
The only idea I've had is effectively compiling to Haskell too, seeing what the output is, and using some shell script to compare this to the output of the compiled program - this is quite a bit of work, and isn't too elegant either.
The unit testing is to help us, and isn't part of assessed work itself.
This really depends on what parts of the compiler you are writing. It is nice if you can keep phases distinct to help isolate problems, but, in any phase, and even at the integration level, it is perfectly reasonable to have unit tests that consist of pairs of source code and hand-compiled code. You can start with the simplest legal programs possible, and ensure that your compiler outputs the same thing that you would if compiling by hand.
As complexity increases, and hand-compiling becomes unwieldy, it is helpful for the compiler to keep some kind of log of what it has done. Then you can consult this log to determine whether or not specific transformations or optimizations fired for a given source program.
Depending on your language, you might consider a generator of random programs from a collection of program fragments (in the QuickCheck vein). This generator can test your compiler's stability, and ability to deal with potentially unforeseen inputs.
Unit tests should test a small piece of code, typically one class or one function. The lexical and semantic analyses will each have their own unit tests. The intermediate representation generator will also have its own tests.
A unit test covers a simple test case: it invokes the function to be unit tested in a controlled environment and verifies (asserts) the result of the function execution. A unit test usually tests one behaviour only and has the following structure, called AAA:
Arrange: create the environment the function will be called in
Act: invoke the function
Assert: verify the result
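A minimal AAA-shaped example; the Lexer API is assumed purely for illustration.
[TestMethod]
public void Lexer_recognises_an_integer_literal()
{
    // Arrange: create the environment the function will be called in
    var lexer = new Lexer("42");

    // Act: invoke the function
    Token token = lexer.NextToken();

    // Assert: verify the result
    Assert.AreEqual(TokenKind.IntegerLiteral, token.Kind);
    Assert.AreEqual("42", token.Text);
}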
Have a look at shelltestrunner. Here are some example tests. It is also being used in this compiler project.
One option is the approach this guy is using to test real compilers: get together with as many people as you can talk into it, each of you compiles and runs the same set of programs, and then you compare the outputs. Be sure to add every test case you use, as more inputs make it more effective. A little fun with automation and source control and you can make it fairly easy to maintain.
Be sure to get it OKed by the prof first but as you will only be sharing test cases and outputs I don't see where he will have much room to object.
Testing becomes more difficult once the output of your program goes to the console (such as standard output). Then you have to resort to some external tool, like grep or expect to check the output.
Keep the return values from your functions in data structures for as long as possible. If the output of your compiler is, say, assembly code, build a string in memory (or a list of strings) and output it at the last possible moment. That way you can test the contents of the strings more directly and quickly.
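A sketch of that approach, with the CodeGenerator, its Lines list and EmitReturn all assumed for illustration:
[TestMethod]
public void Return_statement_emits_mov_and_ret()
{
    var gen = new CodeGenerator();          // collects instructions in an in-memory List<string>
    gen.EmitReturn(0);

    // Assert directly on the in-memory lines instead of parsing console output.
    CollectionAssert.Contains(gen.Lines, "mov eax, 0");
    Assert.AreEqual("ret", gen.Lines[gen.Lines.Count - 1]);
}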
A lot of code in a current project is directly related to displaying things using a 3rd-party 3D rendering engine. As such, it's easy to say "this is a special case, you can't unit test it". But I wonder if this is a valid excuse... it's easy to think "I am special" but rarely actually the case.
Are there types of code which are genuinely not suited for unit testing? By suitable, I mean "without it taking longer to figure out how to write the test than the effort is worth"... when dealing with a ton of 3D math/rendering, it could take a lot of work to prove the output of a function correct, compared with just looking at the rendered graphics.
Code that directly relates to displaying information, generating images and even general UI stuff, is sometimes hard to unit-test.
However that mostly applies only to the very top level of that code. Usually 1-2 method calls below the "surface" is code that's easily unit tested.
For example, it may be nontrivial to test that some information is correctly animated into the dialog box when a validation fails. However, it's very easy to check if the validation would fail for any given input.
Make sure to structure your code in a way that the "non-testable" surface area is well separated from the rest, and write extensive tests for the non-surface code.
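For instance, a sketch of a test one layer below the surface (the EmailRule and its result type are invented): the animated dialog is left to manual or integration testing, while the rule itself is trivially checked.
[TestMethod]
public void Empty_email_fails_validation_with_a_message()
{
    var result = EmailRule.Validate("");

    Assert.IsFalse(result.IsValid);
    Assert.AreEqual("Email is required", result.Message);
}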
The point of unit-testing your rendering code is not to demonstrate that the third-party-code does the right thing (that is for integration and regression testing). The point is to demonstrate that your code gives the right instructions to the third-party code. In other words, you only have to control the input of your code layer and verify the output (which would become the input of the renderer).
Of course, you can create a mock version of the renderer which does cheap ASCII graphics or something and then verify the pseudo-graphics; this can make the test clearer, but it is not strictly necessary for a unit test of your code.
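A sketch of checking the instructions your layer sends, with a hand-rolled recording double standing in for the third-party renderer (the interface and RectangleShape are invented):
public interface IRenderer
{
    void DrawLine(float x1, float y1, float x2, float y2);
}

public class RecordingRenderer : IRenderer   // records what our code asked the renderer to do
{
    public readonly System.Collections.Generic.List<string> Calls = new System.Collections.Generic.List<string>();

    public void DrawLine(float x1, float y1, float x2, float y2)
    {
        Calls.Add(string.Format("line {0},{1} -> {2},{3}", x1, y1, x2, y2));
    }
}

[TestMethod]
public void Rectangle_is_drawn_as_four_lines()
{
    var renderer = new RecordingRenderer();
    var shape = new RectangleShape(0, 0, 2, 1);   // hypothetical code under test

    shape.Draw(renderer);

    Assert.AreEqual(4, renderer.Calls.Count);     // we verify our output, i.e. the renderer's input
}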
If you cannot break your code into units, it is very hard to unit test.
My guess would be that if you have 3D atomic functions (say translate, rotate, and project a point) they should be easily testable - create a set of test points and test whether the transformation takes a point to where it should.
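A sketch of such a point-transform test (Point3 and Transform.Translate are assumed names); note the tolerance because of floating point.
[TestMethod]
public void Translation_moves_a_point_by_the_given_offset()
{
    var p = new Point3(1.0, 2.0, 3.0);

    var moved = Transform.Translate(p, dx: 1.0, dy: -2.0, dz: 0.5);

    Assert.AreEqual(2.0, moved.X, 1e-9);   // AreEqual overload with a delta for doubles
    Assert.AreEqual(0.0, moved.Y, 1e-9);
    Assert.AreEqual(3.5, moved.Z, 1e-9);
}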
If you can only reach the 3D code through a limited API, then it would be hard to test.
Please see Misko Hevery's Testability posts and his testability guide.
I think this is a good question. I wrestle with this all the time, and it seems like there are certain types of code that fit into the unit testing paradigm and other types that do not.
What I consider clearly unit-testable is code that obviously has room for being wrong. Examples:
Code to compute hairy math or linear algebra functions. I always write an auxiliary function to check the answers, and run it once in a while.
Hairy data structure code, with cross-references, guids, back-pointers, and methods for incrementally keeping it consistent. These are really easy to break, so unit tests are good for seeing if they are broken.
On the other hand, in code with low redundancy, if the code compiles it may not be clear what being wrong even means. For example, I do pretty complicated UIs using dynamic dialogs, and it's not clear what to test. All the kinds of things like event handling, layout, and showing / hiding / updating of controls that might make this code error-prone are simply dealt with in a well-verified layer underneath.
The kind of testing I find myself needing more than unit testing is coverage testing. Have I tried all the possible features and combinations of features? Since this is a very large space and it is prohibitive to write automated tests to cover it, I often find myself doing Monte Carlo testing instead, where feature selections are chosen at random and submitted to the system. The result is then examined in an automated and/or manual way.
If you can grab the rendered image, you can unit test it.
Simply render some images with the current codebase, see if they "look right" (examining them down to the pixel if you have to), and store them for comparison. Your unit tests could then compare to those stored images and see if the result is the same.
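A sketch of that comparison (the reference path and the render call are assumptions):
[TestMethod]
public void Teapot_render_matches_the_stored_reference_image()
{
    byte[] expected = System.IO.File.ReadAllBytes("references/teapot.raw");   // stored known-good pixels
    byte[] actual = Renderer.RenderToPixelBuffer("scenes/teapot.scene");       // hypothetical render call

    CollectionAssert.AreEqual(expected, actual);   // pixel-for-pixel comparison
}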
Whether or not this is worth the trouble, that's for you to decide.
Break down the rendering into steps and test by comparing the frame buffer for each step to a known-good image.
No matter what you have, it can be broken down into numbers which can be compared. The real trick is when you have a random number generator in the algorithm, or some other nondeterministic part.
With things like floating point, you might need to subtract the generated data from the expected data and check that the difference is less than some error threshold.
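A sketch of that threshold comparison (Projection.Project and the numbers are invented):
[TestMethod]
public void Projected_components_match_expected_within_tolerance()
{
    double[] expected = { 0.5, 0.25, 1.0 };
    double[] actual = Projection.Project(new[] { 1.0, 0.5, 2.0 }, focalLength: 2.0);

    const double tolerance = 1e-6;
    for (int i = 0; i < expected.Length; i++)
        Assert.IsTrue(System.Math.Abs(expected[i] - actual[i]) < tolerance,
            string.Format("component {0} differs by more than {1}", i, tolerance));
}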
Well you can't unit test certain kinds of exception code but other than that ...
I've got true unit tests for some code that looks impossible to even attach a test harness to and code that looks like it should be unit testable but isn't.
One of the ways you know your code is not unit testable is when it depends on the physical characteristics of the device it runs on. Another kind of not unit-testable code is direct UI code (and I find a lot of breaks in direct UI code).
I've also got a huge chunk of non unit-testable code that has appropriate integration tests.
This is a difficult and open-ended question I know, but I thought I'd throw it to the floor and see if anyone had any interesting suggestions.
I have developed a code generator that takes our Python interface to our C++ code (generated via SWIG) and generates the code needed to expose this as web services. When I developed this code I did it using TDD, but I've found my tests to be brittle as hell. Because each test essentially wanted to verify that for a given bit of input code (which happens to be a C++ header) I'd get a given bit of output code, I wrote a small engine that reads test definitions from XML input files and generates test cases from these expectations.
The problem is I dread going in to modify the code at all. That and the fact that the unit tests themselves are a: complex, and b: brittle.
So I'm trying to think of alternative approaches to this problem, and it strikes me I'm perhaps tackling it the wrong way. Maybe I need to focus more on the outcome, i.e. does the code I generate actually run and do what I want, rather than does the code look the way I want it to?
Has anyone got any experiences of something similar to this they would care to share?
I started writing up a summary of my experience with my own code generator, then went back and re-read your question and found you had already touched upon the same issues yourself, focus on the execution results instead of the code layout/look.
Problem is, this is hard to test, the generated code might not be suited to actually run in the environment of the unit test system, and how do you encode the expected results?
I've found that you need to break down the code generator into smaller pieces and unit test those. Unit testing a full code generator is more like integration testing than unit testing if you ask me.
Recall that "unit testing" is only one kind of testing. You should be able to unit test the internal pieces of your code generator. What you're really looking at here is system level testing (a.k.a. regression testing). It's not just semantics... there are different mindsets, approaches, expectations, etc. It's certainly more work, but you probably need to bite the bullet and set up an end-to-end regression test suite: fixed C++ files -> SWIG interfaces -> python modules -> known output. You really want to check the known input (fixed C++ code) against expected output (what comes out of the final Python program). Checking the code generator results directly would be like diffing object files...
Yes, results are the ONLY thing that matters. The real chore is writing a framework that allows your generated code to run independently... spend your time there.
If you are running on *nix you might consider dumping the unit-test framework in favor of a bash script or makefile. On Windows you might consider building a shell app/function that runs the generator and then uses the code (as another process) and unit-test that.
A third option would be to generate the code and then build an app from it that includes nothing but a unittest. Again you would need a shell script or whatnot to run this for each input. As to how to encode the expected behavior, it occurs to me that it could be done in much the same way as you would for the C++ code just using the generated interface rather than the C++ one.
Just wanted to point out that you can still achieve fine-grained testing while verifying the results: you can test individual chunks of code by nesting them inside some setup and verification code:
int x = 0;
GENERATED_CODE
assert(x == 100);
Provided you have your generated code assembled from smaller chunks, and the chunks do not change frequently, you can exercise more conditions and test a little better, and hopefully avoid having all your tests break when you change specifics of one chunk.
Unit testing is just that: testing a specific unit. So if you are writing a specification for class A, it is ideal if class A does not use the real concrete versions of classes B and C.
OK, I noticed afterwards that the tags for this question include C++ / Python, but the principles are the same:
public class A : InterfaceA
{
    InterfaceB _b;
    InterfaceC _c;

    public A(InterfaceB b, InterfaceC c)
    {
        this._b = b;
        this._c = c;
    }

    public string SomeOperation(string input)
    {
        return this._b.SomeOtherOperation(input)
             + this._c.EvenAnotherOperation(input);
    }
}
Because the above System A injects interfaces to systems B and C, you can unit test just system A, without having real functionality being executed by any other system. This is unit testing.
Here is a clever way of approaching a system from creation to completion, with a different When specification for each piece of behaviour:
public class When_system_A_has_some_operation_called_with_valid_input : SystemASpecification
{
    private string _actualString;
    private string _expectedString;
    private string _input;
    private string _returnB;
    private string _returnC;

    [It]
    public void Should_return_the_expected_string()
    {
        _actualString.Should().Be.EqualTo(this._expectedString);
    }

    public override void GivenThat()
    {
        var randomGenerator = new RandomGenerator();
        this._input = randomGenerator.Generate<string>();
        this._returnB = randomGenerator.Generate<string>();
        this._returnC = randomGenerator.Generate<string>();
        Dep<InterfaceB>().Stub(b => b.SomeOtherOperation(_input))
            .Return(this._returnB);
        Dep<InterfaceC>().Stub(c => c.EvenAnotherOperation(_input))
            .Return(this._returnC);
        this._expectedString = this._returnB + this._returnC;
    }

    public override void WhenIRun()
    {
        this._actualString = Sut.SomeOperation(this._input);
    }
}
So in conclusion, a single unit / specification can have multiple behaviours, and the specification grows as you develop the unit / system; and if your system under test depends on other concrete systems within it, watch out.
My recommendation would be to figure out a set of known input-output results, such as some simpler cases that you already have in place, and unit test the code that is produced. It's entirely possible that as you change the generator, the exact string that is produced may be slightly different... but what you really care about is whether it is interpreted in the same way. Thus, if you test the results as you would test that code if it were your feature, you will find out whether it succeeds in the ways you want.
Basically, what you really want to know is whether your generator will produce what you expect without physically testing every possible combination (also: impossible). By ensuring that your generator is consistent in the ways you expect, you can feel better that the generator will succeed in ever-more-complex situations.
In this way, you can also build up a suite of regression tests (unit tests that need to keep working correctly). This will help you make sure that changes to your generator aren't breaking other forms of code. When you encounter a bug that your unit tests didn't catch, you may want to include it to prevent similar breakage.
I find that you need to test what you're generating more than how you generate it.
In my case, the program generates many types of code (C#, HTML, SCSS, JS, etc.) that compile into a web application. The best way I've found to reduce regression bugs overall is to test the web application itself, not by testing the generator.
Don't get me wrong, there are still unit tests checking out some of the generator code, but our biggest bang for our buck has been UI tests on the generated app itself.
Since we're generating it, we also generate a nice abstraction in JS that we can use to programmatically test the app. We followed some ideas outlined here: http://code.tutsplus.com/articles/maintainable-automated-ui-tests--net-35089
The great part is that it really tests your system end-to-end, from code generation out to what you're actually generating. Once a test fails, it's easy to track it back to where the generator broke.
It's pretty sweet.
Good luck!