How to test scientific software? - unit-testing

I'm convinced that software testing indeed is very important, especially in science. However, over the last 6 years, I have never come across a scientific software project that was under regular testing (and most of them were not even version controlled).
Now I'm wondering how you deal with software tests for scientific codes (numerical computations).
From my point of view, standard unit tests often miss the point, since there is no exact result, so using assert(a == b) might prove a bit difficult due to "normal" numerical errors.
So I'm looking forward to reading your thoughts about this.

I am also in academia and I have written quantum mechanical simulation programs to be executed on our cluster. I made the same observation regarding testing or even version control. It was even worse: in my case I use a C++ library for my simulations, and the code I got from others was pure spaghetti code, no inheritance, not even functions.
I rewrote it and also implemented some unit testing. You are correct that you have to deal with numerical precision, which can differ depending on the architecture you are running on. Nevertheless, unit testing is possible as long as you take these numerical rounding errors into account. Your result should not depend on the rounding of the numerical values; otherwise you would have a different problem with the robustness of your algorithm.
So, to conclude, I use unit testing for my scientific programs, and it really makes one more confident about the results, especially with regard to publishing the data in the end.
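As an illustration, here is a minimal sketch of the kind of tolerance-based check this implies, in plain C++ with no particular test framework assumed; the tolerance values are placeholders you would tune to your algorithm and architecture:

#include <algorithm>
#include <cassert>
#include <cmath>

// Compare two floating-point results within absolute and relative tolerances
// instead of asserting exact equality. The tolerances here are illustrative.
bool approx_equal(double a, double b, double abs_tol = 1e-12, double rel_tol = 1e-9)
{
    double diff = std::fabs(a - b);
    if (diff <= abs_tol) return true;  // handles results near zero
    return diff <= rel_tol * std::max(std::fabs(a), std::fabs(b));
}

int main()
{
    double computed = 0.1 + 0.2;                 // accumulates rounding error
    double expected = 0.3;
    assert(approx_equal(computed, expected));    // passes despite the rounding
    return 0;
}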

I've just been looking at a similar issue (google: "testing scientific software") and came up with a few papers that may be of interest. These cover both mundane coding errors and the bigger issue of knowing whether the result is even right (depth of the Earth's mantle?).
http://http.icsi.berkeley.edu/ftp/pub/speech/papers/wikipapers/cox_harris_testing_numerical_software.pdf
http://www.cs.ua.edu/~SECSE09/Presentations/09_Hook.pdf (broken link; new link is http://www.se4science.org/workshops/secse09/Presentations/09_Hook.pdf)
http://www.associationforsoftwaretesting.org/?dl_name=DianeKellyRebeccaSanders_TheChallengeOfTestingScientificSoftware_paper.pdf
I thought the idea of mutation testing described in 09_Hook.pdf (see also matmute.sourceforge.net) is particularly interesting as it mimics the simple mistakes we all make. The hardest part is to learn to use statistical analysis for confidence levels, rather than single pass code reviews (man or machine).
The problem is not new. I'm sure I have an original copy of "How accurate is scientific software?" by Hatton et al. (Oct 1994), which even then showed how different implementations of the same theories (as algorithms) diverged rather rapidly. (It's also ref. 8 in the Kelly & Sanders paper.)
Edit (Oct 2019): more recently, see "Testing Scientific Software: A Systematic Literature Review".

I also use cpptest for its TEST_ASSERT_DELTA. I write high-performance numerical programs in computational electromagnetics, and I've been happily using it in my C++ code.
I typically go about testing scientific code the same way as I do with any other kind of code, with only a few retouches, namely:
I always test my numerical codes for cases that make no physical sense and make sure the computation actually stops before producing a result. I learned this the hard way: I had a function that computed some frequency responses, then supplied a matrix built from them to another function, which eventually gave its answer as a single vector. The matrix could be any size depending on how many terminals the signal was applied to, but my function was not checking whether the matrix size was consistent with the number of terminals (2 terminals should have meant a 2 x 2 x n matrix); the code itself was wrapped so as not to depend on that, since it just had to do some basic matrix operations regardless of size. Eventually, the results were perfectly plausible, well within the expected range and, in fact, partially correct -- only half of the solution vector was garbled. It took me a while to figure out. If your data looks correct, it's assembled in a valid data structure and the numerical values are good (e.g. no NaNs or a negative number of particles), but it doesn't make physical sense, the function has to fail gracefully (a sketch of this kind of check follows after this list).
I always test the I/O routines even if they are just reading a bunch of comma-separated numbers from a test file. When you're writing code that does twisted math, it's always tempting to jump into debugging the part of the code that is so math-heavy that you need a caffeine jolt just to understand the symbols. Days later, you realize you are also adding the ASCII value of \n to your list of points.
When testing for a mathematical relation, I always test it "by the book", and I also learned this by example. I've seen code that was supposed to compare two vectors but only checked for equality of elements and did not check for equality of length.
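For illustration only, here is a minimal sketch of the habits described above; the names and matrix layout are hypothetical, not the original code. It shows a precondition check that makes the function fail loudly when the input makes no physical sense, and a "by the book" vector comparison that checks the length as well as the elements:

#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical frequency-response container: terminals x terminals x n samples,
// stored flat. The layout is invented for this sketch.
struct ResponseMatrix {
    std::size_t terminals;
    std::size_t samples;
    std::vector<double> data;   // expected size: terminals * terminals * samples
};

// Fail loudly when the matrix shape is inconsistent with the terminal count,
// instead of silently producing a half-garbled solution vector.
void check_consistency(const ResponseMatrix& m, std::size_t n_terminals)
{
    if (m.terminals != n_terminals)
        throw std::invalid_argument("matrix does not match the number of terminals");
    if (m.data.size() != m.terminals * m.terminals * m.samples)
        throw std::invalid_argument("matrix storage has the wrong number of entries");
}

// "By the book" comparison: check the lengths first, then every element
// within a tolerance (exact equality rarely makes sense for doubles).
bool vectors_equal(const std::vector<double>& a, const std::vector<double>& b, double tol = 1e-12)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > tol) return false;
    return true;
}

int main()
{
    ResponseMatrix m{2, 8, std::vector<double>(2 * 2 * 8, 0.0)};
    check_consistency(m, 2);                      // passes for a consistent shape
    std::vector<double> x{1.0, 2.0}, y{1.0, 2.0};
    return vectors_equal(x, y) ? 0 : 1;           // returns 0
}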

Please take a look at the answers to the SO question How to use TDD correctly to implement a numerical method?

Related

Recommendable test for software integer multiplication?

I've coded a number of integer multiplication routines for Atmel's AVR architecture. I found that following a simple pattern for the multiplier (and a similar one for the multiplicand) is useful, if unconvincing: start at zero, step by a one in every byte (in addition to eventual carries).
There seems to be quite a bit about testing hardware multiplier implementations, but:
What can be recommended for testing software implementations of integer multiplication? Exhaustive testing gets out of hand - if not at, then beyond 16×16 bit.
Most approaches use generate & test:
generate test operands
The safest approach is to use all combinations of operands. But with bignums, or with limited computing power or memory, this is not possible or practical. In that case, the usual choice is a set of test cases that will (most likely) stress the tested algorithm (carry propagation, values near the limits of the range) and to use only those. Another option is to use pseudo-random operands and hope for a representative test sample, or to combine all of these approaches in some way.
compute multiplication
Just apply the tested algorithm to the generated input data.
assess the results
So you need to compare the result of the algorithm (the multiplication) to something. Mostly a different algorithm for the same operation is used for the comparison, or an inverse function (like c=a*b; if (a!=c/b) ...). The problem with this is that you cannot distinguish between an error in the tested algorithm and an error in the reference, unless you use something that is known to be 100% correct or you compare against more than one independent check. There is also the possibility of a precomputed result table (computed on a different, 100% bug-free platform) to compare against. A minimal sketch of this generate/compute/assess loop follows below.
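Here is that loop in C++; mul16() is a hypothetical routine under test, and the compiler's built-in 32-bit multiplication stands in as the trusted reference:

#include <cstdint>
#include <cstdio>
#include <random>

// Hypothetical routine under test: 16x16 -> 32 bit multiplication.
// Replace the body with the software implementation being verified.
uint32_t mul16(uint16_t a, uint16_t b)
{
    return static_cast<uint32_t>(a) * b;   // placeholder
}

int main()
{
    // Edge cases that stress carry propagation and the limits of the range.
    const uint16_t edges[] = {0, 1, 2, 0x00FF, 0x0100, 0x7FFF, 0x8000, 0xFFFF};
    for (uint16_t a : edges)
        for (uint16_t b : edges)
            if (mul16(a, b) != static_cast<uint32_t>(a) * b)
                std::printf("FAIL: %u * %u\n", (unsigned)a, (unsigned)b);

    // Pseudo-random operands for broader, non-exhaustive coverage.
    std::mt19937 rng(12345);   // fixed seed so failures are reproducible
    std::uniform_int_distribution<uint32_t> dist(0, 0xFFFF);
    for (int i = 0; i < 1000000; ++i) {
        uint16_t a = static_cast<uint16_t>(dist(rng));
        uint16_t b = static_cast<uint16_t>(dist(rng));
        if (mul16(a, b) != static_cast<uint32_t>(a) * b)
            std::printf("FAIL: %u * %u\n", (unsigned)a, (unsigned)b);
    }
    return 0;
}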
The AVR tag makes this a tough question, so some hints:
many MCUs have a lot of Flash program memory that can be used to store precomputed values to compare against
sometimes running in an emulator is faster/more comfortable, but with the risk of missing some HW-related issues.

About optimized math functions, ranges and intervals

I'm trying to wrap my head around how people who code math functions for games and rendering engines can use an optimized math function in an efficient way; let me explain that further.
There is a high need for fast trigonometric functions in those fields. At times you can optimize a sin, a cos or another function by rewriting it in a different form that is valid only for a given interval; often this means that your approximation of f(x) is valid just for the first quadrant, meaning 0 <= x <= pi/2.
Now the input to your f(x) still spans all 4 quadrants, but the real formula only covers 1/4 of that interval. The straightforward solution is to detect the quadrant by analyzing the input and seeing which range it belongs to, then adjusting the result of the formula accordingly if the input comes from a quadrant other than the first.
This is good in theory, but it also presents a couple of really bad problems, especially considering that you are doing all this to steal a couple of cycles from your CPU (you also get a consistent implementation that is not platform dependent, unlike a hardcoded fsin in Intel x86 asm that only works on x86 and has a certain error range; all of this may differ on other platforms with other asm instructions), so you should keep things working at a concurrent and high-performance level.
The reasons I can't wrap my head around the "switch case on quadrants" solution are:
it just prevents possible optimizations, namely memoization, considering that you usually want to put that switch-case inside the same function that actually computes f(x); the situation can probably be improved by implementing the formula for f(x) outside said function, but this will double the number of functions to maintain for any given math library
it increases the probability of more branching with concurrent execution
generally speaking it doesn't lead to better, clean, DRY code, and conditional statements are often a potential source of bugs; I don't really like switch-cases and similar constructs
Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs (the quadrants) to the result via the actual implementation?
Note: In the below answer I am speaking very generally about code.
Assuming that I can implement my cross-platform f(x) in C or C++, how do programmers in this field usually address the problem of translating and mapping the inputs (the quadrants) to the result via the actual implementation?
The general answer to this is: In the most obvious and simplest way possible that achieves your purpose.
I'm not entirely sure I follow most of your arguments/questions but I have a feeling you are looking for problems where really none exist. Do you truly have the need to re-implement the trigonometric functions? Don't fall into the trap of NIH (Not Invented Here).
the straightforward solution is to detect the quadrant
Yes! I love straightforward code. Code where it is perfectly obvious at a glance what it does. Now, sometimes, just sometimes, you have to do some crazy things to get it to do what you want: for performance, or to avoid bugs out of your control. But the first version should be the most obvious and simple code that solves your problem. From there you do testing, profiling, and benchmarking, and if (only if) you find performance or other issues, then you go into the crazy stuff.
This is good in theory, but it also presents a couple of really bad problems
I would say that this is good in theory and in practice for most cases and I definitely don't see any "bad" problems. Minor issues in specific corner cases or design requirements at most.
A few things on a few of your specific comments:
approximation of f(x) is just for the first quadrant: Yes, and there are multiple reasons for this. One is simply that most trigonometric functions have identities, so you can easily use these to reduce the range of the input parameters. This is important, as many numerical techniques only work over a specific range of inputs, or are more accurate/performant for small inputs. Next, for very large inputs you'll have to trim the range anyway for most numerical techniques to work, or at least to work in a decent amount of time with sufficient accuracy. For example, look at the Taylor expansion for cos() and see how long it takes to converge sufficiently for large vs small inputs. (A sketch of this kind of range reduction follows after this list.)
it just prevents possible optimizations: Chances are your C++ compiler these days is way better at optimization than you are. Sometimes it isn't, but the general procedure is to let the compiler do its optimization and only optimize manually where you have measured and proven that you need it. These days it is very non-intuitive to tell which code is faster just by looking at it (you can read all the questions on SO about performance issues and how surprising some of the root causes are).
namely memoization: I've never seen memoization in place for a function of a double. Just think how many doubles there are between 0 and 1. In reduced-accuracy situations you can take advantage of it, but that is more easily implemented as a custom function tailored to that exact situation. Thinking about it, I'm not exactly sure how to implement memoization for a double function in a way that actually means anything and doesn't lose accuracy or performance in the process.
increases the probability of more branching with concurrent execution: I'm not sure I'd implement trigonometric functions in a concurrent manner, but I suppose it's entirely possible to get some performance benefit. But again, the compiler is generally better at optimization than you, so let it do its job and then benchmark/profile to see if you really need to do better.
doesn't lead to better, clean, DRY code: I'm not sure what exactly you mean here, or what "DRY code" is for that matter. Yes, sometimes you can get into trouble with too many or too complex if/switch blocks, but I can't see how a simple check for 4 quadrants applies here... it's a pretty basic and simple case.
So for any platform I get the same y for the same values of x: My guess is that getting "exact" values for all 53 bits of double across multiple platforms and systems is not going to be possible. What is the result if you only have 52 bits correct? This would be a great area to do some tests in and see what you get.
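As a rough illustration of the range reduction mentioned above, here is a sketch of the quadrant mapping for sin(); std::sin stands in for the quadrant-limited approximation just to keep the example short, and this is not a production implementation:

#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979323846;

// Placeholder for a first-quadrant approximation, valid only for 0 <= x <= pi/2.
// A real implementation would use a tuned polynomial here.
double sin_first_quadrant(double x)
{
    return std::sin(x);
}

// Map any argument onto the first quadrant using the usual symmetries,
// then fix up the sign of the result.
double my_sin(double x)
{
    x = std::fmod(x, 2.0 * PI);          // reduce to (-2*pi, 2*pi)
    if (x < 0.0) x += 2.0 * PI;          // then to [0, 2*pi)

    double sign = 1.0;
    if (x >= PI) {                       // quadrants III and IV: sin is negative
        x -= PI;
        sign = -1.0;
    }
    if (x > PI / 2.0)                    // quadrant II: mirror about pi/2
        x = PI - x;

    return sign * sin_first_quadrant(x);
}

int main()
{
    std::printf("%f %f\n", my_sin(4.0), std::sin(4.0));   // should agree closely
    return 0;
}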
I've used trigonometric functions in C for over 20 years and 99% of the time I just use whatever built-in function is supplied. In the rare case I need more performance (or accuracy) as proven by testing or benchmarking, only then do I actually roll my own custom implementation for that specific case. I don't rewrite the entire gamut of <math.h> functions in the hope that one day I might need them.
I would suggest trying to code a few of these functions in as many ways as you can find and doing some accuracy and benchmark tests. This will give you practical knowledge and some hard data on whether you actually need to reimplement these functions or not. At the very least this should give you some practical experience with implementing these types of functions, and chances are it will answer a lot of your questions in the process.

Why don't games use expression templates for math?

I can imagine expression templates doing awful things to compile times for things as pervasive as vectors/matrices/quaternions etc., but if they are such a great speed boost, why don't games use them? It's quite obvious that SIMD instructions can exploit data-level parallelism to great effect. Expression templates and lazy evaluation together seem to make sense, at least when it comes to eliminating temporaries.
So while libraries like Eigen advertise such features, I don't see this done commonly in middleware (e.g. Havok) or games where things are extremely speed critical. Can anyone shed some light on this? Does it have to do with non-determinism or branch prediction?
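For readers who haven't seen the technique, here is a minimal sketch (my own illustration, not code from Eigen or any game library) of what an expression-template vector addition looks like and why it eliminates temporaries: '+' builds a lightweight expression object, and the single loop runs only when the result is assigned.

#include <cstddef>
#include <iostream>

struct Vec3 {
    double data[3];
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }

    // Assigning from any expression evaluates it element by element, once.
    template <class Expr>
    Vec3& operator=(const Expr& e)
    {
        for (std::size_t i = 0; i < 3; ++i) data[i] = e[i];
        return *this;
    }
};

// Lazy addition node: stores references, evaluates on element access.
template <class L, class R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

// NOTE: a real library constrains this operator to its own expression types;
// it is left unconstrained here only to keep the sketch short.
template <class L, class R>
Add<L, R> operator+(const L& l, const R& r) { return Add<L, R>{l, r}; }

int main()
{
    Vec3 a{{1.0, 2.0, 3.0}}, b{{4.0, 5.0, 6.0}}, c{{7.0, 8.0, 9.0}}, d{};
    d = a + b + c;   // no intermediate Vec3 temporaries; one loop on assignment
    std::cout << d[0] << ' ' << d[1] << ' ' << d[2] << '\n';   // prints 12 15 18
    return 0;
}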
I can think of a lot of reasons:
it hurts compile times. Longer compile times mean that testing any change you make to the code takes longer. It hurts productivity.
it's complex. Most likely, many developers on the team are not familiar with expression templates, and will have a hard time reading and debugging them.
Games often have to work on multiple platforms, with various compilers which may have a wide range of shortcomings, which might for example make advanced template trickery problematic.
It's generally not necessary. You can write efficient code without expression templates. It just gets more verbose, and you have to do more hand-holding for the compiler.
Game developers are extremely skeptical of anything that wasn't already used in games 10 years ago. It's not long ago that several major developers stuck to C: not because C++ wasn't good enough, but because it was "new". Game developers are conservative as hell.
And of course, the obvious question: where would they use expression templates? Is there enough complex math to really make it worthwhile? Games tend to rely on a fairly small number of linear algebra operations, which will typically be heavily hand-tuned in any case.
I want to add one more reason not stated in the above answers. Apologies if it is there and I missed it.
Adding templates to math based classes, such as a vec3 class, can change the meaning of operators and lead to functions that are invalid for some template types.
Take for instance,
vec3<int> myVec( 3, 5, 4 );
myVec.Normalize();
What would Normalize mean for an integer vector? With integer arithmetic, dividing each component of (3, 5, 4) by the length (about 7) truncates to zero, leaving the zero vector. All of a sudden, when we add templates to math constructs, we invalidate many existing functions, such as the one in the example above.
Also, another thing worth mentioning is that many math constructs are optimized for certain types because optimization is so important in games. GPUs are floating-point calculating machines. Doubles take up twice the space of floats and are quite a bit slower to compute with, even though they may seem like an obvious choice to a new game developer.
I hope this example makes sense. Templates are a great tool but math constructs in games are just not the right place to use them.
Typically, the parts of a game that are both performance-sensitive and math-heavy, and that still tend to run on the CPU rather than the GPU, apply the same basic operations to large numbers of elements. Some examples are animation blending, physics calculations, visibility tests, etc.
The best approach to optimizing these sorts of problems on current console hardware is generally to try and batch as much work together as possible and to aim for maximum data locality to avoid expensive cache misses. The actual math can then be optimized using SIMD intrinsics and will typically be carefully hand optimized. The kind of optimizations that expression templates give you can be performed relatively easily during that hand optimization phase but there are various other important optimizations that are also likely to be performed that expression templates won't give you. Often this critical code will have sections with custom optimizations for each target platform and won't be very portable.
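To make the batching point concrete, here is a rough sketch (my illustration, not code from any particular engine) of a structure-of-arrays update for one axis using SSE intrinsics, so it is x86-specific:

#include <cstddef>
#include <immintrin.h>

// Apply the same basic operation to many elements: position += velocity * dt,
// four floats at a time from contiguous structure-of-arrays buffers.
void integrate_x(float* px, const float* vx, std::size_t count, float dt)
{
    const __m128 vdt = _mm_set1_ps(dt);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128 p = _mm_loadu_ps(px + i);
        __m128 v = _mm_loadu_ps(vx + i);
        _mm_storeu_ps(px + i, _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    for (; i < count; ++i)   // scalar tail for the remaining elements
        px[i] += vx[i] * dt;
}

int main()
{
    float px[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float vx[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    integrate_x(px, vx, 8, 0.016f);
    return 0;
}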
I think the reason that expression templates aren't widely used is that they add software complexity (for all the reasons described by jalf) to non performance critical code that doesn't really warrant it while not covering all the optimizations that are necessary for the really performance critical code that shows up at the top of profiles.

What's a good technique to unit test digital audio generation?

I want to unit test a signal generator - let's say it generates a simple sine wave, or does frequency modulation of a signal onto a sine wave. It's easy enough to define sensible test parameters, and it's well known what the output should "look like" - but this is quite hard to test.
I could do (e.g.) a frequency analysis on the output and check that, check the maximum amplitude, etc., but a) this will make the test code significantly more complicated than the code it's testing, and b) it doesn't fully test the shape of the output.
Is there an established way to do this?
One way to do this would be to capture a "known good" output and compare bit-for-bit against that. As long as your algorithm is deterministic you should get the same output every time. You might have to recalibrate it occasionally if anything changes, but at least you'll know if it does change at all.
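A minimal sketch of that kind of known-good comparison; generate_sine() and the reference file name are hypothetical stand-ins, and the tolerance can be set to zero for a true bit-for-bit check if the reference is stored losslessly:

#include <cmath>
#include <cstdio>
#include <fstream>
#include <vector>

// Hypothetical generator under test: 'samples' points of a sine wave.
std::vector<double> generate_sine(double freq, double sample_rate, int samples)
{
    std::vector<double> out(samples);
    for (int n = 0; n < samples; ++n)
        out[n] = std::sin(2.0 * 3.14159265358979323846 * freq * n / sample_rate);
    return out;
}

int main()
{
    std::vector<double> current = generate_sine(440.0, 44100.0, 1024);

    // Previously captured "known good" output, one sample per line.
    std::ifstream ref("sine_440_reference.txt");
    if (!ref) { std::printf("missing reference file\n"); return 1; }

    double expected;
    int n = 0, failures = 0;
    while (ref >> expected && n < static_cast<int>(current.size())) {
        if (std::fabs(current[n] - expected) > 1e-12) ++failures;
        ++n;
    }
    std::printf("%d samples checked, %d mismatches\n", n, failures);
    return failures == 0 ? 0 : 1;
}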
This situation is a strong argument for a modeling tool like Matlab: to generate and review a well-understood test set automatically, as well as to provide an environment for automatic comparison and scoring. Especially where combinatorial explosions of test variations take place, automation makes it possible and straightforward to generate a huge dataset, locate problems, and pare back if needed to a representative qualification test set.
Often undervalued is the ability to generate a large, extensive set of tests exercising both the requirements and the limits of your design's implementation. Thinking about and designing those cases up front is also a huge advantage in producing a clean, problem-free system.
One possible semi-automated way of testing is to code up your signal generators from the spec using 3 different algorithms, or perhaps by 3 different programmers in 3 different programming languages. Then randomly generate parameters within the complete range of legal control input values, and capture and compare the outputs of all 3 generators to see if they agree within some error bound. You could also include some typical and some suspected worst-case parameters. If the outputs always agree, there's a much higher probability that everything works per spec than if they don't.

Comparison of performance between Scala etc. and C/C++/Fortran?

I wonder if there is any reliable comparison of performance between "modern" multithreading-specialized languages such as Scala and "classic" lower-level languages like C, C++ and Fortran using parallel libs like MPI, POSIX threads or OpenMP.
Any links and suggestions welcome.
Given that Java, and, therefore, Scala, can call external libraries, and given that those highly specialized external libraries will do most of the work, then the performance is the same as long as the same libraries are used.
Other than that, any such comparison is essentially meaningless. Scala code runs on a virtual machine which has run-time optimization. That optimization can push long-running programs towards greater performance than programs compiled with those other languages -- or not. It depends on the specific program written in each language.
Here's another non-answer: go to your local supercomputer centre and ask what fraction of the CPU load is used by each language you are interested in. This will only give you a proxy answer to your question; it will tell you what the people who are concerned with high performance on such machines use when tackling the kind of problems they tackle. But it's as instructive as any other answer you are likely to get for such a broad question.
PS The answer will be that Fortran, C and C++ consume well in excess of 95% of the CPU cycles.
I'd view such comparisons as a fraction. The numerator is a constant (around 0.00001, I believe). The denominator is the number of threads multiplied by the number of logical processors.
IOW, for a single thread, the comparison has about a one chance in a million of meaning something. For a quad core processor running an application with (say) 16 threads, you're down to one chance in 64 million of a meaningful result.
In short, there are undoubtedly quite a few people working on it, but the chances that even a single result from any of them will be useful and meaningful are still extremely low. Worse, even if one of them really did mean something, it would be almost impossible to find, and even more difficult to verify to the point that you actually knew it meant something.