About optimized math functions, ranges and intervals - c++

I'm trying to wrap my head around how people that code math functions for games and rendering engines can use an optimized math function in an efficient way; let me explain that further.
There is an high need for fast trigonometric functions in those fields, at times you can optimize a sin, a cos or other functions by rewriting them in a different form that is valid only for a given interval, often times this means that your approximation of f(x) is just for the first quadrant, meaning 0 <= x <= pi/2 .
Now the input for your f(x) is still about all 4 quadrants, but the real formula only covers 1/4 of that interval, the straightforward solution is to detect the quadrant by analyzing the input and see in which range it belongs to, then you adjust the result of the formula accordingly if the input comes from a quadrant that is not the first quadrant .
This is good in theory but this also presents a couple of really bad problems, especially considering the fact that you are doing all this to steal a couple of cycles from your CPU ( you also get a consistent implementation, that is not platform dependent like an hardcoded fsin in Intel x86 asm that only works on x86 and has a certain error range, all of this may differ on other platforms with other asm instructions ), so you should keep things working at a concurrent and high performance level .
The reason I can't wrap my head around the "switch case" with quadrants solution is:
it just prevents possible optimizations, namely memoization, considering that you usually want to put that switch-case inside the same functions that actually computes the f(x), probably the situation can be improved by implementing the formula for f(x) outside said function, but this is will lead to a doubling in the number of functions to maintain for any given math library
increase probability of more branching with a concurrent execution
generally speaking doesn't lead to better, clean, dry code, and conditional statements are often times a potential source of bugs, I don't really like switch-cases and similar things .
Assuming that I can implement my cross-platform f(x) in C or C++, how the programmers in this field usually address the problem of translating and mapping the inputs, the quadrants to the result via the actual implementation ?

Note: In the below answer I am speaking very generally about code.
Assuming that I can implement my cross-platform f(x) in C or C++, how the programmers in this field usually address the problem of translating and mapping the inputs, the quadrants to the result via the actual implementation ?
The general answer to this is: In the most obvious and simplest way possible that achieves your purpose.
I'm not entirely sure I follow most of your arguments/questions but I have a feeling you are looking for problems where really none exist. Do you truly have the need to re-implement the trigonometric functions? Don't fall into the trap of NIH (Not Invented Here).
the straightforward solution is to detect the quadrant
Yes! I love straightforward code. Code that is perfectly obvious at a glance what it does. Now, sometimes, just sometimes, you have to do some crazy things to get it to do what you want: for performance, or avoiding bugs out of your control. But the first version should be most obvious and simple code that solves your problem. From there you do testing, profiling, and benchmarking and if (only if) you find performance or other issues, then you go into the crazy stuff.
This is good in theory but this also presents a couple of really bad problems,
I would say that this is good in theory and in practice for most cases and I definitely don't see any "bad" problems. Minor issues in specific corner cases or design requirements at most.
A few things on a few of your specific comments:
approximation of f(x) is just for the first quadrant: Yes, and there are multiple reasons for this. One simply is that most trigonometric functions have identities so you can easily use these to reduce range of input parameters. This is important as many numerical techniques only work over a specific range of inputs, or are more accurate/performant for small inputs. Next, for very large inputs you'll have to trim the range anyways for most numerical techniques to work or at least work in a decent amount of time and have sufficient accuracy. For example, look at the Taylor expansion for cos() and see how long it takes to converge sufficiently for large vs small inputs.
it just prevents possible optimizations: Chances are your c++ compiler these days is way better at optimizations than you are. Sometimes it isn't but the general procedure is to let the compiler do its optimization and only do manual optimizations where you have measured and proven that you need it. Theses days it is very non-intuitive to tell what code is faster by just looking at it (you can read all the questions on SO about performance issues and how crazy some of the root causes are).
namely memoization: I've never seen memoization in place for a double function. Just think how many doubles are there between 0 and 1. Now in reduced accuracy situations you can take advantage of it but this is easily implemented as a custom function tailored for that exact situation. Thinking about it, I'm not exactly sure how to implement memoization for a double function that actually means anything and doesn't loose accuracy or performance in the process.
increase probability of more branching with a concurrent execution: I'm not sure I'd implement trigonometric functions in a concurrent manner but I suppose its entirely possible to get some performance benefits. But again, the compiler is generally better at optimizations than you so let it do its jobs and then benchmark/profile to see if you really need to do better.
doesn't lead to better, clean, dry code: I'm not sure what exactly you mean here, or what "dry code" is for that matter. Yes, sometimes you can get into trouble by too many or too complex if/switch blocks but I can't see a simple check for 4 quadrants apply here...it's a pretty basic and simple case.
So for any platform I get the same y for the same values of x: My guess is that getting "exact" values for all 53 bits of double across multiple platforms and systems is not going to be possible. What is the result if you only have 52 bits correct? This would be a great area to do some tests in and see what you get.
I've used trigonometric functions in C for over 20 years and 99% of the time I just use whatever built-in function is supplied. In the rare case I need more performance (or accuracy) as proven by testing or benchmarking, only then do I actually roll my own custom implementation for that specific case. I don't rewrite the entire gamut of <math.h> functions in the hope that one day I might need them.
I would suggest try coding a few of these functions in as many ways as you can find and do some accuracy and benchmark tests. This will give you some practical knowledge and give you some hard data on whether you actually need to reimplement these functions or not. At the very least this should give you some practical experience with implementing these types of functions and chances are answer a lot of your questions in the process.

Related

Fast gradient-descent implementation in a C++ library? [duplicate]

I'm looking to run a gradient descent optimization to minimize the cost of an instantiation of variables. My program is very computationally expensive, so I'm looking for a popular library with a fast implementation of GD. What is the recommended library/reference?
GSL is a great (and free) library that already implements common functions of mathematical and scientific interest.
You can peruse through the entire reference manual online. Poking around, this starts to look interesting, but I think we'd need to know more about the problem.
It sounds like you're fairly new to minimization methods. Whenever I need to learn a new set of numeric methods, I usually look in Numerical Recipes. It's a book that provides a nice overview of the most common methods in the field, their tradeoffs, and (importantly) where to look in the literature for more information. It's usually not where I stop, but it's often a helpful starting point.
For example, if your function is costly, then your goal is to minimization the number of evaluations to need to converge. If you have analytical expressions for the gradient, then a gradient-based method will probably work to your advantage, assuming that the function and its gradient are well-behaved (lack singularities) in the domain of interest.
If you don't have analytical gradients, then you're almost always better off using an approach like downhill simplex that only evaluates the function (not its gradients). Numerical gradients are expensive.
Also note that all of these approaches will converge to local minima, so they're fairly sensitive to the point at which you initially start the optimizer. Global optimization is a totally different beast.
As a final thought, almost all of the code you can find for minimization will be reasonably efficient. The real cost of minimization is in the cost function. You should spend time profiling and optimizing your cost function, and select an algorithm that will minimize the number of times you need to call it (methods like downhill simplex, conjugate gradient, and BFGS all shine on different kinds of problems).
In terms of actual code, you can find a lot of nice routines at NETLIB, in addition to the other libraries that have been mentioned. Most of the routines are in FORTRAN 77, but not all; to convert them to C, f2c is quite useful.
One of the best respected libraries for this kind of optimization work is the NAG libraries. These are used all over the world in universities and industry. They're available for C / FORTRAN. They're very non-free, and contain a lot more than just minimisation functions - A lot of general numerical mathematics is covered.
Anyway I suspect this library is overkill for what you need. But here are the parts pertaining to minimisation: Local Minimisation and Global Minimization.
Try CPLEX which is available for free for students.

Why don't games use expression templates for math?

I can imagine expression templates doing awful things to compile times for things as pervasive as vectors/matrices/quaternions etc, but if it is such a great speed boost why don't games use it? It's quite obvious that SIMD instructions can exploit data level parallelism to great effect. Expression templates and lazy evaluation together seem to make sense, at least when it comes to eliminating temporaries.
So while libraries like Eigen advertise such features, I don't see this done commonly in middleware (e.g. Havok) or games where things are extremely speed critical. Can anyone shed some light on this? Does it have to do with non-determinism or branch prediction?
I can think of a lot of reasons:
it hurts compile-times. Longer compile-times means that testing any change you made to the code takes longer. It hurts productivity.
it's complex. Most likely, many developers on the team are not familiar with expression templates, and will have a hard time reading and debugging them.
Games often have to work on multiple platforms, with various compilers which may have a wide range of shortcomings, which might for example make advanced template trickery problematic.
It's generally not necessary. You can write efficient code without expression templates. It just gets more verbose, and you have to do more hand-holding for the compiler.
Game developers are extremely skeptical of anything that wasn't already used in games 10 years ago. It's not long ago that several major developers stuck to C: not because C++ wasn't good enough, but because it was "new". Game developers are conservative as hell.
And of course, the obvious question: where would they use expression templates? Is there enough complex math to really make it worthwhile? Games tend to rely on a fairly small number of linear algebra operations, which will typically be heavily hand-tuned in any case.
I want to add one more reason not stated in the above answers. Apologies if it is and I missed it.
Adding templates to math based classes, such as a vec3 class, can change the meaning of operators and lead to functions that are invalid for some template types.
Take for instance,
vec3<int> myVec( 3, 5, 4 );
myVec.Normalize();
What would normalize mean to an integer vector? All of the sudden when we add templates to math constructs, we invalidate many existing functions, such as the example described above.
Also, another thing worth mentioning is that many math constructs are optimized with certain types because optimization is so important in games. GPU's are floating point calculating machines. Doubles take up double the space of floats and are quite a bit slower to compute with, even though it may seem like an obvious use case to a new game developer.
I hope this example makes sense. Templates are a great tool but math constructs in games are just not the right place to use them.
Typically the parts of a game that are both performance sensitive and math heavy and still tend to run on the CPU rather than the GPU are applying the same basic operations to large numbers of elements. Some examples are animation blending, physics calculations, visibility tests, etc.
The best approach to optimizing these sorts of problems on current console hardware is generally to try and batch as much work together as possible and to aim for maximum data locality to avoid expensive cache misses. The actual math can then be optimized using SIMD intrinsics and will typically be carefully hand optimized. The kind of optimizations that expression templates give you can be performed relatively easily during that hand optimization phase but there are various other important optimizations that are also likely to be performed that expression templates won't give you. Often this critical code will have sections with custom optimizations for each target platform and won't be very portable.
I think the reason that expression templates aren't widely used is that they add software complexity (for all the reasons described by jalf) to non performance critical code that doesn't really warrant it while not covering all the optimizations that are necessary for the really performance critical code that shows up at the top of profiles.

When should I use ASM calls?

I'm planning on writing a game with C++, and it will be extremely CPU-intensive (pathfinding,genetic algorithms, neural networks, ...)
So I've been thinking about how to tackle this situation best so that it would run smoothly.
(let this top section of this question be side information, I don't want it to restrict the main question, but it would be nice if you could give me side notes as well)
Is it worth it to learn how to work with ASM, so I can make ASM calls in C++,
can it give me a significant/notable performance advantage?
In what situations should I use it?
Almost never:
You only want to be using it once you've profiled your C++ code and have identified a particular section as a bottleneck.
And even then, you only want to do it once you've exhausted all C++ optimization options.
And even then, you only want to be using ASM for tight, inner loops.
And even then, it takes quite a lot of effort and skill to beat a C++ compiler on a modern platform.
If your not an experienced assembly programmer, I doubt you will be able to optimize assembly code better than your compiler.
Also note that assembly is not portable. If you decide to go this way, you will have to write different assembly for all the architectures you decide to support.
Short answer: it depends, most likely you won't need it.
Don't start optimizing prematurely. Write code that is also easy to read and to modify. Separate logical sections into modules. Write something that is easy to extend.
Do some profiling.
You can't tell where your bottlenecks are unless you profile your code. 99% of the time you won't get that much performance gain by writing asm. There's a high chance you might even worsen your performance. Optimizers nowadays are very good at what they do. If you do have a bottleneck, it will most probably be because of some poorly chosen algorithm or at least something that can be remedied at a high-level.
My suggestion is, even if you do learn asm, which is a good thing, don't do it just so you can optimize.
Profile profile profile....
A legitimate use case for going low-level (although sometimes a compiler can infer it for you) is to make use of SIMD instructions such as SSE. I would assume that at least some of the algorithms you mention will benefit from parallel processing.
However, you don't need to write actual assembly, instead you can simply use intrinsic functions. See, e.g. this.
Don't get ahead of yourself.
I've posted a sourceforge project showing how a simulation program was massively speeded up (over 700x).
This was not done by assuming in advance what needed to be made fast.
It was done by "profiling", which I put in quotes because the method I use is not to employ a profiler.
Rather I rely on random pausing, a method known and used to good effect by some programmers.
It proceeds through a series of iterations.
In each iteration a large source of time-consumption is identified and fixed, resulting in a certain speedup ratio.
As you proceed through multiple iterations, these speedup ratios multiply together (like compound interest).
That's how you get major speedup.
If, and only if, you get to a point where some code is taking a large fraction of time, and it doesn't contain any function calls, and you think you can write assembly code better than the compiler does, then go for it.
P.S. If you're wondering, the difference between using a profiler and random pausing is that profilers look for "bottlenecks", on the assumption that those are localized things. They look for routines or lines of code that are responsible for a large percent of overall time.
What they miss is problems that are diffuse.
For example, you could have 100 routines, each taking 1% of time.
That is, no bottlenecks.
However, there could be an activity being done within many or all of those routines, accounting for 1/3 of the time, that could be done better or not at all.
Random pausing will see that activity with a small number of samples, because you don't summarize, you examine the samples.
In other words, if you took 9 samples, on average you would notice the activity on 3 of them.
That tells you it's big.
So you can fix it and get your 3/2 speedup ratio.
"To understand recursion, you must first understand recursion." That quote comes to mind when I consider my response to your question, which is "until you understand when to use assembly, you should never use assembly." After you have completely implemented your soution, extensively profiled its performance and determined precise bottlenecks, and experimented with several alternative solutions, then you can begin to consider using assembly. If you code a single line of assembly before you have a working and extensively profiled program, you have made a mistake.
If you need to ask than you don't need it.

How to test scientific software?

I'm convinced that software testing indeed is very important, especially in science. However, over the last 6 years, I never have come across any scientific software project which was under regular tests (and most of them were not even version controlled).
Now I'm wondering how you deal with software tests for scientific codes (numerical computations).
From my point of view, standard unit tests often miss the point, since there is no exact result, so using assert(a == b) might prove a bit difficult due to "normal" numerical errors.
So I'm looking forward to reading your thoughts about this.
I am also in academia and I have written quantum mechanical simulation programs to be executed on our cluster. I made the same observation regarding testing or even version control. I was even worse: in my case I am using a C++ library for my simulations and the code I got from others was pure spaghetti code, no inheritance, not even functions.
I rewrote it and I also implemented some unit testing. You are correct that you have to deal with the numerical precision, which can be different depending on the architecture you are running on. Nevertheless, unit testing is possible, as long as you are taking these numerical rounding errors into account. Your result should not depend on the rounding of the numerical values, otherwise you would have a different problem with the robustness of your algorithm.
So, to conclude, I use unit testing for my scientific programs, and it really makes one more confident about the results, especially with regards to publishing the data in the end.
Just been looking at a similar issue (google: "testing scientific software") and came up with a few papers that may be of interest. These cover both the mundane coding errors and the bigger issues of knowing if the result is even right (depth of the Earth's mantle?)
http://http.icsi.berkeley.edu/ftp/pub/speech/papers/wikipapers/cox_harris_testing_numerical_software.pdf
http://www.cs.ua.edu/~SECSE09/Presentations/09_Hook.pdf (broken link; new link is http://www.se4science.org/workshops/secse09/Presentations/09_Hook.pdf)
http://www.associationforsoftwaretesting.org/?dl_name=DianeKellyRebeccaSanders_TheChallengeOfTestingScientificSoftware_paper.pdf
I thought the idea of mutation testing described in 09_Hook.pdf (see also matmute.sourceforge.net) is particularly interesting as it mimics the simple mistakes we all make. The hardest part is to learn to use statistical analysis for confidence levels, rather than single pass code reviews (man or machine).
The problem is not new. I'm sure I have an original copy of "How accurate is scientific software?" by Hatton et al Oct 1994, that even then showed how different implementations of the same theories (as algorithms) diverged rather rapidly (It's also ref 8 in Kelly & Sanders paper)
--- (Oct 2019)
More recently Testing Scientific Software: A Systematic Literature Review
I'm also using cpptest for its TEST_ASSERT_DELTA. I'm writing high-performance numerical programs in computational electromagnetics and I've been happily using it in my C++ programs.
I typically go about testing scientific code the same way as I do with any other kind of code, with only a few retouches, namely:
I always test my numerical codes for cases that make no physical sense and make sure the computation actually stops before producing a result. I learned this the hard way: I had a function that was computing some frequency responses, then supplied a matrix built with them to another function as arguments which eventually gave its answer a single vector. The matrix could have been any size depending on how many terminals the signal was applied to, but my function was not checking if the matrix size was consistent with the number of terminals (2 terminals should have meant a 2 x 2 x n matrix); however, the code itself was wrapped so as not to depend on that, it didn't care what size the matrices were since it just had to do some basic matrix operations on them. Eventually, the results were perfectly plausible, well within the expected range and, in fact, partially correct -- only half of the solution vector was garbled. It took me a while to figure. If your data looks correct, it's assembled in a valid data structure and the numerical values are good (e.g. no NaNs or negative number of particles) but it doesn't make physical sense, the function has to fail gracefully.
I always test the I/O routines even if they are just reading a bunch of comma-separated numbers from a test file. When you're writing code that does twisted math, it's always tempting to jump into debugging the part of the code that is so math-heavy that you need a caffeine jolt just to understand the symbols. Days later, you realize you are also adding the ASCII value of \n to your list of points.
When testing for a mathematical relation, I always test it "by the book", and I also learned this by example. I've seen code that was supposed to compare two vectors but only checked for equality of elements and did not check for equality of length.
Please take a look at the answers to the SO question How to use TDD correctly to implement a numerical method?

Fast exponentiation: real^real (C++ MinGW, Code::Blocks)

I am writing an application where in a certain block I need to exponentiate reals around 3*500*500 times. When I use the exp(y*log(x)) algorithm, the program noticeably lags. It is significantly faster if I use another algorithm based on playing with data types, but that algorithm isn't very precise, although provides decent results for the simulation, and it's still not perfect in terms of speed.
Is there any precise exponentiation algorithm for real powers faster than exp(y*log(x))?
Thank you in advance.
If you need good accuracy, and you don't know anything about the distribution of bases (x values) a priori, then pow(x, y) is the best portable answer (on many -- not all -- platforms, this will be faster than exp(y*log(x)), and is also better behaved numerically). If you do know something about what ranges x and y can lie in, and with what distribution, that would be a big help for people trying to offer advice.
The usual way to do it faster while keeping good accuracy is to use a library routine designed to do many of these computations simultaneously for an array of x values and an array of y values. The catch is that such library implementations tend to cost money (like Intel's MKL) or be platform-specific (vvpowf in the Accelerate.framework on OS X, for example). I don't know much about MinGW, so someone else will need to tell you what's available there. The GSL may have something along these lines.
Depending on your algorithm (in particular if you have few to no additions), sometimes you can get away with working (at least partially) in log-space. You've probably already considered this, but if your intermediate representation is log_x and log_y then log(x^y) = exp(log_y) * log_x, which will be faster. If you can even be selective about it, then obviously computing log(x^y) as y * log_x is even cheaper. If you can avoid even just a few exponentiations, you may win a lot of performance. If there's any way to rewrite whatever loops you have to get the exponentiation operations outside of the innermost loop, that's a fairly certain performance win.