I wanted to calculate the angle between two vectors but I have seen these inverse trig operations such as acos and atan uses lots of cpu cycles. Is there a way where I can get this calculation done without using these functions? Also, does these really hit you when you in your optimization?
There are no such functions in the STL; those are in the math library.
Also, are you sure it's important to be efficient here? Have you profiled to see if there's function calls like this in the hot spots? Do you know that the performance is bad when using these functions? You should always answer these questions before diving into such microoptimizations.
In order to give advice, what are you going to do with it? How accurate does it have to be?
If you need the actual angle to a high precision, you probably can't do better. If you need it for some comparison, you can use absolute values and the dot product to get the cosine of the angle. If you don't need precision, you can do that and use an acos lookup table. If you're using it as input for another calculation, you might be able to use a little geometry or maybe a trig identity to avoid having to find an arccosine or arctangent.
In any case, once you've done what optimization you're going to do, do before and after timing runs to see if you've made any significant difference.
This is totally implementation defined. Of course, you could use a third-paty implementation, or an approximation, but first you should profile and determine what your bottlenecks are.
If these functions are indeed the bottleneck, and you only need an approximation, you can try using the few first terms of the Taylor series expansion of those functions. The magnitude of the next unused term represents the error in your approximation.
Arccos Taylor series
Arctan Taylor series
The implementations of atan and acos depend on the compiler and the optimization settings. Many implementations will use a table and interpolate to get the nearest value.
Try these things first:
Profile the application to find the
where most of the execution time is
spent.
Redesign this area for better
performance.
Consider Data Driven Design
techniques to speed up your program.
Change logic to reduce branches and
if statements, consider using
Karnaugh maps to simplify the
logic.
Related
I'm trying to wrap my head around how people that code math functions for games and rendering engines can use an optimized math function in an efficient way; let me explain that further.
There is an high need for fast trigonometric functions in those fields, at times you can optimize a sin, a cos or other functions by rewriting them in a different form that is valid only for a given interval, often times this means that your approximation of f(x) is just for the first quadrant, meaning 0 <= x <= pi/2 .
Now the input for your f(x) is still about all 4 quadrants, but the real formula only covers 1/4 of that interval, the straightforward solution is to detect the quadrant by analyzing the input and see in which range it belongs to, then you adjust the result of the formula accordingly if the input comes from a quadrant that is not the first quadrant .
This is good in theory but this also presents a couple of really bad problems, especially considering the fact that you are doing all this to steal a couple of cycles from your CPU ( you also get a consistent implementation, that is not platform dependent like an hardcoded fsin in Intel x86 asm that only works on x86 and has a certain error range, all of this may differ on other platforms with other asm instructions ), so you should keep things working at a concurrent and high performance level .
The reason I can't wrap my head around the "switch case" with quadrants solution is:
it just prevents possible optimizations, namely memoization, considering that you usually want to put that switch-case inside the same functions that actually computes the f(x), probably the situation can be improved by implementing the formula for f(x) outside said function, but this is will lead to a doubling in the number of functions to maintain for any given math library
increase probability of more branching with a concurrent execution
generally speaking doesn't lead to better, clean, dry code, and conditional statements are often times a potential source of bugs, I don't really like switch-cases and similar things .
Assuming that I can implement my cross-platform f(x) in C or C++, how the programmers in this field usually address the problem of translating and mapping the inputs, the quadrants to the result via the actual implementation ?
Note: In the below answer I am speaking very generally about code.
Assuming that I can implement my cross-platform f(x) in C or C++, how the programmers in this field usually address the problem of translating and mapping the inputs, the quadrants to the result via the actual implementation ?
The general answer to this is: In the most obvious and simplest way possible that achieves your purpose.
I'm not entirely sure I follow most of your arguments/questions but I have a feeling you are looking for problems where really none exist. Do you truly have the need to re-implement the trigonometric functions? Don't fall into the trap of NIH (Not Invented Here).
the straightforward solution is to detect the quadrant
Yes! I love straightforward code. Code that is perfectly obvious at a glance what it does. Now, sometimes, just sometimes, you have to do some crazy things to get it to do what you want: for performance, or avoiding bugs out of your control. But the first version should be most obvious and simple code that solves your problem. From there you do testing, profiling, and benchmarking and if (only if) you find performance or other issues, then you go into the crazy stuff.
This is good in theory but this also presents a couple of really bad problems,
I would say that this is good in theory and in practice for most cases and I definitely don't see any "bad" problems. Minor issues in specific corner cases or design requirements at most.
A few things on a few of your specific comments:
approximation of f(x) is just for the first quadrant: Yes, and there are multiple reasons for this. One simply is that most trigonometric functions have identities so you can easily use these to reduce range of input parameters. This is important as many numerical techniques only work over a specific range of inputs, or are more accurate/performant for small inputs. Next, for very large inputs you'll have to trim the range anyways for most numerical techniques to work or at least work in a decent amount of time and have sufficient accuracy. For example, look at the Taylor expansion for cos() and see how long it takes to converge sufficiently for large vs small inputs.
it just prevents possible optimizations: Chances are your c++ compiler these days is way better at optimizations than you are. Sometimes it isn't but the general procedure is to let the compiler do its optimization and only do manual optimizations where you have measured and proven that you need it. Theses days it is very non-intuitive to tell what code is faster by just looking at it (you can read all the questions on SO about performance issues and how crazy some of the root causes are).
namely memoization: I've never seen memoization in place for a double function. Just think how many doubles are there between 0 and 1. Now in reduced accuracy situations you can take advantage of it but this is easily implemented as a custom function tailored for that exact situation. Thinking about it, I'm not exactly sure how to implement memoization for a double function that actually means anything and doesn't loose accuracy or performance in the process.
increase probability of more branching with a concurrent execution: I'm not sure I'd implement trigonometric functions in a concurrent manner but I suppose its entirely possible to get some performance benefits. But again, the compiler is generally better at optimizations than you so let it do its jobs and then benchmark/profile to see if you really need to do better.
doesn't lead to better, clean, dry code: I'm not sure what exactly you mean here, or what "dry code" is for that matter. Yes, sometimes you can get into trouble by too many or too complex if/switch blocks but I can't see a simple check for 4 quadrants apply here...it's a pretty basic and simple case.
So for any platform I get the same y for the same values of x: My guess is that getting "exact" values for all 53 bits of double across multiple platforms and systems is not going to be possible. What is the result if you only have 52 bits correct? This would be a great area to do some tests in and see what you get.
I've used trigonometric functions in C for over 20 years and 99% of the time I just use whatever built-in function is supplied. In the rare case I need more performance (or accuracy) as proven by testing or benchmarking, only then do I actually roll my own custom implementation for that specific case. I don't rewrite the entire gamut of <math.h> functions in the hope that one day I might need them.
I would suggest try coding a few of these functions in as many ways as you can find and do some accuracy and benchmark tests. This will give you some practical knowledge and give you some hard data on whether you actually need to reimplement these functions or not. At the very least this should give you some practical experience with implementing these types of functions and chances are answer a lot of your questions in the process.
Unfortunately, the standard C++ library doesn't have a single call for sincos, which gives a place for this question.
First question:
If I want to calculate a sin and a cos, is it cheaper to calculate a sin and a cos, or calculate a sin then a sqrt(1-sin^2) to get the cos?
Second question:
The intel math kernel library provides very good functions for standard mathematical functions calculations, so a function vdSinCos() exists to solve the problem in a very optimized way, but the intel compiler isn't for free. Is there any open source library (C, C++, Fortran) available in linux distributions that would have those functions where I can simply link to them and have the optimum implementations?
Note: I wouldn't like to go into instruction calls, since they're not supported by all CPUs. I would like to link to a general library that would do the job for me on any CPU.
Thanks.
The GNU C library has a sincos() function, which will take advantage of the "FSINCOS" instruction which most modern instruction sets have. I'd say that's your best bet; it should be just as fast as the Intel library method.
If you don't do that, I'd go with the "sqrt(1-sin(x)^2)" route. In every processor architecture document I've looked at so far, the FSQRT instruction is significantly faster than the FSIN function.
The answer to almost every performance problem is "why don't you measure it in your code", because there are a large number of different factors that affect the performance of almost any calculation like this. For example, "Who produces the math functions". Square root is relatively simple to calculate, but I'm not convinced it's a huge difference between sqrt(1-sin*sin) and calculating cos again. What processor may also be a factor, and what other calculations are done "around" the sin/cos calculations.
I wouldn't be surprised if there is a library around somewhere that has this sort of function, but I haven't been looking.
If precision is not critical the fastest way to get sin or cos is to use tables.
Hold some global const array with sin and cos values for all agles with a step you need. So your sin/cos function just need to cast angle to index and you get the result.
Using double type I made Cubic Spline Interpolation Algorithm.
That work was success as it seems, but there was a relative error around 6% when very small values calculated.
Is double data type enough for accurate scientific numerical analysis?
Double has plenty of precision for most applications. Of course it is finite, but it's always possible to squander any amount of precision by using a bad algorithm. In fact, that should be your first suspect. Look hard at your code and see if you're doing something that lets rounding errors accumulate quicker than necessary, or risky things like subtracting values that are very close to each other.
Scientific numerical analysis is difficult to get right which is why I leave it the professionals. Have you considered using a numeric library instead of writing your own? Eigen is my current favorite here: http://eigen.tuxfamily.org/index.php?title=Main_Page
I always have close at hand the latest copy of Numerical Recipes (nr.com) which does have an excellent chapter on interpolation. NR has a restrictive license but the writers know what they are doing and provide a succinct writeup on each numerical technique. Other libraries to look at include: ATLAS and GNU Scientific Library.
To answer your question double should be more than enough for most scientific applications, I agree with the previous posters it should like an algorithm problem. Have you considered posting the code for the algorithm you are using?
If double is enough for your needs depends on the type of numbers you are working with. As Henning suggests, it is probably best to take a look at the algorithms you are using and make sure they are numerically stable.
For starters, here's a good algorithm for addition: Kahan summation algorithm.
Double precision will be mostly suitable for any problem but the cubic spline will not work well if the polynomial or function is quickly oscillating or repeating or of quite high dimension.
In this case it can be better to use Legendre Polynomials since they handle variants of exponentials.
By way of a simple example if you use, Euler, Trapezoidal or Simpson's rule for interpolating within a 3rd order polynomial you won't need a huge sample rate to get the interpolant (area under the curve). However, if you apply these to an exponential function the sample rate may need to greatly increase to avoid loosing a lot of precision. Legendre Polynomials can cater for this case much more readily.
I have nonlinear equations such as:
Y = f1(X)
Y = f2(X)
...
Y = fn(X)
In general, they don't have exact solution, therefore I use Newton's method to solve them. Method is iteration based and I'm looking for way to optimize calculations.
What are the ways to minimize calculation time? Avoid calculation of square roots or other math functions?
Maybe I should use assembly inside C++ code (solution is written in C++)?
A popular approach for nonlinear least squares problems is the Levenberg-Marquardt algorithm. It's kind of a blend between Gauss-Newton and a Gradient-Descent method. It combines the best of both worlds (navigates well the search space for for ill-posed problems and converges quickly). But there's lots of wiggle room in terms of the implementation. For example, if the square matrix J^T J (where J is the Jacobian matrix containing all derivatives for all equations) is sparse you could use the iterative CG algorithm to solve the equation systems quickly instead of a direct method like a Cholesky factorization of J^T J or a QR decomposition of J.
But don't just assume that some part is slow and needs to be written in assembler. Assembler is the last thing to consider. Before you go that route you should always use a profiler to check where the bottlenecks are.
Are you talking about a number of single parameter functions to solve one at a time or a system of multi-parameter equations to solve together?
If the former, then I've often found that a finding a better initial approximation (from where the Newton-Raphson loop starts) can save more execution time than polishing the loop itself, because convergence in the loop can be slow initially but is fast later. If you know nothing about the functions then finding a decent initial approximation is hard, but it might be worth trying a few secant iterations first. You might also want to look at Brent's method
Consider using Rational Root Test in parallel. If impossible to use values of absolute precision then use closest to zero results as the best fit to continue by Newton method.
Once single root found, you may decrease the equation grade by dividing it with monom (x-root).
Dividing and rational root test are implemented here https://github.com/ohhmm/openmind/blob/sh/omnn/math/test/Sum_test.cpp#L260
How would one go about implementing least squares regression for factor analysis in C/C++?
the gold standard for this is LAPACK. you want, in particular, xGELS.
When I've had to deal with large datasets and large parameter sets for non-linear parameter fitting I used a combination of RANSAC and Levenberg-Marquardt. I'm talking thousands of parameters with tens of thousands of data-points.
RANSAC is a robust algorithm for minimizing noise due to outliers by using a reduced data set. Its not strictly Least Squares, but can be applied to many fitting methods.
Levenberg-Marquardt is an efficient way to solve non-linear least-squares numerically.
The convergence rate in most cases is between that of steepest-descent and Newton's method, without requiring the calculation of second derivatives. I've found it to be faster than Conjugate gradient in the cases I've examined.
The way I did this was to set up the RANSAC an outer loop around the LM method. This is very robust but slow. If you don't need the additional robustness you can just use LM.
Get ROOT and use TGraph::Fit() (or TGraphErrors::Fit())?
Big, heavy piece of software to install just of for the fitter, though. Works for me because I already have it installed.
Or use GSL.
If you want to implement an optimization algorithm by yourself Levenberg-Marquard seems to be quite difficult to implement. If really fast convergence is not needed, take a look at the Nelder-Mead simplex optimization algorithm. It can be implemented from scratch in at few hours.
http://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
Have a look at
http://www.alglib.net/optimization/
They have C++ implementations for L-BFGS and Levenberg-Marquardt.
You only need to work out the first derivative of your objective function to use these two algorithms.
I've used TNT/JAMA for linear least-squares estimation. It's not very sophisticated but is fairly quick + easy.
Lets talk first about factor analysis since most of the discussion above is about regression. Most of my experience is with software like SAS, Minitab, or SPSS, that solves the factor analysis equations, so I have limited experience in solving these directly. That said, that the most common implementations do not use linear regression to solve the equations. According to this, the most common methods used are principal component analysis and principal factor analysis. In a text on Applied Multivariate Analysis (Dallas Johnson), no less that seven methods are documented each with their own pros and cons. I would strongly recommend finding an implementation that gives you factor scores rather than programming a solution from scratch.
The reason why there's different methods is that you can choose exactly what you're trying to minimize. There a pretty comprehensive discussion of the breadth of methods here.