Does calculating Sqrt(x) as x * InvSqrt(x) make any sense in the Doom 3 BFG code? - c++

I browsed through the recently released Doom 3 BFG source code, when I came upon something that does not appear to make any sense. Doom 3 wraps mathematical functions in the idMath class. Some of the functions just foward to the corresponding functions from math.h, but some are reimplementations (e.g. idMath::exp16()) that I assume have a higher performance than their math.h counterparts (maybe at the expense of precision).
What baffles me, however, is the way they have implemented the float idMath::Sqrt(float x) function:
ID_INLINE float idMath::InvSqrt( float x ) {
return ( x > FLT_SMALLEST_NON_DENORMAL ) ? sqrtf( 1.0f / x ) : INFINITY;
}
ID_INLINE float idMath::Sqrt( float x ) {
return ( x >= 0.0f ) ? x * InvSqrt( x ) : 0.0f;
}
This appears to perform two unnecessary floating point operations: First a division and then a multiplication.
It is interesting to note that the original Doom 3 source code also implemented the square root function in this way, but the inverse square root uses the fast inverse square root algorithm.
ID_INLINE float idMath::InvSqrt( float x ) {
dword a = ((union _flint*)(&x))->i;
union _flint seed;
assert( initialized );
double y = x * 0.5f;
seed.i = (( ( (3*EXP_BIAS-1) - ( (a >> EXP_POS) & 0xFF) ) >> 1)<<EXP_POS) | iSqrt[(a >> (EXP_POS-LOOKUP_BITS)) & LOOKUP_MASK];
double r = seed.f;
r = r * ( 1.5f - r * r * y );
r = r * ( 1.5f - r * r * y );
return (float) r;
}
ID_INLINE float idMath::Sqrt( float x ) {
return x * InvSqrt( x );
}
Do you see any advantage in calculating Sqrt(x) as x * InvSqrt(x) if InvSqrt(x) internally just calls math.h's fsqrt(1.f/x)? Am I maybe missing something important about denormalized floating point numbers here or is this just sloppiness on id software's part?

I can see two reasons for doing it this way: firstly, the "fast invSqrt" method (really Newton Raphson) is now the method used in a lot of hardware, so this approach leaves open the possibility of taking advantage of such hardware (and doing potentially four or more such operations at once). This article discusses it a little bit:
How slow (how many cycles) is calculating a square root?
The second reason is for compatibility. If you change the code path for calculating square roots, you may get different results (especially for zeroes, NaNs, etc.), and lose compatibility with code that depended on the old system.

As far as I know, the InvSqrt is used to compute colors in the sense that color depends on the angle from which light bounces off a surface, which gives you some function using the inverse of the square root.
In their case, they don't need huge precision when computing these numbers, so the engineers behind Doom 3's code (originally from Quake III) came up with a very very very fast method of computing an approximation for InvSqrt using only several Newton-Raphson's iterations.
This is why they use InvSqrt in all their code, instead of using built-in (slower) functions. I guess the use of x * InvSqrt(x) is there to avoid multiplying work by two (by having two very efficient functions, one for InvSqrt and another for Sqrt).
You should read this article, it might shed some light on this issue.

When code has been modified by multiple people, it becomes hard to answer questions about why it has its current form, especially without revision history.
However, given a third of a century of programming experience, this code fits the pattern others have mentioned: At one time, InvSqrt was fast, and it made sense to use it to compute the square root. Then InvSqrt changed, and nobody updated Sqrt.

It is also possible that they came across a relatively naive version of sqrtf which was notably slower for bigger numbers.

Related

Calculation sine and cosine in one shot

I have a scientific code that uses both sine and cosine of the same argument (I basically need the complex exponential of that argument). I was wondering if it were possible to do this faster than calling sine and cosine functions separately.
Also I only need about 0.1% precision. So is there any way I can find the default trig functions and truncate the power series for speed?
One other thing I have in mind is, is there any way to perform the remainder operation such that the result is always positive? In my own algorithm I used x=fmod(x,2*pi); but then I would need to add 2pi if x is negative (smaller domain means I can use a shorter power series)
EDIT: LUT turned out to be the best approach for this, however I am glad I learned about other approximation techniques. I will also advise using an explicit midpoint approximation. This is what I ended up doing:
const int N = 10000;//about 3e-4 error for 1000//3e-5 for 10 000//3e-6 for 100 000
double *cs = new double[N];
double *sn = new double[N];
for(int i =0;i<N;i++){
double A= (i+0.5)*2*pi/N;
cs[i]=cos(A);
sn[i]=sin(A);
}
The following part approximates (midpoint) sincos(2*pi*(wc2+t[j]*(cotp*t[j]-wc)))
double A=(wc2+t[j]*(cotp*t[j]-wc));
int B =(int)N*(A-floor(A));
re += cs[B]*f[j];
im += sn[B]*f[j];
Another approach could have been using the chebyshev decomposition. You can use the orthogonality property to find the coefficients. Optimized for exponential, it looks like this:
double fastsin(double x){
x=x-floor(x/2/pi)*2*pi-pi;//this line can be improved, both inside this
//function and before you input it into the function
double x2 = x*x;
return (((0.00015025063885163012*x2-
0.008034350857376128)*x2+ 0.1659789684145034)*x2-0.9995812174943602)*x;} //7th order chebyshev approx
If you seek fast evaluation with good (but not high) accuracy with powerseries you should use an expansion in Chebyshev polynomials: tabulate the coefficients (you'll need VERY few for 0.1% accuracy) and evaluate the expansion with the recursion relations for these polynomials (it's really very easy).
References:
Tabulated coefficients: http://www.ams.org/mcom/1980-34-149/S0025-5718-1980-0551302-5/S0025-5718-1980-0551302-5.pdf
Evaluation of chebyshev expansion: https://en.wikipedia.org/wiki/Chebyshev_polynomials
You'll need to (a) get the "reduced" argument in the range -pi/2..+pi/2 and consequently then (b) handle the sign in your results when the argument actually should have been in the "other" half of the full elementary interval -pi..+pi. These aspects should not pose a major problem:
determine (and "remember" as an integer 1 or -1) the sign in the original angle and proceed with the absolute value.
use a modulo function to reduce to the interval 0..2PI
Determine (and "remember" as an integer 1 or -1) whether it is in the "second" half and, if so, subtract pi*3/2, otherwise subtract pi/2. Note: this effectively interchanges sine and cosine (apart from signs); take this into account in the final evaluation.
This completes the step to get an angle in -pi/2..+pi/2
After evaluating sine and cosine with the Cheb-expansions, apply the "flags" of steps 1 and 3 above to get the right signs in the values.
Just create a lookup table. The following will let you lookup the sin and cos of any radian value between -2PI and 2PI.
// LOOK UP TABLE
var LUT_SIN_COS = [];
var N = 14400;
var HALF_N = N >> 1;
var STEP = 4 * Math.PI / N;
var INV_STEP = 1 / STEP;
// BUILD LUT
for(var i=0, r = -2*Math.PI; i < N; i++, r += STEP) {
LUT_SIN_COS[2*i] = Math.sin(r);
LUT_SIN_COS[2*i + 1] = Math.cos(r);
}
You index into the lookup table by:
var index = ((r * INV_STEP) + HALF_N) << 1;
var sin = LUT_SIN_COS[index];
var cos = LUT_SIN_COS[index + 1];
Here's a fiddle that displays the % error you can expect from different sized LUTS http://jsfiddle.net/77h6tvhj/
EDIT Here's an ideone (c++) with a ~benchmark~ vs the float sin and cos. http://ideone.com/SGrFVG For whatever a benchmark on ideone.com is worth the LUT is 5 times faster.
One way to go would be to learn how to implement the CORDIC algorithm. It is not difficult and pretty interesting intelectually. This gives you both the cosine and the sine. Wikipedia gives a MATLAB example that should be easy to adapt in C++.
Note that you can augment speed and reduce precision simply by lowering the parameter n.
About your second question, it has already been asked here (in C). It seems that there is no simple way.
You can also calculate sine using a square root, given the angle and the cosine.
The example below assumes the angle ranges from 0 to 2π:
double c = cos(angle);
double s = sqrt(1.0-c*c);
if(angle>pi)s=-s;
For single-precision floats, Microsoft uses 11-degree polynomial approximation for sine, 10-degree for cosine: XMScalarSinCos.
They also have faster version, XMScalarSinCosEst, that uses lower-degree polynomials.
If you aren’t on Windows, you’ll find same code + coefficients on geometrictools.com under Boost license.

Best method for scaling a vector to desired length

The question is about the most robust and the fastest implementation about this rather basic operation:
Given a vector (X,Y) compute the collinear vector of the given length desiredLength. There are at least two methods for that:
One. Find the length of (X,Y) and rescale accordingly:
double currentLength = sqrt(X*X + Y*Y);
if(currentLength == 0) { /* Aye, Caramba! */ }
double factor = desiredLength / currentLength;
X *= factor;
Y *= factor;
Two. Find the direction of (X,Y) and form a vector of desiredLength in that direction:
if(X == 0 && Y == 0) { /* Aye, Caramba! */ }
double angle = atan2(Y, X);
X = desiredLength * cos(angle);
Y = desiredLength * sin(angle);
Which method would be preferable for developing robust app, better numerical stability, faster execution, etc.?
There's no one right answer, since it will depend on the implementation.
However: on any reasonable modern implementation, the four basic
operations and sqrt will be exact to the last binary digit. From a
quality of implementation point of view, one would hope that the same
thing would be true for all of the functions in math.h, but it's less
certain. On a machine with IEEE arithmetic (Windows and the mainstream
Unix platforms), the four operations and sqrt will be implemented in
hardware, where as the trigonomic operations will generally require a
software implementation, often requiring tens of more basic operations.
Although some floating point processors do support them directly, at
least over limited ranges, even then, they are often a magnitude slower
than the four basic operations.
All of which speaks in favor of your first implementation, at least with
regards to speed (and probably with regards to numeric stability as
well).
I would expect method one to be better At least on the performance front as doing sqrt + 2 multiplications should be cheaper than 3 trig operations.
I would guess it is also better (or not worse) on the other fronts as well since it involves one approximation (sqrt) instead of 2 (per axis). The sqrt approximation is also "shared" by both x and y.
Thu shalt better use hypot(x,y) rather than sqrt(x*x+y*y) because a reasonnable implementation of hypot can save you from underflow/overflow conditions.
Examples: hypot(1.0e300,1.0e300) or hypot(1.0e-300,1.0e-300)
Then evaluating x/hypot(x,y) is safe, even in case of gradual underflow (denormalized numbers) like x=1.0e-320, y=0, while evaluating desiredLength/hypot(x,y) might well overflow.
So I would write
double h = hypot(x,y);
double xd = desiredLength*(x/h);
double yd = desiredLength*(y/h);
You'll get some division by zero exception and nan results if both x,y are zero, so don't bother handling it in a if.

Create a Fast Sin() function to improve fps ? Fast sin() function?

I am rendering 500x500 points in real-time.
I have to compute the position of points using atan() and sin() functions. By using atan() and sin() I am getting 24 fps (frames per second).
float thetaC = atan(value);
float h = (value) / (sin(thetaC)));
If I don't use sin() I am getting 52 fps.
and if I dont use atan() I am 30 fps.
So, the big problem is with sin(). How can I use Fast Sin version. Can I create a Look Up Table for that ? I don't have any specific values to create LUT. what can I do in this situation ?
PS: I have also tried fast sin function of ASM but not getting any difference.
Thanks.
Hang on a second....
You have a triangle, you're computing the hypoteneuse. First, you're taking atan(value) to get the angle, and then using value again with sin to compute h. So we have the scenario where one side of the triangle is 1:
/|
h / | value
/ |
/C__|
1
All you really need to do is calculate h = sqrt(value*value + 1); ... But then, sqrt isn't the fastest function around either.
Perhaps I've missed the point or you've left something out. I've always used lookup tables for sin and cos, and found them to be fast. If you don't know the values ahead of time then you need to approximate, but this means a multiplication, truncation to integer (and possibly sign conversion) in order to get the array index. If you can convert your units to work in integers (effectively making your floats into fixed-point), it makes the lookup even quicker.
It depends on the accuracy that you need. The maximum derivative of sin is 1, so if if x1 and x2 are within epsilon of one another, then sin(x1) and sin(x2) are also within epsilon. If you just need accuracy to within, say 0.001, then you can create a lookup table of 1000 * PI = 3142 points, and just look up the value closest to the one you need. This can be faster than what the native code does, since the native code (probably) uses a lookup table for polynomial coefficients, and then interpolates, and since this table can be small enough to stay in cache easily.
If you need complete accuracy over the whole range, then there's probably nothing better that you can do.
If you wanted, you could also create a lookup table over (1/sin(x)), since that's your actual function of interest. Either way, you'll want to be careful around sin(x) = 0, since a small error in sin(x) can cause a big error in 1/sin(x). Defining your error tolerance is important for figuring out what shortcuts you can take.
You'd create the lookup table with something like:
float *table = malloc(1000 * sizeof(float));
for(int i = 0; i < 1000; i++){
table[i] = sin(i/1000.0);
}
and would access it something like
float fastSin(float x){
int index = x * 1000.0;
return table[index];
}
This code isn't complete (and will crash for anything outside of 0 < x < 1, because of array bounds), but should get you started.
For sin (but not atan) you can actually get simpler than a table--just create
float sin_arr[31416]; //Or as much precision as you need
for (int i=0; i<31416; ++i)
sin_arr[i] = sin( i / 10000.0 );
//...
float h = (value) / sin_arr[ (int)(thetaC*10000.0) % 31416 ];
My guess is that this will give you a speed improvement.

Using the gaussian probability density function in C++

First, is this the correct C++ representation of the pdf gaussian function ?
float pdf_gaussian = ( 1 / ( s * sqrt(2*M_PI) ) ) * exp( -0.5 * pow( (x-m)/s, 2.0 ) );
Second, does it make sense of we do something like this ?
if(pdf_gaussian < uniform_random())
do something
else
do other thing
EDIT: An example of what exactly are you trying to achieve:
Say I have a data called Y1. Then a new data called Xi arrive. I want to see if I should associated Xi to Y1 or if I should keep Xi as a new data data that will be called Y2. This is based on the distance between the new data Xi and the existing data Y1. If Xi is "far" from Y1 then Xi will not be associated to Y1, otherwise if it is "not far", it will be associated to Y1. Now I want to model this "far" or "not far" using a gaussian probability based on the mean and stdeviation of distances between Y and the data that where already associated to Y in the past.
Technically,
float pdf_gaussian = ( 1 / ( s * sqrt(2*M_PI) ) ) * exp( -0.5 * pow( (x-m)/s, 2.0 ) );
is not incorrect, but can be improved.
First, 1 / sqrt(2 Pi) can be precomputed, and using pow with integers is not a good idea: it may use exp(2 * log x) or a routine specialized for floating point exponents instead of simply x * x.
Example better code:
float normal_pdf(float x, float m, float s)
{
static const float inv_sqrt_2pi = 0.3989422804014327;
float a = (x - m) / s;
return inv_sqrt_2pi / s * std::exp(-0.5f * a * a);
}
You may want to make this a template instead of using float:
template <typename T>
T normal_pdf(T x, T m, T s)
{
static const T inv_sqrt_2pi = 0.3989422804014327;
T a = (x - m) / s;
return inv_sqrt_2pi / s * std::exp(-T(0.5) * a * a);
}
this allows you to use normal_pdf on double arguments also (it is not that much more generic though). There are caveats with the last code, namely that you have to beware not using it with integers (there are workarounds, but this makes the routine more verbose).
yes. boost::random has gaussian distribution.
See, for example, this question: How to use boost normal distribution classes?
As an alternative, there's a standard way of converting two uniformly distributed random numbers into two normally distributed numbers.
See, e.g. this question: Generate random numbers following a normal distribution in C/C++
In response to your last edit (note that the question is completely different as edited, hence my answer to an original one is irrelevant). I think you'd better off first formulating for yourself what exactly do you mean to mean by "modelling far or not far using a gaussian distribution". Then reformulate that understanding in math terms and only then start programming. As it stands, I think the problem is underspecified.
Use Box-Muller transform. This creates values with a normal/gaussian distribution.
http://en.wikipedia.org/wiki/Normal_distribution#Generating_values_from_normal_distribution
http://en.wikipedia.org/wiki/Box_Muller_transform
It's not very complex to code using math libraries.
eg.
Generate 2 uniform numbers, use them to get two normally distributed numbers. Then Return one and save the other so that you have it for your 'next' request of a random number.
Since C++ 11, std::normal_distribution defined in the standard header random can be used to generate Gaussian random samples. More information can be found herein.

Fast equivalent to sin() for DSP referenced in STK

I'm using bits of Perry Cook's Synthesis Toolkit (STK) to generate saw and square waves. STK includes this BLIT-based sawtooth oscillator:
inline STKFloat BlitSaw::tick( void ) {
StkFloat tmp, denominator = sin( phase_ );
if ( fabs(denominator) <= std::numeric_limits<StkFloat>::epsilon() )
tmp = a_;
else {
tmp = sin( m_ * phase_ );
tmp /= p_ * denominator;
}
tmp += state_ - C2_;
state_ = tmp * 0.995;
phase_ += rate_;
if ( phase_ >= PI )
phase_ -= PI;
lastFrame_[0] = tmp;
return lastFrame_[0];
}
The square wave oscillator is broadly similar. At the top, there's this comment:
// A fully optimized version of this code would replace the two sin
// calls with a pair of fast sin oscillators, for which stable fast
// two-multiply algorithms are well known.
I don't know where to start looking for these "fast two-multiply algorithms" and I'd appreciate some pointers. I could use a lookup table instead, but I'm keen to learn what these 'fast sin oscillators' are. I could also use an abbreviated Taylor series, but thats way more than two multiplies. Searching hasn't turned up anything much, although I did find this approximation:
#define AD_SIN(n) (n*(2.f- fabs(n)))
Plotting it out shows that it's not really a close approximation outside the range of -1 to 1, so I don't think I can use it when phase_ is in the range -pi to pi:
Here, Sine is the blue line and the purple line is the approximation.
Profiling my code reveals that the calls to sin() are far and away the most time-consuming calls, so I really would like to optimise this piece.
Thanks
EDIT Thanks for the detailed and varied answers. I will explore these and accept one at the weekend.
EDIT 2 Would the anonymous close voter please kindly explain their vote in the comments? Thank you.
Essentially the sinusoidal oscilator is one (or more) variables that change with each DSP step, rather than getting recalculated from scratch.
The simplest are based on the following trig identities: (where d is constant, and thus so is cos(d) and sin(d) )
sin(x+d) = sin(x) cos(d) + cos(x) sin(d)
cos(x+d) = cos(x) cos(d) - sin(x) sin(d)
However this requires two variables (one for sin and one for cos) and 4 multiplications to update. However this will still be far faster than calculating a full sine at each step.
The solution by Oli Charlesworth is based on solutions to this general equation
A_{n+1} = a A_{n} + A_{n-1}
Where looking for a solution of the form A_n = k e^(i theta n) gives an equation for theta.
e^(i theta (n+1) ) = a e^(i theta n ) + b e^(i theta (n-1) )
Which simplifies to
e^(i theta) - e^(-i theta ) = a
2 cos(theta) = a
Giving
A_{n+1} = 2 cos(theta) A_{n} + A_{n-1}
Whichever approach you use you'll either need to use one or two of these oscillators for each frequency, or use another trig identity to derive the higher or lower frequencies.
How accurate do you need this?
This function, f(x)=0.398x*(3.1076-|x|), does a reasonably good job for x between -pi and pi.
Edit
An even better approximation is f(x)=0.38981969947653056*(pi-|x|), which keeps the absolute error to 0.038158444604 or less for x between -pi and pi.
A least squares minimization will yield a slightly different function.
It's not possible to generate one-off sin calls with just two multiplies (well, not a useful approximation, at any rate). But it is possible to generate an oscillator with low complexity, i.e. where each value is calculated in terms of the preceding ones.
For instance, consider that the following difference equation will give you a sinusoid:
y[n] = 2*cos(phi)*y[n-1] - y[n-2]
(where cos(phi) is a constant)
(From the original author of the VST BLT code).
As a matter of fact, I was porting the VST BLT oscillators to C#, so I was googling for good sin oscillators. Here's what I came up with. Translation to C++ is straightforward. See the notes at the end about accuumulated round-off errors.
public class FastOscillator
{
private double b1;
private double y1, y2;
private double fScale;
public void Initialize(int sampleRate)
{
fScale = AudioMath.TwoPi / sampleRate;
}
// frequency in Hz. phase in radians.
public void Start(float frequency, double phase)
{
double w = frequency * fScale;
b1 = 2.0 * Math.Cos(w);
y1 = Math.Sin(phase - w);
y2 = Math.Sin(phase - w * 2);
}
public double Tick()
{
double y0 = b1 * y1 - y2;
y2 = y1;
y1 = y0;
return y0;
}
}
Note that this particular oscillator implementation will drift over time, so it needs to be re-initialzed periodically. In this particular implementation, the magnitude of the sin wave decays over time. The original comments in the STK code suggested a two-multiply oscillator. There are, in fact, two-multiply oscillators that are reasonably stable over time. But in retrospect, the need to keep the sin(phase), and sin(m*phase) oscillators tightly in synch probably means that they have to be resynched anyway. Round-off errors between phase and m*phase mean that even if the oscillators were stable, they would drift eventually, running a significant risk of producing large spikes in values near the zeros of the BLT functions. May as well use a one-multiply oscillator.
These particular oscillators should probably be re-initialized every 30 to 100 cycles (or so). My C# implementation is frame based (i.e. it calculates an float[] array of results in a void Tick(int count, float[] result) method. The oscillators are re-synched at the end of each Tick call. Something like this:
void Tick(int count, float[] result)
{
for (int i = 0; i < count; ++i)
{
...
result[i] = bltResult;
}
// re-initialize the oscillators to avoid accumulated drift.
this.phase = (this.phase + this.dPhase*count) % AudioMath.TwoPi;
this.sinOsc.Initialize(frequency,this.phase);
this.mSinOsc.Initialize(frequency*m,this.phase*m);
}
Probably missing from the STK code. You might want to investigate this. The original code provided to the STK did this. Gary Scavone tweaked the code a bit, and I think the optimization was lost. I do know that the STK implementations suffer from DC drift, which can be almost entirely eliminated when implemented properly.
There's a peculiar hack that prevents DC drift of the oscillators, even when sweeping the frequency of the oscillators. The trick is that the oscillators should be started with an initial phase adjustment of dPhase/2. That just so happens to start the oscillators off with zero DC drift, without having to figure out wat the correct initial state for various integrators in each of the BLT oscillators.
Strangely, if the adjustment is re-adjusted whenever the frequency of the oscillator changes, then this also prevents wild DC drift of the output when sweeping the frequency of the oscillator. Whenever the frequency changes, subtract dPhase/2 from the previous phase value, recalculate dPhase for the new frequency, and then add dPhase/2.I rather suspect this could be formally proven; but I have not been able to so. All I know is that It Just Works.
For a block implementation, the oscillators should actually be initialized as follows, instead of carrying the phase adjustment in the current this.phase value.
this.sinOsc.Initialize(frequency,phase+dPhase*0.5);
this.mSinOsc.Initialize(frequency*m,(phase+dPhase*0.5)*m);
You might want to take a look here:
http://devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine/
There's some sample code that calculates a very good appoximation of sin/cos using only multiplies, additions and the abs() function. Quite fast too. The comments are also a good read.
It essentiall boils down to this:
float sine(float x)
{
const float B = 4/pi;
const float C = -4/(pi*pi);
const float P = 0.225;
float y = B * x + C * x * abs(x);
return P * (y * abs(y) - y) + y;
}
and works for a range of -PI to PI
If you can, you should consider memorization based techniques. Essentially store sin(x) and cos(x) values for a bunch values. To calculate sin(y), find a and b for which precomputed values exist such that a<=y<=b. Now using sin(a), sin(b), cos(a), cos(b), y-a and y-b approximately calculate sin(y).
The general idea of getting periodically sampled results from the sine or cosine function is to use a trig recursion or an initialized (barely) stable IIR filter (which can end up being pretty much the same computations). There are bunches of these in the DSP literature, of varying accuracy and stability. Choose carefully.