C++ (and maths) : fast approximation of a trigonometric function

C++ (and maths) : fast approximation of a trigonometric function - c++

I know this is a recurring question, but I haven't really found a useful answer yet. I'm basically looking for a fast approximation of the function acos in C++, I'd like to know if I can significantly beat the standard one.
But some of you might have insights on my specific problem: I'm writing a scientific program which I need to be very fast. The complexity of the main algorithm boils down to computing the following expression (many times with different parameters):
sin( acos(t_1) + acos(t_2) + ... + acos(t_n) )
where the t_i are known real (double) numbers, and n is very small (like smaller than 6). I need a precision of at least 1e-10. I'm currently using the standard sin and acos C++ functions.
Do you think I can significantly gain speed somehow? For those of you who know some maths, do you think it would be smart to expand that sine in order to get an algebraic expression in terms of the t_i (only involving square roots)?
Thank you your your answers.

The code below provides simple implementations of sin() and acos() that should satisfy your accuracy requirements and that you might want to try. Please note that the math library implementation on your platform is very likely highly tuned for the specific hardware capabilities of that platform and is probably also coded in assembly for maximum efficiency, so simple compiled C code not catering to specifics of the hardware is unlikely to provide higher performance, even when the accuracy requirements are somewhat relaxed from full double precision. As Viktor Latypov points out, it may also be worthwhile to search for algorithmic alternatives that do not require expensive calls to transcendental math functions.
In the code below I have tried to stick to simple, portable constructs. If your compiler supports the rint() function [specified by C99 and C++11] you might want to use that instead of my_rint(). On some platforms, the call to floor() can be expensive since it requires dynamic changing of machine state. The functions my_rint(), sin_core(), cos_core(), and asin_core() would want to be inlined for best performance. Your compiler may do that automatically at high optimization levels (e.g. when compiling with -O3), or you could add an appropriate inlining attribute to these functions, e.g. inline or __inline depending on your toolchain.
Not knowing anything about your platform I opted for simple polynomial approximations, which are evaluated using Estrin's scheme plus Horner's scheme. See Wikipedia for a description of these evaluation schemes:
http://en.wikipedia.org/wiki/Estrin%27s_scheme ,
http://en.wikipedia.org/wiki/Horner_scheme
The approximations themselves are of the minimax type and were custom generated for this answer with the Remez algorithm:
http://en.wikipedia.org/wiki/Minimax_approximation_algorithm ,
http://en.wikipedia.org/wiki/Remez_algorithm
The identities used in the argument reduction for acos() are noted in the comments, for sin() I used a Cody/Waite-style argument reduction, as described in the following book:
W. J. Cody, W. Waite, Software Manual for the Elementary Functions. Prentice-Hall, 1980
The error bounds mentioned in the comments are approximate, and have not been rigorously tested or proven.
/* not quite rint(), i.e. results not properly rounded to nearest-or-even */
double my_rint (double x)
{
double t = floor (fabs(x) + 0.5);
return (x < 0.0) ? -t : t;
}
/* minimax approximation to cos on [-pi/4, pi/4] with rel. err. ~= 7.5e-13 */
double cos_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using Estrin's scheme */
return (-2.7236370439787708e-7 * x2 + 2.4799852696610628e-5) * x8 +
(-1.3888885054799695e-3 * x2 + 4.1666666636943683e-2) * x4 +
(-4.9999999999963024e-1 * x2 + 1.0000000000000000e+0);
}
/* minimax approximation to sin on [-pi/4, pi/4] with rel. err. ~= 5.5e-12 */
double sin_core (double x)
{
double x4, x2, t;
x2 = x * x;
x4 = x2 * x2;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return ((2.7181216275479732e-6 * x2 - 1.9839312269456257e-4) * x4 +
(8.3333293048425631e-3 * x2 - 1.6666666640797048e-1)) * x2 * x + x;
}
/* minimax approximation to arcsin on [0, 0.5625] with rel. err. ~= 1.5e-11 */
double asin_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return (((4.5334220547132049e-2 * x2 - 1.1226216762576600e-2) * x4 +
(2.6334281471361822e-2 * x2 + 2.0596336163223834e-2)) * x8 +
(3.0582043602875735e-2 * x2 + 4.4630538556294605e-2) * x4 +
(7.5000364034134126e-2 * x2 + 1.6666666300567365e-1)) * x2 * x + x;
}
/* relative error < 7e-12 on [-50000, 50000] */
double my_sin (double x)
{
double q, t;
int quadrant;
/* Cody-Waite style argument reduction */
q = my_rint (x * 6.3661977236758138e-1);
quadrant = (int)q;
t = x - q * 1.5707963267923333e+00;
t = t - q * 2.5633441515945189e-12;
if (quadrant & 1) {
t = cos_core(t);
} else {
t = sin_core(t);
}
return (quadrant & 2) ? -t : t;
}
/* relative error < 2e-11 on [-1, 1] */
double my_acos (double x)
{
double xa, t;
xa = fabs (x);
/* arcsin(x) = pi/2 - 2 * arcsin (sqrt ((1-x) / 2))
* arccos(x) = pi/2 - arcsin(x)
* arccos(x) = 2 * arcsin (sqrt ((1-x) / 2))
*/
if (xa > 0.5625) {
t = 2.0 * asin_core (sqrt (0.5 * (1.0 - xa)));
} else {
t = 1.5707963267948966 - asin_core (xa);
}
/* arccos (-x) = pi - arccos(x) */
return (x < 0.0) ? (3.1415926535897932 - t) : t;
}

sin( acos(t1) + acos(t2) + ... + acos(tn) )
boils down to the calculation of
sin( acos(x) ) and cos(acos(x))=x
because
sin(a+b) = cos(a)sin(b)+sin(a)cos(b).
The first thing is
sin( acos(x) ) = sqrt(1-x*x)
Taylor series expansion for the sqrt reduces the problem to polynomial calculations.
To clarify, here's the expansion to n=2, n=3:
sin( acos(t1) + acos(t2) ) = sin(acos(t1))cos(acos(t2)) + sin(acos(t2))cos(acos(t1) = sqrt(1-t1*t1) * t2 + sqrt(1-t2*t2) * t1
cos( acos(t2) + acos(t3) ) = cos(acos(t2)) cos(acos(t3)) - sin(acos(t2))sin(acos(t3)) = t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)
sin( acos(t1) + acos(t2) + acos(t3)) =
sin(acos(t1))cos(acos(t2) + acos(t3)) + sin(acos(t2)+acos(t3) )cos(acos(t1)=
sqrt(1-t1*t1) * (t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)) + (sqrt(1-t2*t2) * t3 + sqrt(1-t3*t3) * t2 ) * t1
and so on.
The sqrt() for x in (-1,1) can be computed using
x_0 is some approximation, say, zero
x_(n+1) = 0.5 * (x_n + S/x_n) where S is the argument.
EDIT: I mean the "Babylonian method", see Wikipedia's article for details. You will need not more than 5-6 iterations to achieve 1e-10 with x in (0,1).

As Jonas Wielicki mentions in the comments, there isn't much precision trade-offs you can make.
Your best bet is to try and use the processor intrinsics for the functions (if your compiler doesn't do this already) and using some math to reduce the amount of calculations necessary.
Also very important is to keep everything in a CPU-friendly format, make sure there are few cache misses, etc.
If you are calculating large amounts of functions like acos perhaps moving to the GPU is an option for you?

You can try to create lookup tables, and use them instead of standard c++ functions, and see if you see any performance boost.

Significant gains can be made by aligning memory and streaming in the data to your kernel. Most often this dwarfs the gains that can be made by recreating the math functions. Think of how you can improve memory access to/from your kernel operator.
Memory access can be improved by using buffering techniques. This depends on your hardware platform. If you are running this on a DSP, you could DMA your data onto an L2 cache and schedule the instructions so that multiplier units are fully occupied.
If you are on general purpose CPU, most you can do is to use aligned data, feed the cache lines by prefetching. If you have nested loops, then the inner most loop should go back and forth (i.e. iterate forward and then iterate backward) so that cache lines are utilised, etc.
You could also think of ways to parallelize the computation using multiple cores. If you can use a GPU this could significantly improve performance (albeit with a lesser precision).

In addition to what others have said, here are some techniques at speed optimization:
Profile
Find out where in the code most of the time is spent.
Only optimize that area to gain the mose benefit.
Unroll Loops
The processors don't like branches or jumps or changes in the execution path. In general, the processor has to reload the instruction pipeline which uses up time that can be spent on calculations. This includes function calls.
The technique is to place more "sets" of operations in your loop and reduce the number of iterations.
Declare Variables as Register
Variables that are used frequently should be declared as register. Although many members of SO have stated compilers ignore this suggestion, I have found out otherwise. Worst case, you wasted some time typing.
Keep Intense Calculations Short & Simple
Many processors have enough room in their instruction pipelines to hold small for loops. This reduces the amount of time spent reloading the instruction pipeline.
Distribute your big calculation loop into many small ones.
Perform Work on Small Sections of Arrays & Matrices
Many processors have a data cache, which is ultra fast memory very close to the processor. The processor likes to load the data cache once from off-processor memory. More loads require time that can be spent making calculations. Search the web for "Data Oriented Design Cache".
Think in Parallel Processor Terms
Change the design of your calculations so they can be easily adaptable to use with multiple processors. Many CPUs have multiple cores that can execute instructions in parallel. Some processors have enough intelligence to automatically delegate instructions to their multiple cores.
Some compilers can optimize code for parallel processing (look up the compiler options for your compiler). Designing your code for parallel processing will make this optimization easier for the compiler.
Analyze Assembly Listing of Functions
Print out the assembly language listing of your function.
Change the design of your function to match that of the assembly language or to help the compiler generate more optimal assembly language.
If you really need more efficiency, optimize the assembly language and put in as inline assembly code or as a separate module. I generally prefer the latter.
Examples
In your situation, take first 10 terms of the Taylor expansion, calculate them separately and place into individual variables:
double term1, term2, term3, term4;
double n, n1, n2, n3, n4;
n = 1.0;
for (i = 0; i < 100; ++i)
{
n1 = n + 2;
n2 = n + 4;
n3 = n + 6;
n4 = n + 8;
term1 = 4.0/n;
term2 = 4.0/n1;
term3 = 4.0/n2;
term4 = 4.0/n3;
Then sum up all of your terms:
result = term1 - term2 + term3 - term4;
// Or try sorting by operation, if possible:
// result = term1 + term3;
// result -= term2 + term4;
n = n4 + 2;
}

Lets consider two terms first:
cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
or cos(a+b) = cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b))
Taking cos to the RHS
a+b = acos( cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b)) ) ... 1
Here cos(a) = t_1 and cos(b) = t_2
a = acos(t_1) and b = acos(t_2)
By substituting in equation (1), we get
acos(t_1) + acos(t_2) = acos(t_1*t_2 - sqrt(1 - t_1*t_1) * sqrt(1 - t_2*t_2))
Here you can see that you have combined two acos into one. So you can pair up all the acos recursively and form a binary tree. At the end, you'll be left with an expression of the form sin(acos(x)) which equals sqrt(1 - x*x).
This will improve the time complexity.
However, I'm not sure about the complexity of calculating sqrt().

Related

Does gcc optimize c++ code algebraically and if so to what extent?

Consider the following piece of code showing some simple arithmetic operations
int result = 0;
result = c * (a + b) + d * (a + b) + e;
To get the result in the expression above the cpu would need to execute two integer multiplications and three integer additions. However algebraically the above expression could be simplified to the code below.
result = (c + d) * (a + b) + e
The two expressions are algebraically identical however the second expression only contains one multiplication and three additions. Is gcc (or other compilers for that matter) able to make this simple optimization on their own.
Now assuming that the compiler is intelligent enough to make this simple optimization, would it be able to optimize something more complex such as the Trapezoidal rule (used for numerical integration). Example below approximates the area under sin(x) where 0 <= x <= pi with a step size of pi/4 (small for the sake of simplicity). Please assume all literals are runtime variables.
#include <math.h>
// Please assume all literals are runtime variables. Have done it this way to
// simplify the code.
double integral = 0.5 * ((sin(0) + sin(M_PI/4) * (M_PI/4 - 0) + (sin(M_PI/4) +
sin(M_PI/2)) * (M_PI/2 - M_PI/4) + (sin(M_PI/2) + sin(3 * M_PI/4)) *
(3 * M_PI/4 - M_PI/2) + (sin(3 * M_PI/4) + sin(M_PI)) * (M_PI - 3 * M_PI/4));
Now the above function could be written like this once simplified using the trapezoidal rule. This drastically reduces the number of multiplications/divisions needed to get the same answer.
integral = 0.5 * (1 / no_steps /* 4 in th case above*/) *
(M_PI - 0 /* Upper and lower limit*/) * (sin(0) + 2 * (sin(M_PI/4) +
sin(3 * M_PI/4)) + sin(M_PI));

GCC (and most C++ compilers, for that matter) does not refactor algebraic expressions.
This is mainly because as far as GCC and general software arithmetic is concerned, the lines
double x = 0.5 * (4.6 + 6.7);
double y = 0.5 * 4.6 + 0.5 * 6.7;
assert(x == y); //Will probably fail!
Are not guaranteed to be evaluate to the exact same number. GCC can't optimize these structures without that kind of guarantee.
Furthermore, order of operations can matter a lot. For example:
int x = y;
int z = (y / 16) * 16;
assert(x == z); //Will only be true if y is a whole multiple of 16
Algebraically, those two lines should be equivalent, right? But if y is an int, what it will actually do is make x equal to "y rounded to the lower whole multiple of 16". Sometimes, that's intended behavior (Like if you're byte aligning). Other times, it's a bug. The important thing is, both are valid computer code and both can be useful depending on the circumstances, and if GCC optimized around those structures, it would prevent programmers from having agency over their code.

Yes, optimizers, gcc's included, do optimizations of this type. Not necessarily the expression that you quoted exactly, or other arbitrarily complex expressions. But a simpler expresion, (a + a) - a is likely to be optimized to a for example. Another example of possible optimization is a*a*a*a to temp = a*a; (temp)*(temp)
Whether a given compiler optimizes the expressions that you quote, can be observed by reading the output assembly code.
No, this type of optimization is not used with floating points by default (unless maybe if the optimizer can prove that no accuracy is lost). See Are floating point operations in C associative? You can let for example gcc do this with -fassociative-math option. At your own peril.

Optimization of double subtraction in C++

I have the following code that I use to compute the distance between two vectors:
double dist(vector<double> & vecA, vector<double> & vecB){
double curDist = 0.0;
for (size_t i = 0; i < vecA.size(); i++){
double dif = vecA[i] - vecB[i];
curDist += dif * dif;
}
return curDist;
}
This function is a major bottleneck in my application since it relies on a lot of distance calculations, consuming more than 60% of CPU time on a typical input. Additionally, the following line:
double dif = vecA[i] - vecB[i];
is responsible for more than 77% of CPU time in this function. My question is: is it possible to somehow optimize this function?
Notes:
To profile my application I have used Intel Amplifier XE;
Reducing the number of distance computations is not a feasible solution for
me;

There are two possible issues I can think of right now:
This computation is memory bound.
There is an iteration-to-iteration dependency on curDist.
This computation is memory bound.
Your dataset is larger than your CPU cache. So in this case, no amount of optimization is going to help unless you can restructure your algorithm.
There is an iteration-to-iteration dependency on curDist.
You have a dependency on curDist. This will block vectorization by the compiler. (Also, don't always trust the profiler numbers to the line. They can be inaccurate especially after compiler optimizations.)
Normally, the compiler vectorizer can split up the curDist into multiple partial sums to and unroll/vectorize the loop. But it can't do that under strict-floating-point behavior. You can try relaxing your floating-point mode if you haven't already. Or you can split the sum and unroll it yourself.
For example, this kind of optimization is something the compiler can do with integers, but not necessarily with floating-point:
double curDist0 = 0.0;
double curDist1 = 0.0;
double curDist2 = 0.0;
double curDist3 = 0.0;
for (size_t i = 0; i < vecA.size() - 3; i += 4){
double dif0 = vecA[i + 0] - vecB[i + 0];
double dif1 = vecA[i + 1] - vecB[i + 1];
double dif2 = vecA[i + 2] - vecB[i + 2];
double dif3 = vecA[i + 3] - vecB[i + 3];
curDist0 += dif0 * dif0;
curDist1 += dif1 * dif1;
curDist2 += dif2 * dif2;
curDist3 += dif3 * dif3;
}
// Do some sort of cleanup in case (vecA.size() % 4 != 0)
double curDist = curDist0 + curDist1 + curDist2 + curDist3;

You could eliminate the call to vecA.size() for each iteration of the loop, just call it once before the loop. You could also do loop unrolling to give yourself more computation per loop iteration. What compiler are you using, and what optimization settings? Compiler will often do unrolling for you, but you could manually do it.

If it's feasible (if the range of the numbers isn't huge) you may want to explore using fixed point to store these numbers, rather than doubles.
Fixed point would turn these into int operations rather than double operations.
Another interesting thing is that assuming your profile is correct, the lookups seems to be a significant factor (otherwise the multiplication would likely be more costly than the subtractions).
I'd try using a const vector iterator rather than the random access lookup. It may help in two ways: 1 - it is constant, and 2 - the serial nature of the iterator may let the processor do better caching.

If your platform does not have (or is not using) an ALU that supports floating point math, floating point libraries, by nature, are slow and consume additional non-volatile memory. I suggest instead using 32-bit (long) or 64-bit (long long) fixed-point arithmetic. Then convert the final result to floating point at the end of the algorithm. I did this on a project a couple years ago to improve the performance of an I2T algorithm and it worked wonderfully.

Fast equivalent to sin() for DSP referenced in STK

I'm using bits of Perry Cook's Synthesis Toolkit (STK) to generate saw and square waves. STK includes this BLIT-based sawtooth oscillator:
inline STKFloat BlitSaw::tick( void ) {
StkFloat tmp, denominator = sin( phase_ );
if ( fabs(denominator) <= std::numeric_limits<StkFloat>::epsilon() )
tmp = a_;
else {
tmp = sin( m_ * phase_ );
tmp /= p_ * denominator;
}
tmp += state_ - C2_;
state_ = tmp * 0.995;
phase_ += rate_;
if ( phase_ >= PI )
phase_ -= PI;
lastFrame_[0] = tmp;
return lastFrame_[0];
}
The square wave oscillator is broadly similar. At the top, there's this comment:
// A fully optimized version of this code would replace the two sin
// calls with a pair of fast sin oscillators, for which stable fast
// two-multiply algorithms are well known.
I don't know where to start looking for these "fast two-multiply algorithms" and I'd appreciate some pointers. I could use a lookup table instead, but I'm keen to learn what these 'fast sin oscillators' are. I could also use an abbreviated Taylor series, but thats way more than two multiplies. Searching hasn't turned up anything much, although I did find this approximation:
#define AD_SIN(n) (n*(2.f- fabs(n)))
Plotting it out shows that it's not really a close approximation outside the range of -1 to 1, so I don't think I can use it when phase_ is in the range -pi to pi:
Here, Sine is the blue line and the purple line is the approximation.
Profiling my code reveals that the calls to sin() are far and away the most time-consuming calls, so I really would like to optimise this piece.
Thanks
EDIT Thanks for the detailed and varied answers. I will explore these and accept one at the weekend.
EDIT 2 Would the anonymous close voter please kindly explain their vote in the comments? Thank you.

Essentially the sinusoidal oscilator is one (or more) variables that change with each DSP step, rather than getting recalculated from scratch.
The simplest are based on the following trig identities: (where d is constant, and thus so is cos(d) and sin(d) )
sin(x+d) = sin(x) cos(d) + cos(x) sin(d)
cos(x+d) = cos(x) cos(d) - sin(x) sin(d)
However this requires two variables (one for sin and one for cos) and 4 multiplications to update. However this will still be far faster than calculating a full sine at each step.
The solution by Oli Charlesworth is based on solutions to this general equation
A_{n+1} = a A_{n} + A_{n-1}
Where looking for a solution of the form A_n = k e^(i theta n) gives an equation for theta.
e^(i theta (n+1) ) = a e^(i theta n ) + b e^(i theta (n-1) )
Which simplifies to
e^(i theta) - e^(-i theta ) = a
2 cos(theta) = a
Giving
A_{n+1} = 2 cos(theta) A_{n} + A_{n-1}
Whichever approach you use you'll either need to use one or two of these oscillators for each frequency, or use another trig identity to derive the higher or lower frequencies.

How accurate do you need this?
This function, f(x)=0.398x*(3.1076-|x|), does a reasonably good job for x between -pi and pi.
Edit
An even better approximation is f(x)=0.38981969947653056*(pi-|x|), which keeps the absolute error to 0.038158444604 or less for x between -pi and pi.
A least squares minimization will yield a slightly different function.

It's not possible to generate one-off sin calls with just two multiplies (well, not a useful approximation, at any rate). But it is possible to generate an oscillator with low complexity, i.e. where each value is calculated in terms of the preceding ones.
For instance, consider that the following difference equation will give you a sinusoid:
y[n] = 2*cos(phi)*y[n-1] - y[n-2]
(where cos(phi) is a constant)

(From the original author of the VST BLT code).
As a matter of fact, I was porting the VST BLT oscillators to C#, so I was googling for good sin oscillators. Here's what I came up with. Translation to C++ is straightforward. See the notes at the end about accuumulated round-off errors.
public class FastOscillator
{
private double b1;
private double y1, y2;
private double fScale;
public void Initialize(int sampleRate)
{
fScale = AudioMath.TwoPi / sampleRate;
}
// frequency in Hz. phase in radians.
public void Start(float frequency, double phase)
{
double w = frequency * fScale;
b1 = 2.0 * Math.Cos(w);
y1 = Math.Sin(phase - w);
y2 = Math.Sin(phase - w * 2);
}
public double Tick()
{
double y0 = b1 * y1 - y2;
y2 = y1;
y1 = y0;
return y0;
}
}
Note that this particular oscillator implementation will drift over time, so it needs to be re-initialzed periodically. In this particular implementation, the magnitude of the sin wave decays over time. The original comments in the STK code suggested a two-multiply oscillator. There are, in fact, two-multiply oscillators that are reasonably stable over time. But in retrospect, the need to keep the sin(phase), and sin(m*phase) oscillators tightly in synch probably means that they have to be resynched anyway. Round-off errors between phase and m*phase mean that even if the oscillators were stable, they would drift eventually, running a significant risk of producing large spikes in values near the zeros of the BLT functions. May as well use a one-multiply oscillator.
These particular oscillators should probably be re-initialized every 30 to 100 cycles (or so). My C# implementation is frame based (i.e. it calculates an float[] array of results in a void Tick(int count, float[] result) method. The oscillators are re-synched at the end of each Tick call. Something like this:
void Tick(int count, float[] result)
{
for (int i = 0; i < count; ++i)
{
...
result[i] = bltResult;
}
// re-initialize the oscillators to avoid accumulated drift.
this.phase = (this.phase + this.dPhase*count) % AudioMath.TwoPi;
this.sinOsc.Initialize(frequency,this.phase);
this.mSinOsc.Initialize(frequency*m,this.phase*m);
}
Probably missing from the STK code. You might want to investigate this. The original code provided to the STK did this. Gary Scavone tweaked the code a bit, and I think the optimization was lost. I do know that the STK implementations suffer from DC drift, which can be almost entirely eliminated when implemented properly.
There's a peculiar hack that prevents DC drift of the oscillators, even when sweeping the frequency of the oscillators. The trick is that the oscillators should be started with an initial phase adjustment of dPhase/2. That just so happens to start the oscillators off with zero DC drift, without having to figure out wat the correct initial state for various integrators in each of the BLT oscillators.
Strangely, if the adjustment is re-adjusted whenever the frequency of the oscillator changes, then this also prevents wild DC drift of the output when sweeping the frequency of the oscillator. Whenever the frequency changes, subtract dPhase/2 from the previous phase value, recalculate dPhase for the new frequency, and then add dPhase/2.I rather suspect this could be formally proven; but I have not been able to so. All I know is that It Just Works.
For a block implementation, the oscillators should actually be initialized as follows, instead of carrying the phase adjustment in the current this.phase value.
this.sinOsc.Initialize(frequency,phase+dPhase*0.5);
this.mSinOsc.Initialize(frequency*m,(phase+dPhase*0.5)*m);

You might want to take a look here:
http://devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine/
There's some sample code that calculates a very good appoximation of sin/cos using only multiplies, additions and the abs() function. Quite fast too. The comments are also a good read.
It essentiall boils down to this:
float sine(float x)
{
const float B = 4/pi;
const float C = -4/(pi*pi);
const float P = 0.225;
float y = B * x + C * x * abs(x);
return P * (y * abs(y) - y) + y;
}
and works for a range of -PI to PI

If you can, you should consider memorization based techniques. Essentially store sin(x) and cos(x) values for a bunch values. To calculate sin(y), find a and b for which precomputed values exist such that a<=y<=b. Now using sin(a), sin(b), cos(a), cos(b), y-a and y-b approximately calculate sin(y).

The general idea of getting periodically sampled results from the sine or cosine function is to use a trig recursion or an initialized (barely) stable IIR filter (which can end up being pretty much the same computations). There are bunches of these in the DSP literature, of varying accuracy and stability. Choose carefully.

sin and cos are slow, is there an alternatve?

My game needs to move by a certain angle. To do this I get the vector of the angle via sin and cos. Unfortunately sin and cos are my bottleneck. I'm sure I do not need this much precision. Is there an alternative to a C sin & cos and look-up table that is decently precise but very fast?
I had found this:
float Skeleton::fastSin( float x )
{
const float B = 4.0f/pi;
const float C = -4.0f/(pi*pi);
float y = B * x + C * x * abs(x);
const float P = 0.225f;
return P * (y * abs(y) - y) + y;
}
Unfortunately, this does not seem to work. I get significantly different behavior when I use this sin rather than C sin.
Thanks

A lookup table is the standard solution. You could Also use two lookup tables on for degrees and one for tenths of degrees and utilize sin(A + B) = sin(a)cos(b) + cos(A)sin(b)

For your fastSin(), you should check its documentation to see what range it's valid on. The units you're using for your game could be too big or too small and scaling them to fit within that function's expected range could make it work better.
EDIT:
Someone else mentioned getting it into the desired range by subtracting PI, but apparently there's a function called fmod for doing modulus division on floats/doubles, so this should do it:
#include <iostream>
#include <cmath>
float fastSin( float x ){
x = fmod(x + M_PI, M_PI * 2) - M_PI; // restrict x so that -M_PI < x < M_PI
const float B = 4.0f/M_PI;
const float C = -4.0f/(M_PI*M_PI);
float y = B * x + C * x * std::abs(x);
const float P = 0.225f;
return P * (y * std::abs(y) - y) + y;
}
int main() {
std::cout << fastSin(100.0) << '\n' << std::sin(100.0) << std::endl;
}
I have no idea how expensive fmod is though, so I'm going to try a quick benchmark next.
Benchmark Results
I compiled this with -O2 and ran the result with the Unix time program:
int main() {
float a = 0;
for(int i = 0; i < REPETITIONS; i++) {
a += sin(i); // or fastSin(i);
}
std::cout << a << std::endl;
}
The result is that sin is about 1.8x slower (if fastSin takes 5 seconds, sin takes 9). The accuracy also seemed to be pretty good.
If you chose to go this route, make sure to compile with optimization on (-O2 in gcc).

I know this is already an old topic, but for people who have the same question, here is a tip.
A lot of times in 2D and 3D rotation, all vectors are rotated with a fixed angle. In stead of calling the cos() or sin() every cycle of the loop, create variable before the loop which contains the value of cos(angle) or sin(angle) already. You can use this variable in your loop. This way the function only has to be called once.

If you rephrase the return in fastSin as
return (1-P) * y + P * (y * abs(y))
And rewrite y as (for x>0 )
y = 4 * x * (pi-x) / (pi * pi)
you can see that y is a parabolic first-order approximation to sin(x) chosen so that it passes through (0,0), (pi/2,1) and (pi,0), and is symmetrical about x=pi/2.
Thus we can only expect our function to be a good approximation from 0 to pi. If we want values outside that range we can use the 2-pi periodicity of sin(x) and that sin(x+pi) = -sin(x).
The y*abs(y) is a "correction term" which also passes through those three points. (I'm not sure why y*abs(y) is used rather than just y*y since y is positive in the 0-pi range).
This form of overall approximation function guarantees that a linear blend of the two functions y and y*y, (1-P)*y + P * y*y will also pass through (0,0), (pi/2,1) and (pi,0).
We might expect y to be a decent approximation to sin(x), but the hope is that by picking a good value for P we get a better approximation.
One question is "How was P chosen?". Personally, I'd chose the P that produced the least RMS error over the 0,pi/2 interval. (I'm not sure that's how this P was chosen though)
Minimizing this wrt. P gives
This can be rearranged and solved for p
Wolfram alpha evaluates the initial integral to be the quadratic
E = (16 π^5 p^2 - (96 π^5 + 100800 π^2 - 967680)p + 651 π^5 - 20160 π^2)/(1260 π^4)
which has a minimum of
min(E) = -11612160/π^9 + 2419200/π^7 - 126000/π^5 - 2304/π^4 + 224/π^2 + (169 π)/420
≈ 5.582129689596371e-07
at
p = 3 + 30240/π^5 - 3150/π^3
≈ 0.2248391013559825
Which is pretty close to the specified P=0.225.
You can raise the accuracy of the approximation by adding an additional correction term. giving a form something like return (1-a-b)*y + a y * abs(y) + b y * y * abs(y). I would find a and b by in the same way as above, this time giving a system of two linear equations in a and b to solve, rather than a single equation in p. I'm not going to do the derivation as it is tedious and the conversion to latex images is painful... ;)
NOTE: When answering another question I thought of another valid choice for P.
The problem is that using reflection to extend the curve into (-pi,0) leaves a kink in the curve at x=0. However, I suspect we can choose P such that the kink becomes smooth.
To do this take the left and right derivatives at x=0 and ensure they are equal. This gives an equation for P.

You can compute a table S of 256 values, from sin(0) to sin(2 * pi). Then, to pick sin(x), bring back x in [0, 2 * pi], you can pick 2 values S[a], S[b] from the table, such as a < x < b. From this, linear interpolation, and you should have a fair approximation
memory saving trick : you actually need to store only from [0, pi / 2], and use symmetries of sin(x)
enhancement trick : linear interpolation can be a problem because of non-smooth derivatives, humans eyes is good at spotting such glitches in animation and graphics. Use cubic interpolation then.

What about
x*(0.0174532925199433-8.650935142277599*10^-7*x^2)
for deg and
x*(1-0.162716259904269*x^2)
for rad on -45, 45 and -pi/4 , pi/4 respectively?

This (i.e. the fastsin function) is approximating the sine function using a parabola. I suspect it's only good for values between -π and +π. Fortunately, you can keep adding or subtracting 2π until you get into this range. (Edited to specify what is approximating the sine function using a parabola.)

you can use this aproximation.
this solution use a quadratic curve :
http://www.starming.com/index.php?action=plugin&v=wave&ajax=iframe&iframe=fullviewonepost&mid=56&tid=4825

Create sine lookup table in C++

How can I rewrite the following pseudocode in C++?
real array sine_table[-1000..1000]
for x from -1000 to 1000
sine_table[x] := sine(pi * x / 1000)
I need to create a sine_table lookup table.

You can reduce the size of your table to 25% of the original by only storing values for the first quadrant, i.e. for x in [0,pi/2].
To do that your lookup routine just needs to map all values of x to the first quadrant using simple trig identities:
sin(x) = - sin(-x), to map from quadrant IV to I
sin(x) = sin(pi - x), to map from quadrant II to I
To map from quadrant III to I, apply both identities, i.e. sin(x) = - sin (pi + x)
Whether this strategy helps depends on how much memory usage matters in your case. But it seems wasteful to store four times as many values as you need just to avoid a comparison and subtraction or two during lookup.
I second Jeremy's recommendation to measure whether building a table is better than just using std::sin(). Even with the original large table, you'll have to spend cycles during each table lookup to convert the argument to the closest increment of pi/1000, and you'll lose some accuracy in the process.
If you're really trying to trade accuracy for speed, you might try approximating the sin() function using just the first few terms of the Taylor series expansion.
sin(x) = x - x^3/3! + x^5/5! ..., where ^ represents raising to a power and ! represents the factorial.
Of course, for efficiency, you should precompute the factorials and make use of the lower powers of x to compute higher ones, e.g. use x^3 when computing x^5.
One final point, the truncated Taylor series above is more accurate for values closer to zero, so its still worthwhile to map to the first or fourth quadrant before computing the approximate sine.
Addendum:
Yet one more potential improvement based on two observations:
1. You can compute any trig function if you can compute both the sine and cosine in the first octant [0,pi/4]
2. The Taylor series expansion centered at zero is more accurate near zero
So if you decide to use a truncated Taylor series, then you can improve accuracy (or use fewer terms for similar accuracy) by mapping to either the sine or cosine to get the angle in the range [0,pi/4] using identities like sin(x) = cos(pi/2-x) and cos(x) = sin(pi/2-x) in addition to the ones above (for example, if x > pi/4 once you've mapped to the first quadrant.)
Or if you decide to use a table lookup for both the sine and cosine, you could get by with two smaller tables that only covered the range [0,pi/4] at the expense of another possible comparison and subtraction on lookup to map to the smaller range. Then you could either use less memory for the tables, or use the same memory but provide finer granularity and accuracy.

long double sine_table[2001];
for (int index = 0; index < 2001; index++)
{
sine_table[index] = std::sin(PI * (index - 1000) / 1000.0);
}

One more point: calling trigonometric functions is pricey. if you want to prepare the lookup table for sine with constant step - you may save the calculation time, in expense of some potential precision loss.
Consider your minimal step is "a". That is, you need sin(a), sin(2a), sin(3a), ...
Then you may do the following trick: First calculate sin(a) and cos(a). Then for every consecutive step use the following trigonometric equalities:
sin([n+1] * a) = sin(n*a) * cos(a) + cos(n*a) * sin(a)
cos([n+1] * a) = cos(n*a) * cos(a) - sin(n*a) * sin(a)
The drawback of this method is that during this procedure the round-off error is accumulated.

double table[1000] = {0};
for (int i = 1; i <= 1000; i++)
{
sine_table[i-1] = std::sin(PI * i/ 1000.0);
}
double getSineValue(int multipleOfPi){
if(multipleOfPi == 0) return 0.0;
int sign = 1;
if(multipleOfPi < 0){
sign = -1;
}
return signsine_table[signmultipleOfPi - 1];
}
You can reduce the array length to 500, by a trick sin(pi/2 +/- angle) = +/- cos(angle).
So store sin and cos from 0 to pi/4.
I don't remember from top of my head but it increased the speed of my program.

You'll want the std::sin() function from <cmath>.

another approximation from a book or something
streamin ramp;
streamout sine;
float x,rect,k,i,j;
x = ramp -0.5;
rect = x * (1 - x < 0 & 2);
k = (rect + 0.42493299) *(rect -0.5) * (rect - 0.92493302) ;
i = 0.436501 + (rect * (rect + 1.05802));
j = 1.21551 + (rect * (rect - 2.0580201));
sine = i*j*k*60.252201*x;
full discussion here:
http://synthmaker.co.uk/forum/viewtopic.php?f=4&t=6457&st=0&sk=t&sd=a
I presume that you know, that using a division is a lot slower than multiplying by decimal number, /5 is always slower than *0.2
it's just an approximation.
also:
streamin ramp;
streamin x; // 1.5 = Saw 3.142 = Sin 4.5 = SawSin
streamout sine;
float saw,saw2;
saw = (ramp * 2 - 1) * x;
saw2 = saw * saw;
sine = -0.166667 + saw2 * (0.00833333 + saw2 * (-0.000198409 + saw2 * (2.7526e-006+saw2 * -2.39e-008)));
sine = saw * (1+ saw2 * sine);

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js