More efficient way to do this in GLSL

More efficient way to do this in GLSL - opengl

x += offset * vec3(notEqual(a, greaterThanEqual(fract(b), vec3(0.5))));
x and b are vector3, and a is bvec3.
This seems fairly expensive and i'm wondering if there is another way to do it. Basically I want to offset x component wise by a fixed amount depending on if b's fractional component is above 0.5, and if a is 1 or 0 (true or false). If it's 1 and <0.5, do an offset, if it's 0 and >0.5 do an offset, like xor, I use notEqual for xor here.

I can't think of anything massively better. This one looks slightly simpler than what you have:
x += offset * vec3(notEqual(vec3(a), round(fract(b))));
Or similarly:
x += offset * vec3(notEqual(vec3(a), step(0.5, fract(b))));
If the notEqual() is expensive, the xor-type operation could be replaced by using a sum modulo 2:
x += offset * mod(vec3(a) + round(fract(b)), 2.0);
Or in a similar spirit, but avoiding the mod() at the price of a few more basic operations:
vec3 af = vec3(a);
vec3 brf = round(fract(b));
x += offset * (af + brf - 2.0 * af * brf);
There's probably countless more variations and permutations of similar ideas. As was already suggested in the comments, there's almost no way around benchmarking them on a good cross-section of the hardware you care about.

Related

Does gcc optimize c++ code algebraically and if so to what extent?

Consider the following piece of code showing some simple arithmetic operations
int result = 0;
result = c * (a + b) + d * (a + b) + e;
To get the result in the expression above the cpu would need to execute two integer multiplications and three integer additions. However algebraically the above expression could be simplified to the code below.
result = (c + d) * (a + b) + e
The two expressions are algebraically identical however the second expression only contains one multiplication and three additions. Is gcc (or other compilers for that matter) able to make this simple optimization on their own.
Now assuming that the compiler is intelligent enough to make this simple optimization, would it be able to optimize something more complex such as the Trapezoidal rule (used for numerical integration). Example below approximates the area under sin(x) where 0 <= x <= pi with a step size of pi/4 (small for the sake of simplicity). Please assume all literals are runtime variables.
#include <math.h>
// Please assume all literals are runtime variables. Have done it this way to
// simplify the code.
double integral = 0.5 * ((sin(0) + sin(M_PI/4) * (M_PI/4 - 0) + (sin(M_PI/4) +
sin(M_PI/2)) * (M_PI/2 - M_PI/4) + (sin(M_PI/2) + sin(3 * M_PI/4)) *
(3 * M_PI/4 - M_PI/2) + (sin(3 * M_PI/4) + sin(M_PI)) * (M_PI - 3 * M_PI/4));
Now the above function could be written like this once simplified using the trapezoidal rule. This drastically reduces the number of multiplications/divisions needed to get the same answer.
integral = 0.5 * (1 / no_steps /* 4 in th case above*/) *
(M_PI - 0 /* Upper and lower limit*/) * (sin(0) + 2 * (sin(M_PI/4) +
sin(3 * M_PI/4)) + sin(M_PI));

GCC (and most C++ compilers, for that matter) does not refactor algebraic expressions.
This is mainly because as far as GCC and general software arithmetic is concerned, the lines
double x = 0.5 * (4.6 + 6.7);
double y = 0.5 * 4.6 + 0.5 * 6.7;
assert(x == y); //Will probably fail!
Are not guaranteed to be evaluate to the exact same number. GCC can't optimize these structures without that kind of guarantee.
Furthermore, order of operations can matter a lot. For example:
int x = y;
int z = (y / 16) * 16;
assert(x == z); //Will only be true if y is a whole multiple of 16
Algebraically, those two lines should be equivalent, right? But if y is an int, what it will actually do is make x equal to "y rounded to the lower whole multiple of 16". Sometimes, that's intended behavior (Like if you're byte aligning). Other times, it's a bug. The important thing is, both are valid computer code and both can be useful depending on the circumstances, and if GCC optimized around those structures, it would prevent programmers from having agency over their code.

Yes, optimizers, gcc's included, do optimizations of this type. Not necessarily the expression that you quoted exactly, or other arbitrarily complex expressions. But a simpler expresion, (a + a) - a is likely to be optimized to a for example. Another example of possible optimization is a*a*a*a to temp = a*a; (temp)*(temp)
Whether a given compiler optimizes the expressions that you quote, can be observed by reading the output assembly code.
No, this type of optimization is not used with floating points by default (unless maybe if the optimizer can prove that no accuracy is lost). See Are floating point operations in C associative? You can let for example gcc do this with -fassociative-math option. At your own peril.

Fast way to convert ARGB texture to next power of 2 texture

We have some old devices that don't support non-pot textures and we have a function that converts ARGB textures to next power of 2 texture. The problem is that it's quite slow and we're wondering if there is a better approach to convert these textures.
void PotTexture()
{
size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;
std::vector<unsigned char> pottedImageData;
pottedImageData.resize(u2 * v2 * 4);
size_t y, x, c;
for (y = 0; y < imageData.height; y++)
{
for (x = 0; x < imageData.width; x++)
{
for (c = 0; c < 4; c++)
{
pottedImageData[4 * u2 * y + 4 * x + c] = imageData.convertedData[4 * imageData.width * y + 4 * x + c];
}
}
}
imageData.width = u2;
imageData.height = v2;
std::swap(imageData.convertedData, pottedImageData);
}
On some devices this can easily use 100% of the CPU so any optimizations would be amazing. Are there any existing functions that I could look at that perform this conversion?
Edit:
I've optimized the above loop slightly to:
for (y = 0; y < imageData.height; y++)
{
memcpy(
&(pottedImageData[y * u2 * 4]),
&(imageData.convertedData[y * imageData.width * 4]),
imageData.width * 4);
}

Even devices that don't support NPOT texture should support NPOT load.
Create the texture as an exact power of 2 and NO CONTENT using glTexImage2D, passing a null pointer for data.
data may be a null pointer. In this case, texture memory is allocated to accommodate a texture of width width and height height. You can then download subtextures to initialize this texture memory. The image is undefined if the user tries to apply an uninitialized portion of the texture image to a primitive.
Then use glTexSubImage2D to upload a NPOT image, which occupies only a portion of the total texture. This can be done without any CPU-side image rearrangement.

Having had a similar problem in a program I wrote, I took a very different approach. Rather than stretch the source texture, I just copied it into the top left corner of an otherwise empty power-of-two texture.
Then in the pixel shader you use a pair of floats to adjust s,t values so you fetch from just the top left corner.
float sAdjust = static_cast<float>(textureWidth) / static_cast<float>(containerWidth)
float tAdjust = static_cast<float>(textureHeight) / static_cast<float>(containerHeight)
is how you compute them, and to use them you'll get a Vec2 holding the s,t coordinates, just multiply s by sAdjust and t by tAdjust before using them to fetch. If you're using Direct3D, it'd be something akin to this:
D3DXVECTOR4 stAdjust;
stAdjust.x = sAdjust;
stAdjust.y = tAdjust;
// Transfer stAdjust into a float4 inside your pixel shader, call it stAdjust in there
Now in the pixel shader assume you have:
float2 texcoord;
float4 stAdjust;
you just say:
texcoord.x = texcoord.x * stAdjust.x;
texcoord.y = texcoord.y * stAdjust.y;
before using texcoord. Sorry I can't tell you how to do this in GLSL, but you get the general idea.

Okay, the very first optimization can be done here:
size_t u2 = 1; while (u2 < imageData.width) u2 *= 2;
size_t v2 = 1; while (v2 < imageData.height) v2 *= 2;
What you want to do is (for each dimension) find the floor of the logarithm-base2 (log2) and put that into 2**n+1. The standard math library has function log2 but it operates on floating point. But we can use is. 2**n can be written as 1 << n. So this gives
size_t const dim_p2_… = 1 << (int)floor(log2(dim_…)+1);
Better but not yet ideal, because of that float conversion. The Bit Twiddling hacks document has a few functions for integer ilog2: https://graphics.stanford.edu/~seander/bithacks.html#IntegerLog
But we're still not optimal. Let me introduce you to Compiler intrinsics, which translate into machine instructions, if the machine in question can do it on the metal.
GNU GCC: int __builtin_ffs (unsigned int x), which returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.
MSVC++: _BitScanReverse, which returns the length of the run of the most significant bits set zero. So _BitScanReverse is like builtin_ffs - 1 (there's also a builtin_clz which behaves exactly like BitScanReverse.
So we can do
#define ilog2_p1(x) (__builtin_ffs(x))
or
#define ilog2_p1(x) (__BitScanReverse(x)+1)
and use that.
size_t const dim_p2_… = 1 << (int)floor(ilog2_p1(dim_…));
While we're at bit twiddling: We can save that whole ordeal if a texture is already in power of two format. A few years ago I (independently) rediscovered the wonderfully portable bit twiddling trick, exploiting the properties of complement-2 integers. You can also find it in the bit twiddles document. But the type neutral, concise macro form is rarely seen. So here it is:
#define ISPOW2(x) ( (x) && !( (x) & ((x) - 1) ) )
You're using C++ so templates are in order:
template<typename T> bool ispow2(T const x) { return x && !( x & (x - 1) ); }
Then Ben Voight already told you, how to use glTexSubImage2D to load that into the texture. Also have a look at the GL_ARB_texture_rectangle extension, that allows to load NPOT textures, but without the ability for mipmapping and advanced filtering. But it might be a viable choice for you.
If you ever feel the need to scale the texture it's always worth looking into dual spaces. In this case the spatial frequency domain dual space. Upscaling a signal is essentially a impulse response. As such it can be described as a convolution. Convolutions usually are O(n²) in complexity. But due to the Fourier Convolution theorem in Fourier space the equivalent is simple multiplication, so it becomes O(n). FFT can be done with O(n log n), so the total complexity is about O(n + 2n log n), which is much better.

C++ (and maths) : fast approximation of a trigonometric function

I know this is a recurring question, but I haven't really found a useful answer yet. I'm basically looking for a fast approximation of the function acos in C++, I'd like to know if I can significantly beat the standard one.
But some of you might have insights on my specific problem: I'm writing a scientific program which I need to be very fast. The complexity of the main algorithm boils down to computing the following expression (many times with different parameters):
sin( acos(t_1) + acos(t_2) + ... + acos(t_n) )
where the t_i are known real (double) numbers, and n is very small (like smaller than 6). I need a precision of at least 1e-10. I'm currently using the standard sin and acos C++ functions.
Do you think I can significantly gain speed somehow? For those of you who know some maths, do you think it would be smart to expand that sine in order to get an algebraic expression in terms of the t_i (only involving square roots)?
Thank you your your answers.

The code below provides simple implementations of sin() and acos() that should satisfy your accuracy requirements and that you might want to try. Please note that the math library implementation on your platform is very likely highly tuned for the specific hardware capabilities of that platform and is probably also coded in assembly for maximum efficiency, so simple compiled C code not catering to specifics of the hardware is unlikely to provide higher performance, even when the accuracy requirements are somewhat relaxed from full double precision. As Viktor Latypov points out, it may also be worthwhile to search for algorithmic alternatives that do not require expensive calls to transcendental math functions.
In the code below I have tried to stick to simple, portable constructs. If your compiler supports the rint() function [specified by C99 and C++11] you might want to use that instead of my_rint(). On some platforms, the call to floor() can be expensive since it requires dynamic changing of machine state. The functions my_rint(), sin_core(), cos_core(), and asin_core() would want to be inlined for best performance. Your compiler may do that automatically at high optimization levels (e.g. when compiling with -O3), or you could add an appropriate inlining attribute to these functions, e.g. inline or __inline depending on your toolchain.
Not knowing anything about your platform I opted for simple polynomial approximations, which are evaluated using Estrin's scheme plus Horner's scheme. See Wikipedia for a description of these evaluation schemes:
http://en.wikipedia.org/wiki/Estrin%27s_scheme ,
http://en.wikipedia.org/wiki/Horner_scheme
The approximations themselves are of the minimax type and were custom generated for this answer with the Remez algorithm:
http://en.wikipedia.org/wiki/Minimax_approximation_algorithm ,
http://en.wikipedia.org/wiki/Remez_algorithm
The identities used in the argument reduction for acos() are noted in the comments, for sin() I used a Cody/Waite-style argument reduction, as described in the following book:
W. J. Cody, W. Waite, Software Manual for the Elementary Functions. Prentice-Hall, 1980
The error bounds mentioned in the comments are approximate, and have not been rigorously tested or proven.
/* not quite rint(), i.e. results not properly rounded to nearest-or-even */
double my_rint (double x)
{
double t = floor (fabs(x) + 0.5);
return (x < 0.0) ? -t : t;
}
/* minimax approximation to cos on [-pi/4, pi/4] with rel. err. ~= 7.5e-13 */
double cos_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using Estrin's scheme */
return (-2.7236370439787708e-7 * x2 + 2.4799852696610628e-5) * x8 +
(-1.3888885054799695e-3 * x2 + 4.1666666636943683e-2) * x4 +
(-4.9999999999963024e-1 * x2 + 1.0000000000000000e+0);
}
/* minimax approximation to sin on [-pi/4, pi/4] with rel. err. ~= 5.5e-12 */
double sin_core (double x)
{
double x4, x2, t;
x2 = x * x;
x4 = x2 * x2;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return ((2.7181216275479732e-6 * x2 - 1.9839312269456257e-4) * x4 +
(8.3333293048425631e-3 * x2 - 1.6666666640797048e-1)) * x2 * x + x;
}
/* minimax approximation to arcsin on [0, 0.5625] with rel. err. ~= 1.5e-11 */
double asin_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return (((4.5334220547132049e-2 * x2 - 1.1226216762576600e-2) * x4 +
(2.6334281471361822e-2 * x2 + 2.0596336163223834e-2)) * x8 +
(3.0582043602875735e-2 * x2 + 4.4630538556294605e-2) * x4 +
(7.5000364034134126e-2 * x2 + 1.6666666300567365e-1)) * x2 * x + x;
}
/* relative error < 7e-12 on [-50000, 50000] */
double my_sin (double x)
{
double q, t;
int quadrant;
/* Cody-Waite style argument reduction */
q = my_rint (x * 6.3661977236758138e-1);
quadrant = (int)q;
t = x - q * 1.5707963267923333e+00;
t = t - q * 2.5633441515945189e-12;
if (quadrant & 1) {
t = cos_core(t);
} else {
t = sin_core(t);
}
return (quadrant & 2) ? -t : t;
}
/* relative error < 2e-11 on [-1, 1] */
double my_acos (double x)
{
double xa, t;
xa = fabs (x);
/* arcsin(x) = pi/2 - 2 * arcsin (sqrt ((1-x) / 2))
* arccos(x) = pi/2 - arcsin(x)
* arccos(x) = 2 * arcsin (sqrt ((1-x) / 2))
*/
if (xa > 0.5625) {
t = 2.0 * asin_core (sqrt (0.5 * (1.0 - xa)));
} else {
t = 1.5707963267948966 - asin_core (xa);
}
/* arccos (-x) = pi - arccos(x) */
return (x < 0.0) ? (3.1415926535897932 - t) : t;
}

sin( acos(t1) + acos(t2) + ... + acos(tn) )
boils down to the calculation of
sin( acos(x) ) and cos(acos(x))=x
because
sin(a+b) = cos(a)sin(b)+sin(a)cos(b).
The first thing is
sin( acos(x) ) = sqrt(1-x*x)
Taylor series expansion for the sqrt reduces the problem to polynomial calculations.
To clarify, here's the expansion to n=2, n=3:
sin( acos(t1) + acos(t2) ) = sin(acos(t1))cos(acos(t2)) + sin(acos(t2))cos(acos(t1) = sqrt(1-t1*t1) * t2 + sqrt(1-t2*t2) * t1
cos( acos(t2) + acos(t3) ) = cos(acos(t2)) cos(acos(t3)) - sin(acos(t2))sin(acos(t3)) = t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)
sin( acos(t1) + acos(t2) + acos(t3)) =
sin(acos(t1))cos(acos(t2) + acos(t3)) + sin(acos(t2)+acos(t3) )cos(acos(t1)=
sqrt(1-t1*t1) * (t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)) + (sqrt(1-t2*t2) * t3 + sqrt(1-t3*t3) * t2 ) * t1
and so on.
The sqrt() for x in (-1,1) can be computed using
x_0 is some approximation, say, zero
x_(n+1) = 0.5 * (x_n + S/x_n) where S is the argument.
EDIT: I mean the "Babylonian method", see Wikipedia's article for details. You will need not more than 5-6 iterations to achieve 1e-10 with x in (0,1).

As Jonas Wielicki mentions in the comments, there isn't much precision trade-offs you can make.
Your best bet is to try and use the processor intrinsics for the functions (if your compiler doesn't do this already) and using some math to reduce the amount of calculations necessary.
Also very important is to keep everything in a CPU-friendly format, make sure there are few cache misses, etc.
If you are calculating large amounts of functions like acos perhaps moving to the GPU is an option for you?

You can try to create lookup tables, and use them instead of standard c++ functions, and see if you see any performance boost.

Significant gains can be made by aligning memory and streaming in the data to your kernel. Most often this dwarfs the gains that can be made by recreating the math functions. Think of how you can improve memory access to/from your kernel operator.
Memory access can be improved by using buffering techniques. This depends on your hardware platform. If you are running this on a DSP, you could DMA your data onto an L2 cache and schedule the instructions so that multiplier units are fully occupied.
If you are on general purpose CPU, most you can do is to use aligned data, feed the cache lines by prefetching. If you have nested loops, then the inner most loop should go back and forth (i.e. iterate forward and then iterate backward) so that cache lines are utilised, etc.
You could also think of ways to parallelize the computation using multiple cores. If you can use a GPU this could significantly improve performance (albeit with a lesser precision).

In addition to what others have said, here are some techniques at speed optimization:
Profile
Find out where in the code most of the time is spent.
Only optimize that area to gain the mose benefit.
Unroll Loops
The processors don't like branches or jumps or changes in the execution path. In general, the processor has to reload the instruction pipeline which uses up time that can be spent on calculations. This includes function calls.
The technique is to place more "sets" of operations in your loop and reduce the number of iterations.
Declare Variables as Register
Variables that are used frequently should be declared as register. Although many members of SO have stated compilers ignore this suggestion, I have found out otherwise. Worst case, you wasted some time typing.
Keep Intense Calculations Short & Simple
Many processors have enough room in their instruction pipelines to hold small for loops. This reduces the amount of time spent reloading the instruction pipeline.
Distribute your big calculation loop into many small ones.
Perform Work on Small Sections of Arrays & Matrices
Many processors have a data cache, which is ultra fast memory very close to the processor. The processor likes to load the data cache once from off-processor memory. More loads require time that can be spent making calculations. Search the web for "Data Oriented Design Cache".
Think in Parallel Processor Terms
Change the design of your calculations so they can be easily adaptable to use with multiple processors. Many CPUs have multiple cores that can execute instructions in parallel. Some processors have enough intelligence to automatically delegate instructions to their multiple cores.
Some compilers can optimize code for parallel processing (look up the compiler options for your compiler). Designing your code for parallel processing will make this optimization easier for the compiler.
Analyze Assembly Listing of Functions
Print out the assembly language listing of your function.
Change the design of your function to match that of the assembly language or to help the compiler generate more optimal assembly language.
If you really need more efficiency, optimize the assembly language and put in as inline assembly code or as a separate module. I generally prefer the latter.
Examples
In your situation, take first 10 terms of the Taylor expansion, calculate them separately and place into individual variables:
double term1, term2, term3, term4;
double n, n1, n2, n3, n4;
n = 1.0;
for (i = 0; i < 100; ++i)
{
n1 = n + 2;
n2 = n + 4;
n3 = n + 6;
n4 = n + 8;
term1 = 4.0/n;
term2 = 4.0/n1;
term3 = 4.0/n2;
term4 = 4.0/n3;
Then sum up all of your terms:
result = term1 - term2 + term3 - term4;
// Or try sorting by operation, if possible:
// result = term1 + term3;
// result -= term2 + term4;
n = n4 + 2;
}

Lets consider two terms first:
cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
or cos(a+b) = cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b))
Taking cos to the RHS
a+b = acos( cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b)) ) ... 1
Here cos(a) = t_1 and cos(b) = t_2
a = acos(t_1) and b = acos(t_2)
By substituting in equation (1), we get
acos(t_1) + acos(t_2) = acos(t_1*t_2 - sqrt(1 - t_1*t_1) * sqrt(1 - t_2*t_2))
Here you can see that you have combined two acos into one. So you can pair up all the acos recursively and form a binary tree. At the end, you'll be left with an expression of the form sin(acos(x)) which equals sqrt(1 - x*x).
This will improve the time complexity.
However, I'm not sure about the complexity of calculating sqrt().

sin and cos are slow, is there an alternatve?

My game needs to move by a certain angle. To do this I get the vector of the angle via sin and cos. Unfortunately sin and cos are my bottleneck. I'm sure I do not need this much precision. Is there an alternative to a C sin & cos and look-up table that is decently precise but very fast?
I had found this:
float Skeleton::fastSin( float x )
{
const float B = 4.0f/pi;
const float C = -4.0f/(pi*pi);
float y = B * x + C * x * abs(x);
const float P = 0.225f;
return P * (y * abs(y) - y) + y;
}
Unfortunately, this does not seem to work. I get significantly different behavior when I use this sin rather than C sin.
Thanks

A lookup table is the standard solution. You could Also use two lookup tables on for degrees and one for tenths of degrees and utilize sin(A + B) = sin(a)cos(b) + cos(A)sin(b)

For your fastSin(), you should check its documentation to see what range it's valid on. The units you're using for your game could be too big or too small and scaling them to fit within that function's expected range could make it work better.
EDIT:
Someone else mentioned getting it into the desired range by subtracting PI, but apparently there's a function called fmod for doing modulus division on floats/doubles, so this should do it:
#include <iostream>
#include <cmath>
float fastSin( float x ){
x = fmod(x + M_PI, M_PI * 2) - M_PI; // restrict x so that -M_PI < x < M_PI
const float B = 4.0f/M_PI;
const float C = -4.0f/(M_PI*M_PI);
float y = B * x + C * x * std::abs(x);
const float P = 0.225f;
return P * (y * std::abs(y) - y) + y;
}
int main() {
std::cout << fastSin(100.0) << '\n' << std::sin(100.0) << std::endl;
}
I have no idea how expensive fmod is though, so I'm going to try a quick benchmark next.
Benchmark Results
I compiled this with -O2 and ran the result with the Unix time program:
int main() {
float a = 0;
for(int i = 0; i < REPETITIONS; i++) {
a += sin(i); // or fastSin(i);
}
std::cout << a << std::endl;
}
The result is that sin is about 1.8x slower (if fastSin takes 5 seconds, sin takes 9). The accuracy also seemed to be pretty good.
If you chose to go this route, make sure to compile with optimization on (-O2 in gcc).

I know this is already an old topic, but for people who have the same question, here is a tip.
A lot of times in 2D and 3D rotation, all vectors are rotated with a fixed angle. In stead of calling the cos() or sin() every cycle of the loop, create variable before the loop which contains the value of cos(angle) or sin(angle) already. You can use this variable in your loop. This way the function only has to be called once.

If you rephrase the return in fastSin as
return (1-P) * y + P * (y * abs(y))
And rewrite y as (for x>0 )
y = 4 * x * (pi-x) / (pi * pi)
you can see that y is a parabolic first-order approximation to sin(x) chosen so that it passes through (0,0), (pi/2,1) and (pi,0), and is symmetrical about x=pi/2.
Thus we can only expect our function to be a good approximation from 0 to pi. If we want values outside that range we can use the 2-pi periodicity of sin(x) and that sin(x+pi) = -sin(x).
The y*abs(y) is a "correction term" which also passes through those three points. (I'm not sure why y*abs(y) is used rather than just y*y since y is positive in the 0-pi range).
This form of overall approximation function guarantees that a linear blend of the two functions y and y*y, (1-P)*y + P * y*y will also pass through (0,0), (pi/2,1) and (pi,0).
We might expect y to be a decent approximation to sin(x), but the hope is that by picking a good value for P we get a better approximation.
One question is "How was P chosen?". Personally, I'd chose the P that produced the least RMS error over the 0,pi/2 interval. (I'm not sure that's how this P was chosen though)
Minimizing this wrt. P gives
This can be rearranged and solved for p
Wolfram alpha evaluates the initial integral to be the quadratic
E = (16 π^5 p^2 - (96 π^5 + 100800 π^2 - 967680)p + 651 π^5 - 20160 π^2)/(1260 π^4)
which has a minimum of
min(E) = -11612160/π^9 + 2419200/π^7 - 126000/π^5 - 2304/π^4 + 224/π^2 + (169 π)/420
≈ 5.582129689596371e-07
at
p = 3 + 30240/π^5 - 3150/π^3
≈ 0.2248391013559825
Which is pretty close to the specified P=0.225.
You can raise the accuracy of the approximation by adding an additional correction term. giving a form something like return (1-a-b)*y + a y * abs(y) + b y * y * abs(y). I would find a and b by in the same way as above, this time giving a system of two linear equations in a and b to solve, rather than a single equation in p. I'm not going to do the derivation as it is tedious and the conversion to latex images is painful... ;)
NOTE: When answering another question I thought of another valid choice for P.
The problem is that using reflection to extend the curve into (-pi,0) leaves a kink in the curve at x=0. However, I suspect we can choose P such that the kink becomes smooth.
To do this take the left and right derivatives at x=0 and ensure they are equal. This gives an equation for P.

You can compute a table S of 256 values, from sin(0) to sin(2 * pi). Then, to pick sin(x), bring back x in [0, 2 * pi], you can pick 2 values S[a], S[b] from the table, such as a < x < b. From this, linear interpolation, and you should have a fair approximation
memory saving trick : you actually need to store only from [0, pi / 2], and use symmetries of sin(x)
enhancement trick : linear interpolation can be a problem because of non-smooth derivatives, humans eyes is good at spotting such glitches in animation and graphics. Use cubic interpolation then.

What about
x*(0.0174532925199433-8.650935142277599*10^-7*x^2)
for deg and
x*(1-0.162716259904269*x^2)
for rad on -45, 45 and -pi/4 , pi/4 respectively?

This (i.e. the fastsin function) is approximating the sine function using a parabola. I suspect it's only good for values between -π and +π. Fortunately, you can keep adding or subtracting 2π until you get into this range. (Edited to specify what is approximating the sine function using a parabola.)

you can use this aproximation.
this solution use a quadratic curve :
http://www.starming.com/index.php?action=plugin&v=wave&ajax=iframe&iframe=fullviewonepost&mid=56&tid=4825

Create sine lookup table in C++

How can I rewrite the following pseudocode in C++?
real array sine_table[-1000..1000]
for x from -1000 to 1000
sine_table[x] := sine(pi * x / 1000)
I need to create a sine_table lookup table.

You can reduce the size of your table to 25% of the original by only storing values for the first quadrant, i.e. for x in [0,pi/2].
To do that your lookup routine just needs to map all values of x to the first quadrant using simple trig identities:
sin(x) = - sin(-x), to map from quadrant IV to I
sin(x) = sin(pi - x), to map from quadrant II to I
To map from quadrant III to I, apply both identities, i.e. sin(x) = - sin (pi + x)
Whether this strategy helps depends on how much memory usage matters in your case. But it seems wasteful to store four times as many values as you need just to avoid a comparison and subtraction or two during lookup.
I second Jeremy's recommendation to measure whether building a table is better than just using std::sin(). Even with the original large table, you'll have to spend cycles during each table lookup to convert the argument to the closest increment of pi/1000, and you'll lose some accuracy in the process.
If you're really trying to trade accuracy for speed, you might try approximating the sin() function using just the first few terms of the Taylor series expansion.
sin(x) = x - x^3/3! + x^5/5! ..., where ^ represents raising to a power and ! represents the factorial.
Of course, for efficiency, you should precompute the factorials and make use of the lower powers of x to compute higher ones, e.g. use x^3 when computing x^5.
One final point, the truncated Taylor series above is more accurate for values closer to zero, so its still worthwhile to map to the first or fourth quadrant before computing the approximate sine.
Addendum:
Yet one more potential improvement based on two observations:
1. You can compute any trig function if you can compute both the sine and cosine in the first octant [0,pi/4]
2. The Taylor series expansion centered at zero is more accurate near zero
So if you decide to use a truncated Taylor series, then you can improve accuracy (or use fewer terms for similar accuracy) by mapping to either the sine or cosine to get the angle in the range [0,pi/4] using identities like sin(x) = cos(pi/2-x) and cos(x) = sin(pi/2-x) in addition to the ones above (for example, if x > pi/4 once you've mapped to the first quadrant.)
Or if you decide to use a table lookup for both the sine and cosine, you could get by with two smaller tables that only covered the range [0,pi/4] at the expense of another possible comparison and subtraction on lookup to map to the smaller range. Then you could either use less memory for the tables, or use the same memory but provide finer granularity and accuracy.

long double sine_table[2001];
for (int index = 0; index < 2001; index++)
{
sine_table[index] = std::sin(PI * (index - 1000) / 1000.0);
}

One more point: calling trigonometric functions is pricey. if you want to prepare the lookup table for sine with constant step - you may save the calculation time, in expense of some potential precision loss.
Consider your minimal step is "a". That is, you need sin(a), sin(2a), sin(3a), ...
Then you may do the following trick: First calculate sin(a) and cos(a). Then for every consecutive step use the following trigonometric equalities:
sin([n+1] * a) = sin(n*a) * cos(a) + cos(n*a) * sin(a)
cos([n+1] * a) = cos(n*a) * cos(a) - sin(n*a) * sin(a)
The drawback of this method is that during this procedure the round-off error is accumulated.

double table[1000] = {0};
for (int i = 1; i <= 1000; i++)
{
sine_table[i-1] = std::sin(PI * i/ 1000.0);
}
double getSineValue(int multipleOfPi){
if(multipleOfPi == 0) return 0.0;
int sign = 1;
if(multipleOfPi < 0){
sign = -1;
}
return signsine_table[signmultipleOfPi - 1];
}
You can reduce the array length to 500, by a trick sin(pi/2 +/- angle) = +/- cos(angle).
So store sin and cos from 0 to pi/4.
I don't remember from top of my head but it increased the speed of my program.

You'll want the std::sin() function from <cmath>.

another approximation from a book or something
streamin ramp;
streamout sine;
float x,rect,k,i,j;
x = ramp -0.5;
rect = x * (1 - x < 0 & 2);
k = (rect + 0.42493299) *(rect -0.5) * (rect - 0.92493302) ;
i = 0.436501 + (rect * (rect + 1.05802));
j = 1.21551 + (rect * (rect - 2.0580201));
sine = i*j*k*60.252201*x;
full discussion here:
http://synthmaker.co.uk/forum/viewtopic.php?f=4&t=6457&st=0&sk=t&sd=a
I presume that you know, that using a division is a lot slower than multiplying by decimal number, /5 is always slower than *0.2
it's just an approximation.
also:
streamin ramp;
streamin x; // 1.5 = Saw 3.142 = Sin 4.5 = SawSin
streamout sine;
float saw,saw2;
saw = (ramp * 2 - 1) * x;
saw2 = saw * saw;
sine = -0.166667 + saw2 * (0.00833333 + saw2 * (-0.000198409 + saw2 * (2.7526e-006+saw2 * -2.39e-008)));
sine = saw * (1+ saw2 * sine);

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js