qFastSin and qFastCos (Speed, safety and precision)

qFastSin and qFastCos (Speed, safety and precision) - c++

Recently I found two mathematical functions in qmath.h named qFastSin and qFastCos. These functions are inline and uses look-up tables to calculate the value of sin and cos:
inline qreal qFastSin(qreal x)
{
// Calculating si would be more accurate with qRound, but slower.
int si = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI));
qreal d = x - si * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
int ci = si + QT_SINE_TABLE_SIZE / 4;
si &= QT_SINE_TABLE_SIZE - 1;
ci &= QT_SINE_TABLE_SIZE - 1;
return qt_sine_table[si] + (qt_sine_table[ci] - 0.5 * qt_sine_table[si] * d) * d;
}
inline qreal qFastCos(qreal x)
{
// Calculating ci would be more accurate with qRound, but slower.
int ci = int(x * (0.5 * QT_SINE_TABLE_SIZE / M_PI));
qreal d = x - ci * (2.0 * M_PI / QT_SINE_TABLE_SIZE);
int si = ci + QT_SINE_TABLE_SIZE / 4;
si &= QT_SINE_TABLE_SIZE - 1;
ci &= QT_SINE_TABLE_SIZE - 1;
return qt_sine_table[si] - (qt_sine_table[ci] + 0.5 * qt_sine_table[si] * d) * d;
}
I searched Google and Qt-Assistant for information about them, but there is no good documentaion.
Does anybody know about precision and performance of these function? (Specially precision)

They are not part of the public API, not supported, not documented, and subject to change.
Qt only documents what it supports and it only supports what it documents. It's good like that.
It looks like a simple linear interpolation so accuracy depends on QT_SINE_TABLE_SIZE and also how close to a sample point the input happens to be. The worse case error will then be 1-sin(pi/2 + 2*pi*(QT_SINE_TABLE_SIZE/2))
If you care about performance more than accuracy then you can use them in practice but in theory they may be removed entirely in future.

I tested using random values and millions of iterations and the maximum error that I got was ~0.00000246408 compared to std::sin/cos. It also seems to be ~5x faster.

Related

Performance issues with simple calculations

EDIT 2: 16% decrease in program computation time! See bottom for calculation
I have written a N-body simulator, implementing the Barnes-Hut algorithm. Now I have a innocent looking function called CheckNode. Its simple and doesn't take long to compute, but the issue is, it gets called millions of times, so it takes up most of the calculation time between each frame.
I profiled the code, and this function is responsible for 84.58% of the total calculation time, and this is with only 10K particles, when I do it with up to 10x this, this function uses a greater and greater percentage.
Now here is the function, with percentage of time spent on the right in the red.
Now there are some alarming things here, Like a simple if statement taking 9.17% and another if statement accounting for over 20% of computation time! Is there any, even the slightest optimisation that can be done here, that would be multiplied over millions of function calls to allow my program to run faster?
EDIT:
Here is the CalculateForceNode function:
void CalculateForceNode(Body* bi, Node* bj) //bi is being attracted to bj. 15 flops of calculation
{
//vector from the body to the center of mass
double vectorx = bj->CenterOfMassx - bi->posX;
double vectory = bj->CenterOfMassy - bi->posY;
//c^2 = a^2 + b^2 + softener^2
double distSqr = vectorx * vectorx + vectory * vectory + Softener * Softener;
// ivnDistCube = 1/distSqr^(3/2)
double distSixth = distSqr * distSqr * distSqr;
double invDistCube = 1.0f / (sqrt(distSixth));
double Accel = (bj->TotalMass * invDistCube * _GRAV_CONST);
bi->AccelX += vectorx * Accel;
bi->AccelY += vectory * Accel;
}
EDIT 2:
Results of optimisations
The CheckNode function now takes up 82.03% of the total computation time (measured over a 1 min 37 sec sample), as opposed to previously it took up 84.58%.
Now logic tells that the remaining 15% of calculation time, took the same as the remaining 18% calculation time of the second program. So these identical periods (Its the same code) took 15% of the first program, and 18% of the second program. Letting the time to complete this other code be x the 1st program took 1/0.15 = 6.666x and the second took 1/0.18 = 5.555x. Then you can find the fraction that 5.555x is of 6.666x which calculates to be ~0.83 and therefor there was a (1 - 0.83 = 0.16) 16% decrease in program computation time!

First thing I would try is to reverse the elements in one of your conditions, replace:
if(withSqr / distanceSqr < nodeThresholdSqr || pNode->HasChildren == false)
with:
if(pNode->HasChildren == false || (withSqr / distanceSqr < nodeThresholdSqr))
If the first part of the condition is true pNode->HasChildren == false than the second one (withSqr / distanceSqr < nodeThresholdSqr) will never be executed (read: evaluated). Checking simple condition is much faster than operations on floating point numbers (division in your case). You can even take it to the next level: *do you need to compute the distanceSqr AT ALL when pNode->HasChildren == false ?
EDIT: even better:
if(pNode->HasChildren == false)
{
CalculateForceNode(pBody,pNode);
}
else
{
double distanceSqr = ((diffX * diffX) + (diffY * diffY));
double withSqr = pNode->width * pNode->width;
if(withSqr / distanceSqr < nodeThresholdSqr)
{
CalculateForceNode(pBody,pNode);
}
else
{//if not, repeat function with child
if(pNode->Child[0]->Bodies.size() > 0)
CheckNode(pNode->Child[0],pBody);
//..... - all the rest of your code
}
}

Profiling based on time spent is not enough, you need to know what was this time spent in - in other words use a more advanced profiler.
Also you don't mention any information about compiler or platform you are using.
For the if statement that is using 9% of the time, I don't think it is spent in the comparison, it is spent in fetching data. You have multiple levels of indirection (accessing data using pointer that takes you to another pointer and so on). This is bad for caching and branch prediction, and I guess you are spending time fetching data from memory or doing useless calculations because of branch miss prediction, not doing the actual comparison.
another note that I noticed: if (pNode->HasChildren == false) then you don't need all the calculations you made to find widthSqr. I think you should restructure your logic to check for this first, if the condition is false then you can can calculate widthSqr and continue your logic.

Since the function is called a lot of times you should get rid of the overhead of calling CalculateForceNode(...) by manually inlining the code. One you do this you will notice other tricks to apply:
void CheckNode(Node* pNode, Body* pBody)
{
double diffX = (pNode->CenterOfMass - pBody->posX);
double diffY = (pNode->CenterOfMass - pBody->posY);
double distanceSqr = ((diffX * diffX) + (diffY * diffY));
double widthSqr = pNode->width * pNode->width;
if (widthSqr / distanceSqr < NodeThresholdSqr || pNode->hasChildren == false)
{
//vector from the body to the center of mass
double vectorx = pNode->CenterOfMassx - pBody->posX;
double vectory = pNode->CenterOfMassy - pBody->posY;
//c^2 = a^2 + b^2 + softener^2
double distSqr = vectorx * vectorx + vectory * vectory + Softener * Softener;
// ivnDistCube = 1/distSqr^(3/2)
double distSixth = distSqr * distSqr * distSqr;
double invDistCube = 1.0f / (sqrt(distSixth));
double Accel = (pNode->TotalMass * invDistCube * _GRAV_CONST);
pBody->AccelX += vectorx * Accel;
pBody->AccelY += vectory * Accel;
}
else
{
CheckChildren(pNode, pBody);
}
}
Now you can see that diffX = vectorx, diffY = vectory, distSqr = distanceSqr*Softner*Softner. Reusing some of the calculation already made and precomputing whatever is possible should save you some cycles:
void CheckNode(Node* pNode, Body* pBody)
{
double diffX = (pNode->CenterOfMass - pBody->posX);
double diffY = (pNode->CenterOfMass - pBody->posY);
double distanceSqr = ((diffX * diffX) + (diffY * diffY));
double widthSqr = pNode->width * pNode->width;
double SoftnerSq = Softener * Softener; //precompute this value
if (widthSqr / distanceSqr < NodeThresholdSqr || pNode->hasChildren == false)
{
//c^2 = a^2 + b^2 + softener^2
double distSqr = distanceSqr + SoftnerSq;
// ivnDistCube = 1/distSqr^(3/2)
double distSixth = distSqr * distSqr * distSqr;
double invDistCube = 1.0f / (sqrt(distSixth));
double Accel = (pNode->TotalMass * invDistCube * _GRAV_CONST);
pBody->AccelX += diffX * Accel;
pBody->AccelY += diffY * Accel;
}
else
{
CheckChildren(pNode, pBody);
}
}
Hope this works for you.

First you should inline function Bodies.size() or access size directly so there is no overhead with function calling (it takes time to push all needed information to stack and pop it off).
I don't see all the code, but it looks like you can precalculate widthSqr. It can be calculated when the width is assigned not in the function.
You are using a lot of pointers here and it looks like your structures are scattered all over the memory. This will generate a lot CPU cache misses. Make sure that all the data needed for computation are in one, long, continuous and compact memory area.
In CalculateForceNode check if Softener*Softener can be precalculated. sqrt function is very time consuming. sqrt algorithm is iterative so you can sacrifice accuracy for speed by doing less iterations or you can use Look up tables.
You are doing the same calculations twice in CalculateForceNode.
void CalculateForceNode(Body* bi, Node* bj)
{
//vector from the body to the center of mass
double vectorx = bj->CenterOfMassx - bi->posX;
double vectory = bj->CenterOfMassy - bi->posY;
//c^2 = a^2 + b^2 + softener^2
double distSqr = vectorx * vectorx + vectory * vectory...
vectorx,vectory and distSqr were already calulated in CheckNode as diffX, diffY and distanceSqr. Manually inline whole function CalculateForceNode.

Swap your if statement around and move all your calculations inside the pNode->hasChildren == false part:
void CheckChildren(Node* pNode, Body* pBody)
{
if (pNode->Child[0]->Bodies.size() > 0)
CheckNode(...
}
void CheckNode(Node* pNode, Body* pBody)
{
if (pNode->hasChildren != false)
{
double diffX = (pNode->CenterOfMass - pBody->posX);
double diffY = (pNode->CenterOfMass - pBody->posY);
double distanceSqr = ((diffX * diffX) + (diffY * diffY));
double widthSqr = pNode->width * pNode->width;
if (widthSqr / distanceSqr < NodeThresholdSqr)
{
CalculateForceNode(pBody, pNode);
}
else
{
CheckChildren(pNode, pBody);
}
}
else
{
CheckChildren(pNode, pBody);
}
}

Fast approximate float division

On modern processors, float division is a good order of magnitude slower than float multiplication (when measured by reciprocal throughput).
I'm wondering if there are any algorithms out there for computating a fast approximation to x/y, given certain assumptions and tolerance levels. For example, if you assume that 0<x<y, and are willing to accept any output that is within 10% of the true value, are there algorithms faster than the built-in FDIV operation?

I hope that this helps because this is probably as close as your going to get to what you are looking for.
__inline__ double __attribute__((const)) divide( double y, double x ) {
// calculates y/x
union {
double dbl;
unsigned long long ull;
} u;
u.dbl = x; // x = x
u.ull = ( 0xbfcdd6a18f6a6f52ULL - u.ull ) >> (unsigned char)1;
// pow( x, -0.5 )
u.dbl *= u.dbl; // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0/x
return u.dbl * y; // (1.0/x) * y = y/x
}
See also:
Another post about reciprocal approximation.
The Wikipedia page.

FDIV is usually exceptionally slower than FMUL just b/c it can't be piped like multiplication and requires multiple clk cycles for iterative convergence HW seeking process.
Easiest way is to simply recognize that division is nothing more than the multiplication of the dividend y and the inverse of the divisor x. The not so straight forward part is remembering a float value x = m * 2 ^ e & its inverse x^-1 = (1/m)*2^(-e) = (2/m)*2^(-e-1) = p * 2^q approximating this new mantissa p = 2/m = 3-x, for 1<=m<2. This gives a rough piece-wise linear approximation of the inverse function, however we can do a lot better by using an iterative Newton Root Finding Method to improve that approximation.
let w = f(x) = 1/x, the inverse of this function f(x) is found by solving for x in terms of w or x = f^(-1)(w) = 1/w. To improve the output with the root finding method we must first create a function whose zero reflects the desired output, i.e. g(w) = 1/w - x, d/dw(g(w)) = -1/w^2.
w[n+1]= w[n] - g(w[n])/g'(w[n]) = w[n] + w[n]^2 * (1/w[n] - x) = w[n] * (2 - x*w[n])
w[n+1] = w[n] * (2 - x*w[n]), when w[n]=1/x, w[n+1]=1/x*(2-x*1/x)=1/x
These components then add to get the final piece of code:
float inv_fast(float x) {
union { float f; int i; } v;
float w, sx;
int m;
sx = (x < 0) ? -1:1;
x = sx * x;
v.i = (int)(0x7EF127EA - *(uint32_t *)&x);
w = x * v.f;
// Efficient Iterative Approximation Improvement in horner polynomial form.
v.f = v.f * (2 - w); // Single iteration, Err = -3.36e-3 * 2^(-flr(log2(x)))
// v.f = v.f * ( 4 + w * (-6 + w * (4 - w))); // Second iteration, Err = -1.13e-5 * 2^(-flr(log2(x)))
// v.f = v.f * (8 + w * (-28 + w * (56 + w * (-70 + w *(56 + w * (-28 + w * (8 - w))))))); // Third Iteration, Err = +-6.8e-8 * 2^(-flr(log2(x)))
return v.f * sx;
}

sin function not accurate compared to math lib sin

I have been trying to implement a custom sin function that is fast, but more importantly, accurate (I cannot use math.h sin in my project). I'm not an expert when it comes to this kind of math, so work with me XD. After a little searching on the web i found the following code, the function is returning inaccurate results in certain cases.
float SinF(float X)
{
float Sine;
if (X < -3.14159265F) X += 6.28318531F;
else if (X > 3.14159265F) X -= 6.28318531F;
if (X < 0)
{
Sine = 1.27323954F * X + .405284735F * X * X;
if (Sine < 0) Sine = .225F * (Sine *-Sine - Sine) + Sine;
else Sine = .225F * (Sine * Sine - Sine) + Sine;
}
else
{
Sine = 1.27323954F * X - 0.405284735F * X * X;
if (Sine < 0) Sine = .225F * (Sine *-Sine - Sine) + Sine;
else Sine = .225F * (Sine * Sine - Sine) + Sine;
}
return Sine;
}
Examples:
Bad result example 1:
Value Passed: 1.57079637
Returned Value: 0.999999881
Correct Value: 1.00000000
Bad result example 2:
Value Passed: 1.76704633
Returned Value: 0.980933487
Correct Value: 0.980804682
Bad result example 3:
Value Passed: 1.96329641
Returned Value: 0.924392164
Correct Value: 0.923955679
Any help would be appreciated.

There's a bunch of potential implementations of sin and friends in this SO question, but typically it boils down to a few usual methods:
Built-in processor code (fsin)
Taylor series
CORDIC
Lookup tables with optional linear (or better) interpolation (mainly for speed, less accurate)
There (lots) of other methods but these are the more common ones I've seen.
Also be aware of the inherent precision limits of floating point (as user657267 linked to). For example, 1.57079637 is not exactly pi/2 so its sin() may not be exactly 1. In fact, all your "correct" values listed are not perfectly accurate. You have to decide just how accurate is good enough for your application.

C++ (and maths) : fast approximation of a trigonometric function

I know this is a recurring question, but I haven't really found a useful answer yet. I'm basically looking for a fast approximation of the function acos in C++, I'd like to know if I can significantly beat the standard one.
But some of you might have insights on my specific problem: I'm writing a scientific program which I need to be very fast. The complexity of the main algorithm boils down to computing the following expression (many times with different parameters):
sin( acos(t_1) + acos(t_2) + ... + acos(t_n) )
where the t_i are known real (double) numbers, and n is very small (like smaller than 6). I need a precision of at least 1e-10. I'm currently using the standard sin and acos C++ functions.
Do you think I can significantly gain speed somehow? For those of you who know some maths, do you think it would be smart to expand that sine in order to get an algebraic expression in terms of the t_i (only involving square roots)?
Thank you your your answers.

The code below provides simple implementations of sin() and acos() that should satisfy your accuracy requirements and that you might want to try. Please note that the math library implementation on your platform is very likely highly tuned for the specific hardware capabilities of that platform and is probably also coded in assembly for maximum efficiency, so simple compiled C code not catering to specifics of the hardware is unlikely to provide higher performance, even when the accuracy requirements are somewhat relaxed from full double precision. As Viktor Latypov points out, it may also be worthwhile to search for algorithmic alternatives that do not require expensive calls to transcendental math functions.
In the code below I have tried to stick to simple, portable constructs. If your compiler supports the rint() function [specified by C99 and C++11] you might want to use that instead of my_rint(). On some platforms, the call to floor() can be expensive since it requires dynamic changing of machine state. The functions my_rint(), sin_core(), cos_core(), and asin_core() would want to be inlined for best performance. Your compiler may do that automatically at high optimization levels (e.g. when compiling with -O3), or you could add an appropriate inlining attribute to these functions, e.g. inline or __inline depending on your toolchain.
Not knowing anything about your platform I opted for simple polynomial approximations, which are evaluated using Estrin's scheme plus Horner's scheme. See Wikipedia for a description of these evaluation schemes:
http://en.wikipedia.org/wiki/Estrin%27s_scheme ,
http://en.wikipedia.org/wiki/Horner_scheme
The approximations themselves are of the minimax type and were custom generated for this answer with the Remez algorithm:
http://en.wikipedia.org/wiki/Minimax_approximation_algorithm ,
http://en.wikipedia.org/wiki/Remez_algorithm
The identities used in the argument reduction for acos() are noted in the comments, for sin() I used a Cody/Waite-style argument reduction, as described in the following book:
W. J. Cody, W. Waite, Software Manual for the Elementary Functions. Prentice-Hall, 1980
The error bounds mentioned in the comments are approximate, and have not been rigorously tested or proven.
/* not quite rint(), i.e. results not properly rounded to nearest-or-even */
double my_rint (double x)
{
double t = floor (fabs(x) + 0.5);
return (x < 0.0) ? -t : t;
}
/* minimax approximation to cos on [-pi/4, pi/4] with rel. err. ~= 7.5e-13 */
double cos_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using Estrin's scheme */
return (-2.7236370439787708e-7 * x2 + 2.4799852696610628e-5) * x8 +
(-1.3888885054799695e-3 * x2 + 4.1666666636943683e-2) * x4 +
(-4.9999999999963024e-1 * x2 + 1.0000000000000000e+0);
}
/* minimax approximation to sin on [-pi/4, pi/4] with rel. err. ~= 5.5e-12 */
double sin_core (double x)
{
double x4, x2, t;
x2 = x * x;
x4 = x2 * x2;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return ((2.7181216275479732e-6 * x2 - 1.9839312269456257e-4) * x4 +
(8.3333293048425631e-3 * x2 - 1.6666666640797048e-1)) * x2 * x + x;
}
/* minimax approximation to arcsin on [0, 0.5625] with rel. err. ~= 1.5e-11 */
double asin_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return (((4.5334220547132049e-2 * x2 - 1.1226216762576600e-2) * x4 +
(2.6334281471361822e-2 * x2 + 2.0596336163223834e-2)) * x8 +
(3.0582043602875735e-2 * x2 + 4.4630538556294605e-2) * x4 +
(7.5000364034134126e-2 * x2 + 1.6666666300567365e-1)) * x2 * x + x;
}
/* relative error < 7e-12 on [-50000, 50000] */
double my_sin (double x)
{
double q, t;
int quadrant;
/* Cody-Waite style argument reduction */
q = my_rint (x * 6.3661977236758138e-1);
quadrant = (int)q;
t = x - q * 1.5707963267923333e+00;
t = t - q * 2.5633441515945189e-12;
if (quadrant & 1) {
t = cos_core(t);
} else {
t = sin_core(t);
}
return (quadrant & 2) ? -t : t;
}
/* relative error < 2e-11 on [-1, 1] */
double my_acos (double x)
{
double xa, t;
xa = fabs (x);
/* arcsin(x) = pi/2 - 2 * arcsin (sqrt ((1-x) / 2))
* arccos(x) = pi/2 - arcsin(x)
* arccos(x) = 2 * arcsin (sqrt ((1-x) / 2))
*/
if (xa > 0.5625) {
t = 2.0 * asin_core (sqrt (0.5 * (1.0 - xa)));
} else {
t = 1.5707963267948966 - asin_core (xa);
}
/* arccos (-x) = pi - arccos(x) */
return (x < 0.0) ? (3.1415926535897932 - t) : t;
}

sin( acos(t1) + acos(t2) + ... + acos(tn) )
boils down to the calculation of
sin( acos(x) ) and cos(acos(x))=x
because
sin(a+b) = cos(a)sin(b)+sin(a)cos(b).
The first thing is
sin( acos(x) ) = sqrt(1-x*x)
Taylor series expansion for the sqrt reduces the problem to polynomial calculations.
To clarify, here's the expansion to n=2, n=3:
sin( acos(t1) + acos(t2) ) = sin(acos(t1))cos(acos(t2)) + sin(acos(t2))cos(acos(t1) = sqrt(1-t1*t1) * t2 + sqrt(1-t2*t2) * t1
cos( acos(t2) + acos(t3) ) = cos(acos(t2)) cos(acos(t3)) - sin(acos(t2))sin(acos(t3)) = t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)
sin( acos(t1) + acos(t2) + acos(t3)) =
sin(acos(t1))cos(acos(t2) + acos(t3)) + sin(acos(t2)+acos(t3) )cos(acos(t1)=
sqrt(1-t1*t1) * (t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)) + (sqrt(1-t2*t2) * t3 + sqrt(1-t3*t3) * t2 ) * t1
and so on.
The sqrt() for x in (-1,1) can be computed using
x_0 is some approximation, say, zero
x_(n+1) = 0.5 * (x_n + S/x_n) where S is the argument.
EDIT: I mean the "Babylonian method", see Wikipedia's article for details. You will need not more than 5-6 iterations to achieve 1e-10 with x in (0,1).

As Jonas Wielicki mentions in the comments, there isn't much precision trade-offs you can make.
Your best bet is to try and use the processor intrinsics for the functions (if your compiler doesn't do this already) and using some math to reduce the amount of calculations necessary.
Also very important is to keep everything in a CPU-friendly format, make sure there are few cache misses, etc.
If you are calculating large amounts of functions like acos perhaps moving to the GPU is an option for you?

You can try to create lookup tables, and use them instead of standard c++ functions, and see if you see any performance boost.

Significant gains can be made by aligning memory and streaming in the data to your kernel. Most often this dwarfs the gains that can be made by recreating the math functions. Think of how you can improve memory access to/from your kernel operator.
Memory access can be improved by using buffering techniques. This depends on your hardware platform. If you are running this on a DSP, you could DMA your data onto an L2 cache and schedule the instructions so that multiplier units are fully occupied.
If you are on general purpose CPU, most you can do is to use aligned data, feed the cache lines by prefetching. If you have nested loops, then the inner most loop should go back and forth (i.e. iterate forward and then iterate backward) so that cache lines are utilised, etc.
You could also think of ways to parallelize the computation using multiple cores. If you can use a GPU this could significantly improve performance (albeit with a lesser precision).

In addition to what others have said, here are some techniques at speed optimization:
Profile
Find out where in the code most of the time is spent.
Only optimize that area to gain the mose benefit.
Unroll Loops
The processors don't like branches or jumps or changes in the execution path. In general, the processor has to reload the instruction pipeline which uses up time that can be spent on calculations. This includes function calls.
The technique is to place more "sets" of operations in your loop and reduce the number of iterations.
Declare Variables as Register
Variables that are used frequently should be declared as register. Although many members of SO have stated compilers ignore this suggestion, I have found out otherwise. Worst case, you wasted some time typing.
Keep Intense Calculations Short & Simple
Many processors have enough room in their instruction pipelines to hold small for loops. This reduces the amount of time spent reloading the instruction pipeline.
Distribute your big calculation loop into many small ones.
Perform Work on Small Sections of Arrays & Matrices
Many processors have a data cache, which is ultra fast memory very close to the processor. The processor likes to load the data cache once from off-processor memory. More loads require time that can be spent making calculations. Search the web for "Data Oriented Design Cache".
Think in Parallel Processor Terms
Change the design of your calculations so they can be easily adaptable to use with multiple processors. Many CPUs have multiple cores that can execute instructions in parallel. Some processors have enough intelligence to automatically delegate instructions to their multiple cores.
Some compilers can optimize code for parallel processing (look up the compiler options for your compiler). Designing your code for parallel processing will make this optimization easier for the compiler.
Analyze Assembly Listing of Functions
Print out the assembly language listing of your function.
Change the design of your function to match that of the assembly language or to help the compiler generate more optimal assembly language.
If you really need more efficiency, optimize the assembly language and put in as inline assembly code or as a separate module. I generally prefer the latter.
Examples
In your situation, take first 10 terms of the Taylor expansion, calculate them separately and place into individual variables:
double term1, term2, term3, term4;
double n, n1, n2, n3, n4;
n = 1.0;
for (i = 0; i < 100; ++i)
{
n1 = n + 2;
n2 = n + 4;
n3 = n + 6;
n4 = n + 8;
term1 = 4.0/n;
term2 = 4.0/n1;
term3 = 4.0/n2;
term4 = 4.0/n3;
Then sum up all of your terms:
result = term1 - term2 + term3 - term4;
// Or try sorting by operation, if possible:
// result = term1 + term3;
// result -= term2 + term4;
n = n4 + 2;
}

Lets consider two terms first:
cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
or cos(a+b) = cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b))
Taking cos to the RHS
a+b = acos( cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b)) ) ... 1
Here cos(a) = t_1 and cos(b) = t_2
a = acos(t_1) and b = acos(t_2)
By substituting in equation (1), we get
acos(t_1) + acos(t_2) = acos(t_1*t_2 - sqrt(1 - t_1*t_1) * sqrt(1 - t_2*t_2))
Here you can see that you have combined two acos into one. So you can pair up all the acos recursively and form a binary tree. At the end, you'll be left with an expression of the form sin(acos(x)) which equals sqrt(1 - x*x).
This will improve the time complexity.
However, I'm not sure about the complexity of calculating sqrt().

Improving memory layout for parallel computing

I'm trying to optimize an algorithm (Lattice Boltzmann) for parallel computing using C++ AMP. And looking for some suggestions to optimize the memory layout, just found out that removing one parameter from the structure into another vector (the blocked vector) gave and increase of about 10%.
Anyone got any tips that can further improve this, or something i should take into consideration?
Below is the most time consuming function that is executed for each timestep, and the structure used for the layout.
struct grid_cell {
// int blocked; // Define if blocked
float n; // North
float ne; // North-East
float e; // East
float se; // South-East
float s;
float sw;
float w;
float nw;
float c; // Center
};
int collision(const struct st_parameters param, vector<struct grid_cell> &node, vector<struct grid_cell> &tmp_node, vector<int> &obstacle) {
int x,y;
int i = 0;
float c_sq = 1.0f/3.0f; // Square of speed of sound
float w0 = 4.0f/9.0f; // Weighting factors
float w1 = 1.0f/9.0f;
float w2 = 1.0f/36.0f;
int chunk = param.ny/20;
float total_density = 0;
float u_x,u_y; // Avrage velocities in x and y direction
float u[9]; // Directional velocities
float d_equ[9]; // Equalibrium densities
float u_sq; // Squared velocity
float local_density; // Sum of densities in a particular node
for(y=0;y<param.ny;y++) {
for(x=0;x<param.nx;x++) {
i = y*param.nx + x; // Node index
// Dont consider blocked cells
if (obstacle[i] == 0) {
// Calculate local density
local_density = 0.0;
local_density += tmp_node[i].n;
local_density += tmp_node[i].e;
local_density += tmp_node[i].s;
local_density += tmp_node[i].w;
local_density += tmp_node[i].ne;
local_density += tmp_node[i].se;
local_density += tmp_node[i].sw;
local_density += tmp_node[i].nw;
local_density += tmp_node[i].c;
// Calculate x velocity component
u_x = (tmp_node[i].e + tmp_node[i].ne + tmp_node[i].se -
(tmp_node[i].w + tmp_node[i].nw + tmp_node[i].sw))
/ local_density;
// Calculate y velocity component
u_y = (tmp_node[i].n + tmp_node[i].ne + tmp_node[i].nw -
(tmp_node[i].s + tmp_node[i].sw + tmp_node[i].se))
/ local_density;
// Velocity squared
u_sq = u_x*u_x +u_y*u_y;
// Directional velocity components;
u[1] = u_x; // East
u[2] = u_y; // North
u[3] = -u_x; // West
u[4] = - u_y; // South
u[5] = u_x + u_y; // North-East
u[6] = -u_x + u_y; // North-West
u[7] = -u_x - u_y; // South-West
u[8] = u_x - u_y; // South-East
// Equalibrium densities
// Zero velocity density: weight w0
d_equ[0] = w0 * local_density * (1.0f - u_sq / (2.0f * c_sq));
// Axis speeds: weight w1
d_equ[1] = w1 * local_density * (1.0f + u[1] / c_sq
+ (u[1] * u[1]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[2] = w1 * local_density * (1.0f + u[2] / c_sq
+ (u[2] * u[2]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[3] = w1 * local_density * (1.0f + u[3] / c_sq
+ (u[3] * u[3]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[4] = w1 * local_density * (1.0f + u[4] / c_sq
+ (u[4] * u[4]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
// Diagonal speeds: weight w2
d_equ[5] = w2 * local_density * (1.0f + u[5] / c_sq
+ (u[5] * u[5]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[6] = w2 * local_density * (1.0f + u[6] / c_sq
+ (u[6] * u[6]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[7] = w2 * local_density * (1.0f + u[7] / c_sq
+ (u[7] * u[7]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[8] = w2 * local_density * (1.0f + u[8] / c_sq
+ (u[8] * u[8]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
// Relaxation step
node[i].c = (tmp_node[i].c + param.omega * (d_equ[0] - tmp_node[i].c));
node[i].e = (tmp_node[i].e + param.omega * (d_equ[1] - tmp_node[i].e));
node[i].n = (tmp_node[i].n + param.omega * (d_equ[2] - tmp_node[i].n));
node[i].w = (tmp_node[i].w + param.omega * (d_equ[3] - tmp_node[i].w));
node[i].s = (tmp_node[i].s + param.omega * (d_equ[4] - tmp_node[i].s));
node[i].ne = (tmp_node[i].ne + param.omega * (d_equ[5] - tmp_node[i].ne));
node[i].nw = (tmp_node[i].nw + param.omega * (d_equ[6] - tmp_node[i].nw));
node[i].sw = (tmp_node[i].sw + param.omega * (d_equ[7] - tmp_node[i].sw));
node[i].se = (tmp_node[i].se + param.omega * (d_equ[8] - tmp_node[i].se));
}
}
}
return 1;
}

In general, you should make sure that data used on different cpus are not shared (easy) and are not on the same cache line (false sharing, see for example here: False Sharing is No Fun). Data used by the same cpu should be close together to benefit from caches.

Current GPUs are notoriously depending about memory layout. Without more details about your application here are some things I would suggest you explore:
Unit-stride access is very important so GPUs prefer “structs of arrays” to “arrays of structures”. As you did moving field “blocked” into vector “obstacle”, it should be advantageous to convert all of the fields of “grid_cell” into separate vectors. This should show benefit on CPU as well for loops that don’t access all of the fields.
If “obstacle” is very sparse (which I guess is unlikely) then moving it to its own vector is particularly value. GPUs like CPUs will load more than one word from the memory system either in cache lines or some other form and you waste bandwidth when you don’t need some of the data. For many system memory bandwidth is the bottleneck resource so any way to reduce bandwidth helps.
This is more speculative, but now that you are writing all of the output vector, it is possible the memory subsystem is avoiding reading values in “node” that will simply be overwritten
On some systems, the on-chip memory is split into banks and having an odd number of fields within your structure may help remove bank conflicts.
Some systems will also “vectorize” loads and stores so again removing “blocked” from the structure might enable more vectorization. The shift to struct-of-arrays mitigates this worry.
Thanks for your interest in C++ AMP.
David Callahan
http://blogs.msdn.com/b/nativeconcurrency/ C++ AMP Team Blog

Some small generic tops:
Any data structure that is shared across multiple processors should be read only.
Any data structure that requires modification is unique to the processor and does not share memory locality with data that is required by another processor.
Make sure your memory is arranged so that your code scans serially through it (not taking huge steps or jumping around).

For anyone looking into this topic some hints.
Lattice-Boltzmann is generally bandwidth limited. This means its performance depends mainly on the amount of data that can be loaded from and written to memory.
Use a highly efficient compiled programming language: C or C++ are good choices for CPU-based implementations.
Choose an architecture with a high bandwidth. For a CPU this means high clock RAM and a lot of memory channels (quad-channel or more).
This makes it crucial to use an appropriate linear memory layout that makes effective use of cache prefetching: The data is arranged in memory in small portions, so called cache lines. Whenever a processor accesses an element the entire cache line (on modern architectures 64 Bytes) it lies in are loaded. This means 8 double or 16 float values are loaded at once! While I have not found this to be a problem for multi-core processors as they share the L3 cache this should lead to problems on many-core architectures as changes to the same cache line have to be kept in sync and problems arise when other processors are working on data that another processor is working on (false sharing). This can be avoided by introducing padding, meaning you add elements you won't use to fill the rest of the cache line. Assume you want to update a cell with a discretisation with 27 speeds for the D3Q27-lattice then in the case of doubles (8 Bytes) the data lies on 4 distinct cache lines. You should add 5 doubles of padding to match the 32 Bytes (4*8 Bytes).
unsigned int const PAD = (64 - sizeof(double)*D3Q27.SPEEDS % 64); ///< padding: number of doubles
size_t const MEM_SIZE_POP = sizeof(double)*NZ*NY*NX*(SPEEDS+PAD); ///< amount of memory to be allocated
Most compilers naturally align the start of the array with a cache line so you don't have to take care of that.
The linear indices are inconvenient for accessing. Therefore you should design the accessing as efficient as possible. You could write a wrapper class. In any case inline those functions, meaning every call is replaced by their definition in the code.
inline size_t const D3Q27_PopIndex(unsigned int const x, unsigned int const y, unsigned int const z, unsigned int const d)
{
return (D3Q27.SPEEDS + D3Q27.PAD)*(NX*(NY*z + y) + x) + D3Q27.SPEEDS*p + d;
}
Furthermore cache locality can be increased by maximising the ratio between computation and communication for example using three-dimensional spatial loop blocking (Scaling issues with OpenMP), meaning every code works on a cube of cells instead of a single cell.
Generally implementations make use of two distinct populations A and B and perform collision and streaming from one implementation into another. This means every value in memory exists twice, once pre- and once post-collision. There exist different strategies for recombining steps and storing in such a way that you only have to keep one population copy in memory. For instance see the A-A pattern as proposed by P. Bailey et al. - "Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors" (https://www2.cs.arizona.edu/people/pbailey/Accelerating_GPU_LBM.pdf) or the Esoteric Twist by M. Geier & M. Schönherr - "Esoteric Twist: An Efficient in-Place Streaming Algorithmus for the Lattice Boltzmann Method on Massively Parallel Hardware" (https://pdfs.semanticscholar.org/ea64/3d63667900b60e6ff49f2746211700e63802.pdf). I have implemented the first with the use of macros meaning every access of a population calls a macro of the form:
#define O_E(a,b) a*odd + b*(!odd)
#define READ_f_0 D3Q27_PopIndex(x, y, z, 0, p)
#define READ_f_1 D3Q27_PopIndex(O_E(x_m, x), y, z, O_E( 1, 2), p)
#define READ_f_2 D3Q27_PopIndex(O_E(x_p, x), y, z, O_E( 2, 1), p)
...
#define WRITE_f_0 D3Q27_PopIndex(x, y, z, 0, p)
#define WRITE_f_1 D3Q27_PopIndex(O_E(x_p, x), y, z, O_E( 2, 1), p)
#define WRITE_f_2 D3Q27_PopIndex(O_E(x_m, x), y, z, O_E( 1, 2), p)
...
If you have multiple interaction populations use grid merging. Lay the indices out linearly in memory and put two distinct populations side by side. The accessing of population p works then as follows:
inline size_t const D3Q27_PopIndex(unsigned int const x, unsigned int const y, unsigned int const z, unsigned int const d, unsigned int const p = 0)
{
return (D3Q27.SPEEDS*D3Q27.NPOP + D3Q27.PAD)*(NX*(NY*z + y) + x) + D3Q27.SPEEDS*p + d;
}
For a regular grid make the algorithm as predictable as possible. Let every cell perform collision and streaming and then do the boundary conditions in reverse afterwards. If you have many cells that do not contribute directly to the algorithm omit them with a logical mask that you can store in the padding as well!
Make everything know to the compiler at compilation time: Template for example boundary conditions with a function that takes care of index changes so you don't have to rewrite every boundary condition.
Modern architectures have registers that can perform SIMD operations, so the same instruction on multiple data. Some processors (AVX-512) can process up to 512 bits of data and thus 32 doubles almost as fast as a single number. This seems to be very attractive for LBM in particular ever since gathering and scattering have been introduced (https://en.wikipedia.org/wiki/Gather-scatter_(vector_addressing)) but with the current bandwidth limitations (maybe it is worth it with DDR5 and processors with few cores) this is in my opinion not worth the hassle: The single core performance and parallel scaling is better (M. Wittmann et al. - "Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis" - https://arxiv.org/abs/1711.11468) but the overall algorithm performs not any better as it is bandwidth limited. So it only makes sense on architectures that are limited by the computing capacities rather than the bandwidth. On the Xeon Phi architecture the results seem to be remarkable Robertsen et al. - "High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor" (https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.5072).
In my opinion most of this is not worth the effort for simple 2D implementations. Do the easy optimisations there, loop blocking, a linear memory layout but forget about the more complex access patterns. In 3D the effect can be enormous: I have achieved up to 95% parallel scalability and an overall performance of over 150 Mlups with a D3Q19 on a modern 12-core processor. For more performance switch to more adequate architectures like GPUs with CUDA C that are optimised for bandwidth.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js