GLSL vector addition vs multiple floats

GLSL vector addition vs multiple floats - glsl

Say i have this:
float x2 = q.x + q.x;
float y2 = q.y + q.y;
float z2 = q.z + q.z;
vs this:
vec3 q2 = q.xyz + q.xyz;
Is the vector operation better / faster or is this not a concern at GLSL level?

Yes its faster, GPUs are optimized for vectorized operations, using tools like AMDs ShaderAnalyzer or NVIDIAs nSight you can see the generated assembly revealing that your first snippet results in 3 scalar ops whereas your second one is a single vector op.

Related

Inefficient memory access pattern and irregular stride accesses

I'm trying to optimize this function:
bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
{
bool ret = false;
// input size (-1 for the safe bilinear interpolation)
const int width = im.cols-1;
const int height = im.rows-1;
// output size
const int halfWidth = res.cols >> 1;
const int halfHeight = res.rows >> 1;
float *out = res.ptr<float>(0);
const float *imptr = im.ptr<float>(0);
for (int j=-halfHeight; j<=halfHeight; ++j)
{
const float rx = ofsx + j * a12;
const float ry = ofsy + j * a22;
#pragma omp simd
for(int i=-halfWidth; i<=halfWidth; ++i, out++)
{
float wx = rx + i * a11;
float wy = ry + i * a21;
const int x = (int) floor(wx);
const int y = (int) floor(wy);
if (x >= 0 && y >= 0 && x < width && y < height)
{
// compute weights
wx -= x; wy -= y;
int rowOffset = y*im.cols;
int rowOffset1 = (y+1)*im.cols;
// bilinear interpolation
*out =
(1.0f - wy) *
((1.0f - wx) *
imptr[rowOffset+x] +
wx *
imptr[rowOffset+x+1]) +
( wy) *
((1.0f - wx) *
imptr[rowOffset1+x] +
wx *
imptr[rowOffset1+x+1]);
} else {
*out = 0;
ret = true; // touching boundary of the input
}
}
}
return ret;
}
I'm using Intel Advisor to optimize it and even though the inner for has already been vectorized, Intel Advisor detected inefficient memory access patterns:
60% of unit/zero stride access
40% of irregular/random stride access
In particular there are 4 gather (irregular) access in the following three instructions:
The problem of gather access from my understanding happens when the accessed element is of the type a[b], where b is unpredictable. This seems to be the case with imptr[rowOffset+x], where both rowOffset and x are unpredictable.
At the same time, I see this Vertical Invariant which should happen (again, from my understanding) when elements are accessed with a constant offset. But actually I don't see where this constant offset
So I have 3 questions:
Did I understood the problem of gather accesses correctly?
What about the Vertical Invariant access? I'm less sure about this point.
Finally, how can I improve/solve the memory access here?
Compiled with icpc 2017 update 3 with the following flags:
INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl

Vectorizing (SIMD-izing) your code does not automatically make your access pattern better (or worse).
To maximize vectorized code performance you have to try to have unit stride (also called contiguous, linear, stride-1) memory access pattern in your code. Or at least "predictable" regular stride-N, where N should ideally be moderately low value.
Without introducing such regularity - you keep your memory LOAD/STORE operations partially sequential (non parallel) at instruction level. So each time you want to do "parallel" addtion/multiplication etc, you have to do "non-parallel" original data elements "gathering".
In your case there seem to be regular stride-N (logically) - this is seen from both code snippet and from Advisor MAP output (on the right side panel).
Vertical invariant - means that you sometimes access the same memory location between iterations. Unit-stride means that you have logically contiguous memory access in other case.
However, the code structure is complicated: you have if-statement in loop body, you have complex conditions and floating point --> integer (simple, but still) conversions.
Therefore compiler has to use most generic and most inefficient method (gathers) "just in case" and as a result your physical , factual memory access pattern (from compiler code generation) is irregular "GATHER", but logically your access pattern is regular (invariant or unit-stride).
Solution may not be very easy, but I would try following :
If algorithm allows that - consider excluding if-statement. This can sometimes be achieved by splitting loop into several ones.
Try to get ride of semi-floating point induction variables, floor etc. Try to make them integers and use "canonic" form ( for (i) array [super-simple-expression(i)] = something)
Try to use linear clause of pragma simd to inform compiler that there is actually unit-stride present somewhere

More efficient way to do this in GLSL

x += offset * vec3(notEqual(a, greaterThanEqual(fract(b), vec3(0.5))));
x and b are vector3, and a is bvec3.
This seems fairly expensive and i'm wondering if there is another way to do it. Basically I want to offset x component wise by a fixed amount depending on if b's fractional component is above 0.5, and if a is 1 or 0 (true or false). If it's 1 and <0.5, do an offset, if it's 0 and >0.5 do an offset, like xor, I use notEqual for xor here.

I can't think of anything massively better. This one looks slightly simpler than what you have:
x += offset * vec3(notEqual(vec3(a), round(fract(b))));
Or similarly:
x += offset * vec3(notEqual(vec3(a), step(0.5, fract(b))));
If the notEqual() is expensive, the xor-type operation could be replaced by using a sum modulo 2:
x += offset * mod(vec3(a) + round(fract(b)), 2.0);
Or in a similar spirit, but avoiding the mod() at the price of a few more basic operations:
vec3 af = vec3(a);
vec3 brf = round(fract(b));
x += offset * (af + brf - 2.0 * af * brf);
There's probably countless more variations and permutations of similar ideas. As was already suggested in the comments, there's almost no way around benchmarking them on a good cross-section of the hardware you care about.

Efficient Bicubic filtering code in GLSL?

I'm wondering if anyone has complete, working, and efficient code to do bicubic texture filtering in glsl. There is this:
http://www.codeproject.com/Articles/236394/Bi-Cubic-and-Bi-Linear-Interpolation-with-GLSL
or
https://github.com/visionworkbench/visionworkbench/blob/master/src/vw/GPU/Shaders/Interp/interpolation-bicubic.glsl
but both do 16 texture reads where only 4 are necessary:
https://groups.google.com/forum/#!topic/comp.graphics.api.opengl/kqrujgJfTxo
However the method above uses a missing "cubic()" function that I don't know what it is supposed to do, and also takes an unexplained "texscale" parameter.
There is also the NVidia version:
https://developer.nvidia.com/gpugems/gpugems2/part-iii-high-quality-rendering/chapter-20-fast-third-order-texture-filtering
but I believe this uses CUDA, which is specific to NVidia's cards. I need glsl.
I could probably port the nvidia version to glsl, but thought I'd ask first to see if anyone already has a complete, working glsl bicubic shader.

I found this implementation which can be used as a drop-in replacement for texture() (from http://www.java-gaming.org/index.php?topic=35123.0 (one typo fixed)):
// from http://www.java-gaming.org/index.php?topic=35123.0
vec4 cubic(float v){
vec4 n = vec4(1.0, 2.0, 3.0, 4.0) - v;
vec4 s = n * n * n;
float x = s.x;
float y = s.y - 4.0 * s.x;
float z = s.z - 4.0 * s.y + 6.0 * s.x;
float w = 6.0 - x - y - z;
return vec4(x, y, z, w) * (1.0/6.0);
}
vec4 textureBicubic(sampler2D sampler, vec2 texCoords){
vec2 texSize = textureSize(sampler, 0);
vec2 invTexSize = 1.0 / texSize;
texCoords = texCoords * texSize - 0.5;
vec2 fxy = fract(texCoords);
texCoords -= fxy;
vec4 xcubic = cubic(fxy.x);
vec4 ycubic = cubic(fxy.y);
vec4 c = texCoords.xxyy + vec2 (-0.5, +1.5).xyxy;
vec4 s = vec4(xcubic.xz + xcubic.yw, ycubic.xz + ycubic.yw);
vec4 offset = c + vec4 (xcubic.yw, ycubic.yw) / s;
offset *= invTexSize.xxyy;
vec4 sample0 = texture(sampler, offset.xz);
vec4 sample1 = texture(sampler, offset.yz);
vec4 sample2 = texture(sampler, offset.xw);
vec4 sample3 = texture(sampler, offset.yw);
float sx = s.x / (s.x + s.y);
float sy = s.z / (s.z + s.w);
return mix(
mix(sample3, sample2, sx), mix(sample1, sample0, sx)
, sy);
}
Example: Nearest, bilinear, bicubic:
The ImageData of this image is
{{{0.698039, 0.996078, 0.262745}, {0., 0.266667, 1.}, {0.00392157,
0.25098, 0.996078}, {1., 0.65098, 0.}}, {{0.996078, 0.823529,
0.}, {0.498039, 0., 0.00392157}, {0.831373, 0.00392157,
0.00392157}, {0.956863, 0.972549, 0.00784314}}, {{0.909804,
0.00784314, 0.}, {0.87451, 0.996078, 0.0862745}, {0.196078,
0.992157, 0.760784}, {0.00392157, 0.00392157, 0.498039}}, {{1.,
0.878431, 0.}, {0.588235, 0.00392157, 0.00392157}, {0.00392157,
0.0666667, 0.996078}, {0.996078, 0.517647, 0.}}}
I tried to reproduce this (many other interpolation techniques)
but they have clamped padding, while I have repeating (wrapping) boundaries. Therefore it is not exactly the same.
It seems this bicubic business is not a proper interpolation, i.e. it does not take on the original values at the points where the data is defined.

I decided to take a minute to dig my old Perforce activities and found the missing cubic() function; enjoy! :)
vec4 cubic(float v)
{
vec4 n = vec4(1.0, 2.0, 3.0, 4.0) - v;
vec4 s = n * n * n;
float x = s.x;
float y = s.y - 4.0 * s.x;
float z = s.z - 4.0 * s.y + 6.0 * s.x;
float w = 6.0 - x - y - z;
return vec4(x, y, z, w);
}

Wow. I recognize the code above (I can not comment w/ reputation < 50) as I came up with it in early 2011. The problem I was trying to solve was related to old IBM T42 (sorry the exact model number escapes me) laptop and it's ATI graphics stack. I developed the code on NV card and originally I used 16 texture fetches. That was kinda of slow but fast enough for my purposes. When someone reported it did not work on his laptop it became apparent that they did not support enough texture fetches per fragment. I had to engineer a work-around and the best I could come up with was to do it with number of texture fetches that would work.
I thought about it like this: okay, so if I handle each quad (2x2) with linear filter the remaining problem is can the rows and columns share the weights? That was the only problem on my mind when I set out to craft the code. Of course they could be shared; the weights are same for each column and row; perfect!
Now I had four samples. The remaining problem was how to correctly combine the samples. That was the biggest obstacle to overcome. It took about 10 minutes with pencil and paper. With trembling hands I typed the code in and it worked, nice. Then I uploaded the binaries to the guy who promised to check it out on his T42 (?) and he reported it worked. The end. :)
I can assure that the equations check out and give mathematically identical results to computing the samples individually. FYI: with CPU it's faster to do horizontal and vertical scan separately. With GPU multiple passes is not that great idea, especially when it's probably not feasible anyway in typical use case.
Food for thought: it is possible to use a texture lookup for the cubic() function. Which is faster depends on the GPU but generally speaking, the sampler is light on the ALU side just doing the arithmetic would balance things out. YMMV.

The missing function cubic() in JAre's answer could look like this:
vec4 cubic(float x)
{
float x2 = x * x;
float x3 = x2 * x;
vec4 w;
w.x = -x3 + 3*x2 - 3*x + 1;
w.y = 3*x3 - 6*x2 + 4;
w.z = -3*x3 + 3*x2 + 3*x + 1;
w.w = x3;
return w / 6.f;
}
It returns the four weights for cubic B-Spline.
It is all explained in NVidia Gems.

(EDIT)
Cubic() is a cubic spline function
Example:
Texscale is sampling window size coefficient. You can start with 1.0 value.
vec4 filter(sampler2D texture, vec2 texcoord, vec2 texscale)
{
float fx = fract(texcoord.x);
float fy = fract(texcoord.y);
texcoord.x -= fx;
texcoord.y -= fy;
vec4 xcubic = cubic(fx);
vec4 ycubic = cubic(fy);
vec4 c = vec4(texcoord.x - 0.5, texcoord.x + 1.5, texcoord.y -
0.5, texcoord.y + 1.5);
vec4 s = vec4(xcubic.x + xcubic.y, xcubic.z + xcubic.w, ycubic.x +
ycubic.y, ycubic.z + ycubic.w);
vec4 offset = c + vec4(xcubic.y, xcubic.w, ycubic.y, ycubic.w) /
s;
vec4 sample0 = texture2D(texture, vec2(offset.x, offset.z) *
texscale);
vec4 sample1 = texture2D(texture, vec2(offset.y, offset.z) *
texscale);
vec4 sample2 = texture2D(texture, vec2(offset.x, offset.w) *
texscale);
vec4 sample3 = texture2D(texture, vec2(offset.y, offset.w) *
texscale);
float sx = s.x / (s.x + s.y);
float sy = s.z / (s.z + s.w);
return mix(
mix(sample3, sample2, sx),
mix(sample1, sample0, sx), sy);
}
Source

For anybody interested in GLSL code to do tri-cubic interpolation, ray-casting code using cubic interpolation can be found in the examples/glCubicRayCast folder in:
http://www.dannyruijters.nl/cubicinterpolation/CI.zip
edit: The cubic interpolation code is now available on github: CUDA version and WebGL version, and GLSL sample.

I've been using #Maf 's cubic spline recipe for over a year, and I recommend it, if a cubic B-spline meets your needs.
But I recently realized that, for my particular application, it is important for the intensities to match exactly at the sample points. So I switched to using a Catmull-Rom spline, which uses a slightly different recipe like so:
// Catmull-Rom spline actually passes through control points
vec4 cubic(float x) // cubic_catmullrom(float x)
{
const float s = 0.5; // potentially adjustable parameter
float x2 = x * x;
float x3 = x2 * x;
vec4 w;
w.x = -s*x3 + 2*s*x2 - s*x + 0;
w.y = (2-s)*x3 + (s-3)*x2 + 1;
w.z = (s-2)*x3 + (3-2*s)*x2 + s*x + 0;
w.w = s*x3 - s*x2 + 0;
return w;
}
I found these coefficients, plus those for a number of other flavors of cubic splines, in the lecture notes at:
http://www.cs.cmu.edu/afs/cs/academic/class/15462-s10/www/lec-slides/lec06.pdf

I think it is possible that the Catmull version could be done with 4 texture lookups by (a) arranging the input texture like a chessboard with alternate slots saved as positives and as negatives, and (b) an associated modification of textureBicubic. That would rely on the contributions/weights w.x/w.w always being negative, and the contributions w.y/w.z always being positive. I haven't double-checked if this is true, or exactly how the modified textureBicubic would look.
... I have verified that w contributions do satisfy the +ve -ve rules.

C++ (and maths) : fast approximation of a trigonometric function

I know this is a recurring question, but I haven't really found a useful answer yet. I'm basically looking for a fast approximation of the function acos in C++, I'd like to know if I can significantly beat the standard one.
But some of you might have insights on my specific problem: I'm writing a scientific program which I need to be very fast. The complexity of the main algorithm boils down to computing the following expression (many times with different parameters):
sin( acos(t_1) + acos(t_2) + ... + acos(t_n) )
where the t_i are known real (double) numbers, and n is very small (like smaller than 6). I need a precision of at least 1e-10. I'm currently using the standard sin and acos C++ functions.
Do you think I can significantly gain speed somehow? For those of you who know some maths, do you think it would be smart to expand that sine in order to get an algebraic expression in terms of the t_i (only involving square roots)?
Thank you your your answers.

The code below provides simple implementations of sin() and acos() that should satisfy your accuracy requirements and that you might want to try. Please note that the math library implementation on your platform is very likely highly tuned for the specific hardware capabilities of that platform and is probably also coded in assembly for maximum efficiency, so simple compiled C code not catering to specifics of the hardware is unlikely to provide higher performance, even when the accuracy requirements are somewhat relaxed from full double precision. As Viktor Latypov points out, it may also be worthwhile to search for algorithmic alternatives that do not require expensive calls to transcendental math functions.
In the code below I have tried to stick to simple, portable constructs. If your compiler supports the rint() function [specified by C99 and C++11] you might want to use that instead of my_rint(). On some platforms, the call to floor() can be expensive since it requires dynamic changing of machine state. The functions my_rint(), sin_core(), cos_core(), and asin_core() would want to be inlined for best performance. Your compiler may do that automatically at high optimization levels (e.g. when compiling with -O3), or you could add an appropriate inlining attribute to these functions, e.g. inline or __inline depending on your toolchain.
Not knowing anything about your platform I opted for simple polynomial approximations, which are evaluated using Estrin's scheme plus Horner's scheme. See Wikipedia for a description of these evaluation schemes:
http://en.wikipedia.org/wiki/Estrin%27s_scheme ,
http://en.wikipedia.org/wiki/Horner_scheme
The approximations themselves are of the minimax type and were custom generated for this answer with the Remez algorithm:
http://en.wikipedia.org/wiki/Minimax_approximation_algorithm ,
http://en.wikipedia.org/wiki/Remez_algorithm
The identities used in the argument reduction for acos() are noted in the comments, for sin() I used a Cody/Waite-style argument reduction, as described in the following book:
W. J. Cody, W. Waite, Software Manual for the Elementary Functions. Prentice-Hall, 1980
The error bounds mentioned in the comments are approximate, and have not been rigorously tested or proven.
/* not quite rint(), i.e. results not properly rounded to nearest-or-even */
double my_rint (double x)
{
double t = floor (fabs(x) + 0.5);
return (x < 0.0) ? -t : t;
}
/* minimax approximation to cos on [-pi/4, pi/4] with rel. err. ~= 7.5e-13 */
double cos_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using Estrin's scheme */
return (-2.7236370439787708e-7 * x2 + 2.4799852696610628e-5) * x8 +
(-1.3888885054799695e-3 * x2 + 4.1666666636943683e-2) * x4 +
(-4.9999999999963024e-1 * x2 + 1.0000000000000000e+0);
}
/* minimax approximation to sin on [-pi/4, pi/4] with rel. err. ~= 5.5e-12 */
double sin_core (double x)
{
double x4, x2, t;
x2 = x * x;
x4 = x2 * x2;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return ((2.7181216275479732e-6 * x2 - 1.9839312269456257e-4) * x4 +
(8.3333293048425631e-3 * x2 - 1.6666666640797048e-1)) * x2 * x + x;
}
/* minimax approximation to arcsin on [0, 0.5625] with rel. err. ~= 1.5e-11 */
double asin_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return (((4.5334220547132049e-2 * x2 - 1.1226216762576600e-2) * x4 +
(2.6334281471361822e-2 * x2 + 2.0596336163223834e-2)) * x8 +
(3.0582043602875735e-2 * x2 + 4.4630538556294605e-2) * x4 +
(7.5000364034134126e-2 * x2 + 1.6666666300567365e-1)) * x2 * x + x;
}
/* relative error < 7e-12 on [-50000, 50000] */
double my_sin (double x)
{
double q, t;
int quadrant;
/* Cody-Waite style argument reduction */
q = my_rint (x * 6.3661977236758138e-1);
quadrant = (int)q;
t = x - q * 1.5707963267923333e+00;
t = t - q * 2.5633441515945189e-12;
if (quadrant & 1) {
t = cos_core(t);
} else {
t = sin_core(t);
}
return (quadrant & 2) ? -t : t;
}
/* relative error < 2e-11 on [-1, 1] */
double my_acos (double x)
{
double xa, t;
xa = fabs (x);
/* arcsin(x) = pi/2 - 2 * arcsin (sqrt ((1-x) / 2))
* arccos(x) = pi/2 - arcsin(x)
* arccos(x) = 2 * arcsin (sqrt ((1-x) / 2))
*/
if (xa > 0.5625) {
t = 2.0 * asin_core (sqrt (0.5 * (1.0 - xa)));
} else {
t = 1.5707963267948966 - asin_core (xa);
}
/* arccos (-x) = pi - arccos(x) */
return (x < 0.0) ? (3.1415926535897932 - t) : t;
}

sin( acos(t1) + acos(t2) + ... + acos(tn) )
boils down to the calculation of
sin( acos(x) ) and cos(acos(x))=x
because
sin(a+b) = cos(a)sin(b)+sin(a)cos(b).
The first thing is
sin( acos(x) ) = sqrt(1-x*x)
Taylor series expansion for the sqrt reduces the problem to polynomial calculations.
To clarify, here's the expansion to n=2, n=3:
sin( acos(t1) + acos(t2) ) = sin(acos(t1))cos(acos(t2)) + sin(acos(t2))cos(acos(t1) = sqrt(1-t1*t1) * t2 + sqrt(1-t2*t2) * t1
cos( acos(t2) + acos(t3) ) = cos(acos(t2)) cos(acos(t3)) - sin(acos(t2))sin(acos(t3)) = t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)
sin( acos(t1) + acos(t2) + acos(t3)) =
sin(acos(t1))cos(acos(t2) + acos(t3)) + sin(acos(t2)+acos(t3) )cos(acos(t1)=
sqrt(1-t1*t1) * (t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)) + (sqrt(1-t2*t2) * t3 + sqrt(1-t3*t3) * t2 ) * t1
and so on.
The sqrt() for x in (-1,1) can be computed using
x_0 is some approximation, say, zero
x_(n+1) = 0.5 * (x_n + S/x_n) where S is the argument.
EDIT: I mean the "Babylonian method", see Wikipedia's article for details. You will need not more than 5-6 iterations to achieve 1e-10 with x in (0,1).

As Jonas Wielicki mentions in the comments, there isn't much precision trade-offs you can make.
Your best bet is to try and use the processor intrinsics for the functions (if your compiler doesn't do this already) and using some math to reduce the amount of calculations necessary.
Also very important is to keep everything in a CPU-friendly format, make sure there are few cache misses, etc.
If you are calculating large amounts of functions like acos perhaps moving to the GPU is an option for you?

You can try to create lookup tables, and use them instead of standard c++ functions, and see if you see any performance boost.

Significant gains can be made by aligning memory and streaming in the data to your kernel. Most often this dwarfs the gains that can be made by recreating the math functions. Think of how you can improve memory access to/from your kernel operator.
Memory access can be improved by using buffering techniques. This depends on your hardware platform. If you are running this on a DSP, you could DMA your data onto an L2 cache and schedule the instructions so that multiplier units are fully occupied.
If you are on general purpose CPU, most you can do is to use aligned data, feed the cache lines by prefetching. If you have nested loops, then the inner most loop should go back and forth (i.e. iterate forward and then iterate backward) so that cache lines are utilised, etc.
You could also think of ways to parallelize the computation using multiple cores. If you can use a GPU this could significantly improve performance (albeit with a lesser precision).

In addition to what others have said, here are some techniques at speed optimization:
Profile
Find out where in the code most of the time is spent.
Only optimize that area to gain the mose benefit.
Unroll Loops
The processors don't like branches or jumps or changes in the execution path. In general, the processor has to reload the instruction pipeline which uses up time that can be spent on calculations. This includes function calls.
The technique is to place more "sets" of operations in your loop and reduce the number of iterations.
Declare Variables as Register
Variables that are used frequently should be declared as register. Although many members of SO have stated compilers ignore this suggestion, I have found out otherwise. Worst case, you wasted some time typing.
Keep Intense Calculations Short & Simple
Many processors have enough room in their instruction pipelines to hold small for loops. This reduces the amount of time spent reloading the instruction pipeline.
Distribute your big calculation loop into many small ones.
Perform Work on Small Sections of Arrays & Matrices
Many processors have a data cache, which is ultra fast memory very close to the processor. The processor likes to load the data cache once from off-processor memory. More loads require time that can be spent making calculations. Search the web for "Data Oriented Design Cache".
Think in Parallel Processor Terms
Change the design of your calculations so they can be easily adaptable to use with multiple processors. Many CPUs have multiple cores that can execute instructions in parallel. Some processors have enough intelligence to automatically delegate instructions to their multiple cores.
Some compilers can optimize code for parallel processing (look up the compiler options for your compiler). Designing your code for parallel processing will make this optimization easier for the compiler.
Analyze Assembly Listing of Functions
Print out the assembly language listing of your function.
Change the design of your function to match that of the assembly language or to help the compiler generate more optimal assembly language.
If you really need more efficiency, optimize the assembly language and put in as inline assembly code or as a separate module. I generally prefer the latter.
Examples
In your situation, take first 10 terms of the Taylor expansion, calculate them separately and place into individual variables:
double term1, term2, term3, term4;
double n, n1, n2, n3, n4;
n = 1.0;
for (i = 0; i < 100; ++i)
{
n1 = n + 2;
n2 = n + 4;
n3 = n + 6;
n4 = n + 8;
term1 = 4.0/n;
term2 = 4.0/n1;
term3 = 4.0/n2;
term4 = 4.0/n3;
Then sum up all of your terms:
result = term1 - term2 + term3 - term4;
// Or try sorting by operation, if possible:
// result = term1 + term3;
// result -= term2 + term4;
n = n4 + 2;
}

Lets consider two terms first:
cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
or cos(a+b) = cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b))
Taking cos to the RHS
a+b = acos( cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b)) ) ... 1
Here cos(a) = t_1 and cos(b) = t_2
a = acos(t_1) and b = acos(t_2)
By substituting in equation (1), we get
acos(t_1) + acos(t_2) = acos(t_1*t_2 - sqrt(1 - t_1*t_1) * sqrt(1 - t_2*t_2))
Here you can see that you have combined two acos into one. So you can pair up all the acos recursively and form a binary tree. At the end, you'll be left with an expression of the form sin(acos(x)) which equals sqrt(1 - x*x).
This will improve the time complexity.
However, I'm not sure about the complexity of calculating sqrt().

Improving memory layout for parallel computing

I'm trying to optimize an algorithm (Lattice Boltzmann) for parallel computing using C++ AMP. And looking for some suggestions to optimize the memory layout, just found out that removing one parameter from the structure into another vector (the blocked vector) gave and increase of about 10%.
Anyone got any tips that can further improve this, or something i should take into consideration?
Below is the most time consuming function that is executed for each timestep, and the structure used for the layout.
struct grid_cell {
// int blocked; // Define if blocked
float n; // North
float ne; // North-East
float e; // East
float se; // South-East
float s;
float sw;
float w;
float nw;
float c; // Center
};
int collision(const struct st_parameters param, vector<struct grid_cell> &node, vector<struct grid_cell> &tmp_node, vector<int> &obstacle) {
int x,y;
int i = 0;
float c_sq = 1.0f/3.0f; // Square of speed of sound
float w0 = 4.0f/9.0f; // Weighting factors
float w1 = 1.0f/9.0f;
float w2 = 1.0f/36.0f;
int chunk = param.ny/20;
float total_density = 0;
float u_x,u_y; // Avrage velocities in x and y direction
float u[9]; // Directional velocities
float d_equ[9]; // Equalibrium densities
float u_sq; // Squared velocity
float local_density; // Sum of densities in a particular node
for(y=0;y<param.ny;y++) {
for(x=0;x<param.nx;x++) {
i = y*param.nx + x; // Node index
// Dont consider blocked cells
if (obstacle[i] == 0) {
// Calculate local density
local_density = 0.0;
local_density += tmp_node[i].n;
local_density += tmp_node[i].e;
local_density += tmp_node[i].s;
local_density += tmp_node[i].w;
local_density += tmp_node[i].ne;
local_density += tmp_node[i].se;
local_density += tmp_node[i].sw;
local_density += tmp_node[i].nw;
local_density += tmp_node[i].c;
// Calculate x velocity component
u_x = (tmp_node[i].e + tmp_node[i].ne + tmp_node[i].se -
(tmp_node[i].w + tmp_node[i].nw + tmp_node[i].sw))
/ local_density;
// Calculate y velocity component
u_y = (tmp_node[i].n + tmp_node[i].ne + tmp_node[i].nw -
(tmp_node[i].s + tmp_node[i].sw + tmp_node[i].se))
/ local_density;
// Velocity squared
u_sq = u_x*u_x +u_y*u_y;
// Directional velocity components;
u[1] = u_x; // East
u[2] = u_y; // North
u[3] = -u_x; // West
u[4] = - u_y; // South
u[5] = u_x + u_y; // North-East
u[6] = -u_x + u_y; // North-West
u[7] = -u_x - u_y; // South-West
u[8] = u_x - u_y; // South-East
// Equalibrium densities
// Zero velocity density: weight w0
d_equ[0] = w0 * local_density * (1.0f - u_sq / (2.0f * c_sq));
// Axis speeds: weight w1
d_equ[1] = w1 * local_density * (1.0f + u[1] / c_sq
+ (u[1] * u[1]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[2] = w1 * local_density * (1.0f + u[2] / c_sq
+ (u[2] * u[2]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[3] = w1 * local_density * (1.0f + u[3] / c_sq
+ (u[3] * u[3]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[4] = w1 * local_density * (1.0f + u[4] / c_sq
+ (u[4] * u[4]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
// Diagonal speeds: weight w2
d_equ[5] = w2 * local_density * (1.0f + u[5] / c_sq
+ (u[5] * u[5]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[6] = w2 * local_density * (1.0f + u[6] / c_sq
+ (u[6] * u[6]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[7] = w2 * local_density * (1.0f + u[7] / c_sq
+ (u[7] * u[7]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
d_equ[8] = w2 * local_density * (1.0f + u[8] / c_sq
+ (u[8] * u[8]) / (2.0f * c_sq * c_sq)
- u_sq / (2.0f * c_sq));
// Relaxation step
node[i].c = (tmp_node[i].c + param.omega * (d_equ[0] - tmp_node[i].c));
node[i].e = (tmp_node[i].e + param.omega * (d_equ[1] - tmp_node[i].e));
node[i].n = (tmp_node[i].n + param.omega * (d_equ[2] - tmp_node[i].n));
node[i].w = (tmp_node[i].w + param.omega * (d_equ[3] - tmp_node[i].w));
node[i].s = (tmp_node[i].s + param.omega * (d_equ[4] - tmp_node[i].s));
node[i].ne = (tmp_node[i].ne + param.omega * (d_equ[5] - tmp_node[i].ne));
node[i].nw = (tmp_node[i].nw + param.omega * (d_equ[6] - tmp_node[i].nw));
node[i].sw = (tmp_node[i].sw + param.omega * (d_equ[7] - tmp_node[i].sw));
node[i].se = (tmp_node[i].se + param.omega * (d_equ[8] - tmp_node[i].se));
}
}
}
return 1;
}

In general, you should make sure that data used on different cpus are not shared (easy) and are not on the same cache line (false sharing, see for example here: False Sharing is No Fun). Data used by the same cpu should be close together to benefit from caches.

Current GPUs are notoriously depending about memory layout. Without more details about your application here are some things I would suggest you explore:
Unit-stride access is very important so GPUs prefer “structs of arrays” to “arrays of structures”. As you did moving field “blocked” into vector “obstacle”, it should be advantageous to convert all of the fields of “grid_cell” into separate vectors. This should show benefit on CPU as well for loops that don’t access all of the fields.
If “obstacle” is very sparse (which I guess is unlikely) then moving it to its own vector is particularly value. GPUs like CPUs will load more than one word from the memory system either in cache lines or some other form and you waste bandwidth when you don’t need some of the data. For many system memory bandwidth is the bottleneck resource so any way to reduce bandwidth helps.
This is more speculative, but now that you are writing all of the output vector, it is possible the memory subsystem is avoiding reading values in “node” that will simply be overwritten
On some systems, the on-chip memory is split into banks and having an odd number of fields within your structure may help remove bank conflicts.
Some systems will also “vectorize” loads and stores so again removing “blocked” from the structure might enable more vectorization. The shift to struct-of-arrays mitigates this worry.
Thanks for your interest in C++ AMP.
David Callahan
http://blogs.msdn.com/b/nativeconcurrency/ C++ AMP Team Blog

Some small generic tops:
Any data structure that is shared across multiple processors should be read only.
Any data structure that requires modification is unique to the processor and does not share memory locality with data that is required by another processor.
Make sure your memory is arranged so that your code scans serially through it (not taking huge steps or jumping around).

For anyone looking into this topic some hints.
Lattice-Boltzmann is generally bandwidth limited. This means its performance depends mainly on the amount of data that can be loaded from and written to memory.
Use a highly efficient compiled programming language: C or C++ are good choices for CPU-based implementations.
Choose an architecture with a high bandwidth. For a CPU this means high clock RAM and a lot of memory channels (quad-channel or more).
This makes it crucial to use an appropriate linear memory layout that makes effective use of cache prefetching: The data is arranged in memory in small portions, so called cache lines. Whenever a processor accesses an element the entire cache line (on modern architectures 64 Bytes) it lies in are loaded. This means 8 double or 16 float values are loaded at once! While I have not found this to be a problem for multi-core processors as they share the L3 cache this should lead to problems on many-core architectures as changes to the same cache line have to be kept in sync and problems arise when other processors are working on data that another processor is working on (false sharing). This can be avoided by introducing padding, meaning you add elements you won't use to fill the rest of the cache line. Assume you want to update a cell with a discretisation with 27 speeds for the D3Q27-lattice then in the case of doubles (8 Bytes) the data lies on 4 distinct cache lines. You should add 5 doubles of padding to match the 32 Bytes (4*8 Bytes).
unsigned int const PAD = (64 - sizeof(double)*D3Q27.SPEEDS % 64); ///< padding: number of doubles
size_t const MEM_SIZE_POP = sizeof(double)*NZ*NY*NX*(SPEEDS+PAD); ///< amount of memory to be allocated
Most compilers naturally align the start of the array with a cache line so you don't have to take care of that.
The linear indices are inconvenient for accessing. Therefore you should design the accessing as efficient as possible. You could write a wrapper class. In any case inline those functions, meaning every call is replaced by their definition in the code.
inline size_t const D3Q27_PopIndex(unsigned int const x, unsigned int const y, unsigned int const z, unsigned int const d)
{
return (D3Q27.SPEEDS + D3Q27.PAD)*(NX*(NY*z + y) + x) + D3Q27.SPEEDS*p + d;
}
Furthermore cache locality can be increased by maximising the ratio between computation and communication for example using three-dimensional spatial loop blocking (Scaling issues with OpenMP), meaning every code works on a cube of cells instead of a single cell.
Generally implementations make use of two distinct populations A and B and perform collision and streaming from one implementation into another. This means every value in memory exists twice, once pre- and once post-collision. There exist different strategies for recombining steps and storing in such a way that you only have to keep one population copy in memory. For instance see the A-A pattern as proposed by P. Bailey et al. - "Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors" (https://www2.cs.arizona.edu/people/pbailey/Accelerating_GPU_LBM.pdf) or the Esoteric Twist by M. Geier & M. Schönherr - "Esoteric Twist: An Efficient in-Place Streaming Algorithmus for the Lattice Boltzmann Method on Massively Parallel Hardware" (https://pdfs.semanticscholar.org/ea64/3d63667900b60e6ff49f2746211700e63802.pdf). I have implemented the first with the use of macros meaning every access of a population calls a macro of the form:
#define O_E(a,b) a*odd + b*(!odd)
#define READ_f_0 D3Q27_PopIndex(x, y, z, 0, p)
#define READ_f_1 D3Q27_PopIndex(O_E(x_m, x), y, z, O_E( 1, 2), p)
#define READ_f_2 D3Q27_PopIndex(O_E(x_p, x), y, z, O_E( 2, 1), p)
...
#define WRITE_f_0 D3Q27_PopIndex(x, y, z, 0, p)
#define WRITE_f_1 D3Q27_PopIndex(O_E(x_p, x), y, z, O_E( 2, 1), p)
#define WRITE_f_2 D3Q27_PopIndex(O_E(x_m, x), y, z, O_E( 1, 2), p)
...
If you have multiple interaction populations use grid merging. Lay the indices out linearly in memory and put two distinct populations side by side. The accessing of population p works then as follows:
inline size_t const D3Q27_PopIndex(unsigned int const x, unsigned int const y, unsigned int const z, unsigned int const d, unsigned int const p = 0)
{
return (D3Q27.SPEEDS*D3Q27.NPOP + D3Q27.PAD)*(NX*(NY*z + y) + x) + D3Q27.SPEEDS*p + d;
}
For a regular grid make the algorithm as predictable as possible. Let every cell perform collision and streaming and then do the boundary conditions in reverse afterwards. If you have many cells that do not contribute directly to the algorithm omit them with a logical mask that you can store in the padding as well!
Make everything know to the compiler at compilation time: Template for example boundary conditions with a function that takes care of index changes so you don't have to rewrite every boundary condition.
Modern architectures have registers that can perform SIMD operations, so the same instruction on multiple data. Some processors (AVX-512) can process up to 512 bits of data and thus 32 doubles almost as fast as a single number. This seems to be very attractive for LBM in particular ever since gathering and scattering have been introduced (https://en.wikipedia.org/wiki/Gather-scatter_(vector_addressing)) but with the current bandwidth limitations (maybe it is worth it with DDR5 and processors with few cores) this is in my opinion not worth the hassle: The single core performance and parallel scaling is better (M. Wittmann et al. - "Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis" - https://arxiv.org/abs/1711.11468) but the overall algorithm performs not any better as it is bandwidth limited. So it only makes sense on architectures that are limited by the computing capacities rather than the bandwidth. On the Xeon Phi architecture the results seem to be remarkable Robertsen et al. - "High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor" (https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.5072).
In my opinion most of this is not worth the effort for simple 2D implementations. Do the easy optimisations there, loop blocking, a linear memory layout but forget about the more complex access patterns. In 3D the effect can be enormous: I have achieved up to 95% parallel scalability and an overall performance of over 150 Mlups with a D3Q19 on a modern 12-core processor. For more performance switch to more adequate architectures like GPUs with CUDA C that are optimised for bandwidth.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js