Efficiently evaluate large polynomials using SIMD

Efficiently evaluate large polynomials using SIMD - c++

I have quite large (20-40 degree) slowly-converging (sometimes) floating-point polynomials. I'd like to optimize their evaluation using SIMD (SSE2, AVX1, AVX-512). I need both float-32 and double-64 solutions.
Value of coefficients are constants given in advance, and value of X to evaluate poly at is given as function argument.
Important note - I have just single input X for my function. So I can't do vertical optimization by computing poly for 8-16 Xs same time. It means I need some horizontal optimization within evaluation for single X.
I created related question that helps me to compute powers of X (e.g. X^1, X^2, ..., X^8), that are needed for SIMD evaluation.
It is obvious that SIMD should be used only after some threshold of polynomial degree, for quite small polys straight generic (non-SIMD) Horner's (or Estrin's) based method can be used like here. Also SIMD width (128 or 256 or 512) should be chosen based on poly degree.
Below I implemented AVX-256-Float32 variant using a kind of modified Horner's Method suited for SIMD (multiplying by x^8 instead of x^1). Credits to #PeterCordes for fast horizontal sum tutorial. Click on try-it-online link, it contains larger code which also has reference simple evaluation for comparison and time measurements:
Try it online!
template <size_t S, size_t I, typename MT = __m256, size_t Cnt>
inline MT EvalPoly8xF32Helper(MT xhi,
std::array<float, Cnt> const & A, MT r = _mm256_undefined_ps()) {
size_t constexpr K = 8;
if constexpr(I + K >= S)
r = _mm256_load_ps(&A[I]);
else {
#ifdef __FMA__
r = _mm256_fmadd_ps(r, xhi, _mm256_load_ps(&A[I]));
#else
r = _mm256_add_ps(_mm256_mul_ps(r, xhi),
_mm256_load_ps(&A[I]));
#endif
}
if constexpr(I < K)
return r;
else
return EvalPoly8xF32Helper<S, I - K>(xhi, A, r);
}
inline float _mm_fast_hsum_ps(__m128 v) {
__m128 shuf = _mm_movehdup_ps(v);
__m128 sums = _mm_add_ps(v, shuf);
shuf = _mm_movehl_ps(shuf, sums);
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
template <size_t S, size_t Cnt>
inline float EvalPoly8xF32(
float x, std::array<float, Cnt> const & A) {
auto constexpr K = 8;
auto const x2 = x * x, x4 = x2 * x2, x8 = x4 * x4, x3 = x2 * x;
auto const powsx = _mm256_setr_ps(
1, x, x2, x3, x4, x4 * x, x4 * x2, x4 * x3);
auto r0 = EvalPoly8xF32Helper<S, (S - 1) / K * K>(
_mm256_set1_ps(x8), A);
r0 = _mm256_mul_ps(r0, powsx);
return _mm_fast_hsum_ps(_mm_add_ps(
_mm256_castps256_ps128(r0), _mm256_extractf128_ps(r0, 1)));
}
As one can see SIMD version gives quite large speedup compared to reference simple implementation. For AVX1-256-float32 and degree 32 case it gives around 4.5x times speedup (for degree 16 it gives 1.8x speedup which is also good)! Obviously that even just using FMA instructions inside reference implementations will already improve computation speed noticeably.
My question is whether you can suggest a faster method of evaluating polynomial, or even some ready-made code or library, or any optimizations to my code.
Most common target CPU that will be used is Intel Xeon Gold 6230 that has AVX-512, so I need to optimize code for it.

Related

Why two consecutive gather instruction perform worse than equivalent elementary ops?

I am upgrading some code from SSE to AVX2. In general I can see that gather instructions are quite useful and benefit performance. However I encountered a case where gather instructions are less efficient than decomposing the gather operations into simpler ones.
In the code below, I have a vector of int32 b, a vector of double xi and 4 int32 indices packed in a 128 bit register bidx. I need to gather first from vector b, than from vector xi. I.e., in pseudo code, I need to do:
__m128i i = b[idx];
__m256d x = xi[i];
In the function below, I implement this in two ways using an #ifdef: via gather instructions, yielding a throughput of 290 Miter/sec and via elementary operations, yielding a throughput of 325 Miter/sec.
Can somebody explain what is going on? Thanks
inline void resolve( const __m256d& z, const __m128i& bidx, int32_t j
, const int32_t *b, const double *xi, int32_t* ri )
{
__m256d x;
__m128i i;
#if 0 // this code uses two gather instructions in sequence
i = _mm_i32gather_epi32(b, bidx, 4)); // i = b[bidx]
x = _mm256_i32gather_pd(xi, i, 8); // x = xi[i]
#else // this code does not use gather instructions
union {
__m128i vec;
int32_t i32[4];
} u;
x = _mm256_set_pd
( xi[(u.i32[3] = b[_mm_extract_epi32(bidx,3)])]
, xi[(u.i32[2] = b[_mm_extract_epi32(bidx,2)])]
, xi[(u.i32[1] = b[_mm_extract_epi32(bidx,1)])]
, xi[(u.i32[0] = b[_mm_cvtsi128_si32(bidx) ])]
);
i = u.vec;
#endif
// here we use x and i
__m256 ps256 = _mm256_castpd_ps(_mm256_cmp_pd(z, x, _CMP_LT_OS));
__m128 lo128 = _mm256_castps256_ps128(ps256);
__m128 hi128 = _mm256_extractf128_ps(ps256, 1);
__m128 blend = _mm_shuffle_ps(lo128, hi128, 0 + (2<<2) + (0<<4) + (2<<6));
__m128i lt = _mm_castps_si128(blend); // this is 0 or -1
i = _mm_add_epi32(i, lt);
_mm_storeu_si128(reinterpret_cast<__m128i*>(ri)+j, i);
}

Since your 'resolve' function is marked as inline I suppose it's called in a high frequency loop. Then you might also have a look at the dependencies of the input parameters from each other outside the 'resolve' function. The compiler might be able to optimize the inlined code better across loop boundaries when using the scalar code variant.

Compute Mutliplication of complex numbers

What I am trying to do ultimately is multiplying two complex numbers like this:
z1 = R1 + I1*j
z2 = R2 + I2*j
z3 = z1 * z2 = (R1*R2 - I1*I2) (R1*I2 + R2*I1)*j;
But what I have are two separate vectors for the real and complex part of both those complex numbers. So something like this:
v1 = [R1, R2, R3, R4 ... Rn] of z1
v2 = [I1, I2, I3, I4 ... In] of z1
v1 = [R1, R2, R3, R4 ... Rn] of z2
v2 = [I1, I2, I3, I4 ... In] of z2
So when I am trying to calculate z3 now, I do this:
foo (std::vector<double> real1, std::vector<double> imag1,
std::vector<double> real2, std::vector<double> imag2)
{
std::vector<double> realResult;
std::vector<double> imagResult;
for (size_t i = 0; i < real1.size(); i++)
{
realResult.push_back(real1[i]*real2[i] - imag1[i]*imag2[i]);
imagResult.push_back(real1[i]*imag2[i] + real2[i]*imag1[i]);
}
//And so on
}
Now, this function is eating a lot of time. There sure is another way of doing that can you think of something that I can use?

You might be able to make use of std::complex. This probably implements operations you require at least close to as well as they can be implemented.
EDIT (In reply to comment):
I would do this:
size_t num_items = real1.size();
std::vector<double> realResult;
realResult.reserve(num_items);
std::vector<double> imagResult;
imagResult.reserve(num_items);
for (size_t i = 0; i < num_items; ++i) {
// lalala not re-sizeing any vectors yey!
realResult.push_back(real1[i] * real2[i] - imag1[i] * imag2[i]);
imagResult.push_back(real1[i] * imag2[i] + real2[i] * imag1[i]);
}
Otherwise if you have a large input array and you are doing a lot of multiplication on doubles I'm afraid that might just be slow. Best you can do is mess around with getting things contiguous in memory for bonus cache points. Impossible to really say without profiling the code exactly what might work best.

Pass in parameter as const std::vector<double>& to avoid unnecessary copy
You may also consider computing each multiplication in parallel, if N is big enough, the overhead of parallel computing is worthwhile

Use std::valarray of std::complex. It is simple and optimized for arithmetic operations
foo(std::valarray<std::complex<double>> & z1,
std::valarray<std::complex<double>> & z2)
{
auto z3 = z1 * z2; // applies to each element of two valarrays, or a valarray and a value
// . . .
}
EDIT: Convert vectors to valarray
std::valarray<std::complex<double>> z1(real1.size());
for (size_t i = 0; i < z1.size(); ++i)
z1[i] = std::complex<double>(real1[i], imag1[i]);

Faster computation of (approximate) variance needed

I can see with the CPU profiler, that the compute_variances() is the bottleneck of my project.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
75.63 5.43 5.43 40 135.75 135.75 compute_variances(unsigned int, std::vector<Point, std::allocator<Point> > const&, float*, float*, unsigned int*)
19.08 6.80 1.37 readDivisionSpace(Division_Euclidean_space&, char*)
...
Here is the body of the function:
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims) {
for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
float delta, n;
for (size_t i = 0; i < points.size(); ++i) {
n = 1.0 + i;
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, points[0].dim(), t, split_dims);
}
where kthLargest() doesn't seem to be a problem, since I see that:
0.00 7.18 0.00 40 0.00 0.00 kthLargest(float*, int, int, unsigned int*)
The compute_variances() takes a vector of vectors of floats (i.e. a vector of Points, where Points is a class I have implemented) and computes the variance of them, in each dimension (with regard to the algorithm of Knuth).
Here is how I call the function:
float avg[(*points)[0].dim()];
float var[(*points)[0].dim()];
size_t split_dims[t];
compute_variances(t, *points, avg, var, split_dims);
The question is, can I do better? I would really happy to pay the trade-off between speed and approximate computation of variances. Or maybe I could make the code more cache friendly or something?
I compiled like this:
g++ main_noTime.cpp -std=c++0x -p -pg -O3 -o eg
Notice, that before edit, I had used -o3, not with a capital 'o'. Thanks to ypnos, I compiled now with the optimization flag -O3. I am sure that there was a difference between them, since I performed time measurements with one of these methods in my pseudo-site.
Note that now, compute_variances is dominating the overall project's time!
[EDIT]
copute_variances() is called 40 times.
Per 10 calls, the following hold true:
points.size() = 1000 and points[0].dim = 10000
points.size() = 10000 and points[0].dim = 100
points.size() = 10000 and points[0].dim = 10000
points.size() = 100000 and points[0].dim = 100
Each call handles different data.
Q: How fast is access to points[i][d]?
A: point[i] is just the i-th element of std::vector, where the second [], is implemented as this, in the Point class.
const FT& operator [](const int i) const {
if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning
}
where coords is a std::vector of float values. This seems a bit heavy, but shouldn't the compiler be smart enough to predict correctly that the branch is always true? (I mean after the cold start). Moreover, the std::vector.at() is supposed to be constant time (as said in the ref). I changed this to have only .at() in the body of the function and the time measurements remained, pretty much, the same.
The division in the compute_variances() is for sure something heavy! However, Knuth's algorithm was a numerical stable one and I was not able to find another algorithm, that would de both numerical stable and without division.
Note that I am not interesting in parallelism right now.
[EDIT.2]
Minimal example of Point class (I think I didn't forget to show something):
class Point {
public:
typedef float FT;
...
/**
* Get dimension of point.
*/
size_t dim() const {
return coords.size();
}
/**
* Operator that returns the coordinate at the given index.
* #param i - index of the coordinate
* #return the coordinate at index i
*/
FT& operator [](const int i) {
return coords.at(i);
//it's the same if I have the commented code below
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
/**
* Operator that returns the coordinate at the given index. (constant)
* #param i - index of the coordinate
* #return the coordinate at index i
*/
const FT& operator [](const int i) const {
return coords.at(i);
/*if (i < (int) coords.size() && i >= 0)
return coords.at(i);
else {
std::cout << "Error at Point::[]" << std::endl;
exit(1);
}
return coords[0]; // Clear -Wall warning*/
}
private:
std::vector<FT> coords;
};

1. SIMD
One easy speedup for this is to use vector instructions (SIMD) for the computation. On x86 that means SSE, AVX instructions. Based on your word length and processor you can get speedups of about x4 or even more. This code here:
for (size_t d = 0; d < points[0].dim(); ++d) {
delta = (points[i][d]) - avg[d];
avg[d] += delta / n;
var[d] += delta * ((points[i][d]) - avg[d]);
}
can be sped-up by doing the computation for four elements at once with SSE. As your code really only processes one single element in each loop iteration, there is no bottleneck. If you go down to 16bit short instead of 32bit float (an approximation then), you can fit eight elements in one instruction. With AVX it would be even more, but you need a recent processor for that.
It is not the solution to your performance problem, but just one of them that can also be combined with others.
2. Micro-parallelizm
The second easy speedup when you have that many loops is to use parallel processing. I typically use Intel TBB, others might suggest OpenMP instead. For this you would probably have to change the loop order. So parallelize over d in the outer loop, not over i.
You can combine both techniques, and if you do it right, on a quadcore with HT you might get a speed-up of 25-30 for the combination without any loss in accuracy.
3. Compiler optimization
First of all maybe it is just a typo here on SO, but it needs to be -O3, not -o3!
As a general note, it might be easier for the compiler to optimize your code if you declare the variables delta, n within the scope where you actually use them. You should also try the -funroll-loops compiler option as well as -march. The option to the latter depends on your CPU, but nowadays typically -march core2 is fine (also for recent AMDs), and includes SSE optimizations (but I would not trust the compiler just yet to do that for your loop).

The big problem with your data structure is that it's essentially a vector<vector<float> >. That's a pointer to an array of pointers to arrays of float with some bells and whistles attached. In particular, accessing consecutive Points in the vector doesn't correspond to accessing consecutive memory locations. I bet you see tons and tons of cache misses when you profile this code.
Fix this before horsing around with anything else.
Lower-order concerns include the floating-point division in the inner loop (compute 1/n in the outer loop instead) and the big load-store chain that is your inner loop. You can compute the means and variances of slices of your array using SIMD and combine them at the end, for instance.
The bounds-checking once per access probably doesn't help, either. Get rid of that too, or at least hoist it out of the inner loop; don't assume the compiler knows how to fix that on its own.

Here's what I would do, in guesstimated order of importance:
Return the floating-point from the Point::operator[] by value, not by reference.
Use coords[i] instead of coords.at(i), since you already assert that it's within bounds. The at member checks the bounds. You only need to check it once.
Replace the home-baked error indication/checking in the Point::operator[] with an assert. That's what asserts are for. They are nominally no-ops in release mode - I doubt that you need to check it in release code.
Replace the repeated division with a single division and repeated multiplication.
Remove the need for wasted initialization by unrolling the first two iterations of the outer loop.
To lessen impact of cache misses, run the inner loop alternatively forwards then backwards. This at least gives you a chance at using some cached avg and var. It may in fact remove all cache misses on avg and var if prefetch works on reverse order of iteration, as it well should.
On modern C++ compilers, the std::fill and std::copy can leverage type alignment and have a chance at being faster than the C library memset and memcpy.
The Point::operator[] will have a chance of getting inlined in the release build and can reduce to two machine instructions (effective address computation and floating point load). That's what you want. Of course it must be defined in the header file, otherwise the inlining will only be performed if you enable link-time code generation (a.k.a. LTO).
Note that the Point::operator[]'s body is only equivalent to the single-line
return coords.at(i) in a debug build. In a release build the entire body is equivalent to return coords[i], not return coords.at(i).
FT Point::operator[](int i) const {
assert(i >= 0 && i < (int)coords.size());
return coords[i];
}
const FT * Point::constData() const {
return &coords[0];
}
void compute_variances(size_t t, const std::vector<Point>& points, float* avg,
float* var, size_t* split_dims)
{
assert(points.size() > 0);
const int D = points[0].dim();
// i = 0, i_n = 1
assert(D > 0);
#if __cplusplus >= 201103L
std::copy_n(points[0].constData(), D, avg);
#else
std::copy(points[0].constData(), points[0].constData() + D, avg);
#endif
// i = 1, i_n = 0.5
if (points.size() >= 2) {
assert(points[1].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[1][d] - avg[d];
avg[d] += delta * 0.5f;
var[d] = delta * (points[1][d] - avg[d]);
}
} else {
std::fill_n(var, D, 0.0f);
}
// i = 2, ...
for (size_t i = 2; i < points.size(); ) {
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = 0; d < D; ++d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
if (i >= points.size()) break;
{
const float i_n = 1.0f / (1.0f + i);
assert(points[i].dim() == D);
for (int d = D - 1; d >= 0; --d) {
float const delta = points[i][d] - avg[d];
avg[d] += delta * i_n;
var[d] += delta * (points[i][d] - avg[d]);
}
}
++ i;
}
/* Find t dimensions with largest scaled variance. */
kthLargest(var, D, t, split_dims);
}

for (size_t d = 0; d < points[0].dim(); d++) {
avg[d] = 0.0;
var[d] = 0.0;
}
This code could be optimized by simply using memset. The IEEE754 representation of 0.0 in 32bits is 0x00000000. If the dimension is big, it worth it.
Something like:
memset((void*)avg, 0, points[0].dim() * sizeof(float));
In your code, you have a lot of calls to points[0].dim(). It would be better to call once at the beginning of the function and store in a variable. Likely, the compiler already does this (since you are using -O3).
The division operations are a lot more expensive (from clock-cycle POV) than other operations (addition, subtraction).
avg[d] += delta / n;
It could make sense, to try to reduce the number of divisions: use partial non-cumulative average calculation, that would result in Dim division operation for N elements (instead of N x Dim); N < points.size()
Huge speedup could be achieved, using Cuda or OpenCL, since the calculation of avg and var could be done simultaneously for each dimension (consider using a GPU).

Another optimization is cache optimization including both data cache and instruction cache.
High level optimization techniques
Data Cache optimizations
Example of data cache optimization & unrolling
for (size_t d = 0; d < points[0].dim(); d += 4)
{
// Perform loading all at once.
register const float p1 = points[i][d + 0];
register const float p2 = points[i][d + 1];
register const float p3 = points[i][d + 2];
register const float p4 = points[i][d + 3];
register const float delta1 = p1 - avg[d+0];
register const float delta2 = p2 - avg[d+1];
register const float delta3 = p3 - avg[d+2];
register const float delta4 = p4 - avg[d+3];
// Perform calculations
avg[d + 0] += delta1 / n;
var[d + 0] += delta1 * ((p1) - avg[d + 0]);
avg[d + 1] += delta2 / n;
var[d + 1] += delta2 * ((p2) - avg[d + 1]);
avg[d + 2] += delta3 / n;
var[d + 2] += delta3 * ((p3) - avg[d + 2]);
avg[d + 3] += delta4 / n;
var[d + 3] += delta4 * ((p4) - avg[d + 3]);
}
This differs from classic loop unrolling in that loading from the matrix is performed as a group at the top of the loop.
Edit 1:
A subtle data optimization is to place the avg and var into a structure. This will ensure that the two arrays are next to each other in memory, sans padding. The data fetching mechanism in processors like datums that are very close to each other. Less chance for data cache miss and better chance to load all of the data into the cache.

You could use Fixed Point math instead of floating point math as an optimization.
Optimization via Fixed Point
Processors love to manipulate integers (signed or unsigned). Floating point may take extra computing power due to the extraction of the parts, performing the math, then reassemblying the parts. One mitigation is to use Fixed Point math.
Simple Example: meters
Given the unit of meters, one could express lengths smaller than a meter by using floating point, such as 3.14159 m. However, the same length can be expressed in a unit of finer detail like millimeters, e.g. 3141.59 mm. For finer resolution, a smaller unit is chosen and the value multiplied, e.g. 3,141,590 um (micrometers). The point is choosing a small enough unit to represent the floating point accuracy as an integer.
The floating point value is converted at input into Fixed Point. All data processing occurs in Fixed Point. The Fixed Point value is convert to Floating Point before outputting.
Power of 2 Fixed Point Base
As with converting from floating point meters to fixed point millimeters, using 1000, one could use a power of 2 instead of 1000. Selecting a power of 2 allows the processor to use bit shifting instead of multiplication or division. Bit shifting by a power of 2 is usually faster than multiplication or division.
Keeping with the theme and accuracy of millimeters, we could use 1024 as the base instead of 1000. Similarly, for higher accuracy, use 65536 or 131072.
Summary
Changing the design or implementation to used Fixed Point math allows the processor to use more integral data processing instructions than floating point. Floating point operations consume more processing power than integral operations in all but specialized processors. Using powers of 2 as the base (or denominator) allows code to use bit shifting instead of multiplication or division. Division and multiplication take more operations than shifting and thus shifting is faster. So rather than optimizing code for execution (such as loop unrolling), one could try using Fixed Point notation rather than floating point.

Point 1.
You're computing the average and the variance at the same time.
Is that right?
Don't you have to calculate the average first, then once you know it, calculate the sum of squared differences from the average?
In addition to being right, it's more likely to help performance than hurt it.
Trying to do two things in one loop is not necessarily faster than two consecutive simple loops.
Point 2.
Are you aware that there is a way to calculate average and variance at the same time, like this:
double sumsq = 0, sum = 0;
for (i = 0; i < n; i++){
double xi = x[i];
sum += xi;
sumsq += xi * xi;
}
double avg = sum / n;
double avgsq = sumsq / n
double variance = avgsq - avg*avg;
Point 3.
The inner loops are doing repetitive indexing.
The compiler might be able to optimize that to something minimal, but I wouldn't bet my socks on it.
Point 4.
You're using gprof or something like it.
The only reasonably reliable number to come out of it is self-time by function.
It won't tell you very well how time is spent inside the function.
I and many others rely on this method, which takes you straight to the heart of what takes time.

C++ (and maths) : fast approximation of a trigonometric function

I know this is a recurring question, but I haven't really found a useful answer yet. I'm basically looking for a fast approximation of the function acos in C++, I'd like to know if I can significantly beat the standard one.
But some of you might have insights on my specific problem: I'm writing a scientific program which I need to be very fast. The complexity of the main algorithm boils down to computing the following expression (many times with different parameters):
sin( acos(t_1) + acos(t_2) + ... + acos(t_n) )
where the t_i are known real (double) numbers, and n is very small (like smaller than 6). I need a precision of at least 1e-10. I'm currently using the standard sin and acos C++ functions.
Do you think I can significantly gain speed somehow? For those of you who know some maths, do you think it would be smart to expand that sine in order to get an algebraic expression in terms of the t_i (only involving square roots)?
Thank you your your answers.

The code below provides simple implementations of sin() and acos() that should satisfy your accuracy requirements and that you might want to try. Please note that the math library implementation on your platform is very likely highly tuned for the specific hardware capabilities of that platform and is probably also coded in assembly for maximum efficiency, so simple compiled C code not catering to specifics of the hardware is unlikely to provide higher performance, even when the accuracy requirements are somewhat relaxed from full double precision. As Viktor Latypov points out, it may also be worthwhile to search for algorithmic alternatives that do not require expensive calls to transcendental math functions.
In the code below I have tried to stick to simple, portable constructs. If your compiler supports the rint() function [specified by C99 and C++11] you might want to use that instead of my_rint(). On some platforms, the call to floor() can be expensive since it requires dynamic changing of machine state. The functions my_rint(), sin_core(), cos_core(), and asin_core() would want to be inlined for best performance. Your compiler may do that automatically at high optimization levels (e.g. when compiling with -O3), or you could add an appropriate inlining attribute to these functions, e.g. inline or __inline depending on your toolchain.
Not knowing anything about your platform I opted for simple polynomial approximations, which are evaluated using Estrin's scheme plus Horner's scheme. See Wikipedia for a description of these evaluation schemes:
http://en.wikipedia.org/wiki/Estrin%27s_scheme ,
http://en.wikipedia.org/wiki/Horner_scheme
The approximations themselves are of the minimax type and were custom generated for this answer with the Remez algorithm:
http://en.wikipedia.org/wiki/Minimax_approximation_algorithm ,
http://en.wikipedia.org/wiki/Remez_algorithm
The identities used in the argument reduction for acos() are noted in the comments, for sin() I used a Cody/Waite-style argument reduction, as described in the following book:
W. J. Cody, W. Waite, Software Manual for the Elementary Functions. Prentice-Hall, 1980
The error bounds mentioned in the comments are approximate, and have not been rigorously tested or proven.
/* not quite rint(), i.e. results not properly rounded to nearest-or-even */
double my_rint (double x)
{
double t = floor (fabs(x) + 0.5);
return (x < 0.0) ? -t : t;
}
/* minimax approximation to cos on [-pi/4, pi/4] with rel. err. ~= 7.5e-13 */
double cos_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using Estrin's scheme */
return (-2.7236370439787708e-7 * x2 + 2.4799852696610628e-5) * x8 +
(-1.3888885054799695e-3 * x2 + 4.1666666636943683e-2) * x4 +
(-4.9999999999963024e-1 * x2 + 1.0000000000000000e+0);
}
/* minimax approximation to sin on [-pi/4, pi/4] with rel. err. ~= 5.5e-12 */
double sin_core (double x)
{
double x4, x2, t;
x2 = x * x;
x4 = x2 * x2;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return ((2.7181216275479732e-6 * x2 - 1.9839312269456257e-4) * x4 +
(8.3333293048425631e-3 * x2 - 1.6666666640797048e-1)) * x2 * x + x;
}
/* minimax approximation to arcsin on [0, 0.5625] with rel. err. ~= 1.5e-11 */
double asin_core (double x)
{
double x8, x4, x2;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;
/* evaluate polynomial using a mix of Estrin's and Horner's scheme */
return (((4.5334220547132049e-2 * x2 - 1.1226216762576600e-2) * x4 +
(2.6334281471361822e-2 * x2 + 2.0596336163223834e-2)) * x8 +
(3.0582043602875735e-2 * x2 + 4.4630538556294605e-2) * x4 +
(7.5000364034134126e-2 * x2 + 1.6666666300567365e-1)) * x2 * x + x;
}
/* relative error < 7e-12 on [-50000, 50000] */
double my_sin (double x)
{
double q, t;
int quadrant;
/* Cody-Waite style argument reduction */
q = my_rint (x * 6.3661977236758138e-1);
quadrant = (int)q;
t = x - q * 1.5707963267923333e+00;
t = t - q * 2.5633441515945189e-12;
if (quadrant & 1) {
t = cos_core(t);
} else {
t = sin_core(t);
}
return (quadrant & 2) ? -t : t;
}
/* relative error < 2e-11 on [-1, 1] */
double my_acos (double x)
{
double xa, t;
xa = fabs (x);
/* arcsin(x) = pi/2 - 2 * arcsin (sqrt ((1-x) / 2))
* arccos(x) = pi/2 - arcsin(x)
* arccos(x) = 2 * arcsin (sqrt ((1-x) / 2))
*/
if (xa > 0.5625) {
t = 2.0 * asin_core (sqrt (0.5 * (1.0 - xa)));
} else {
t = 1.5707963267948966 - asin_core (xa);
}
/* arccos (-x) = pi - arccos(x) */
return (x < 0.0) ? (3.1415926535897932 - t) : t;
}

sin( acos(t1) + acos(t2) + ... + acos(tn) )
boils down to the calculation of
sin( acos(x) ) and cos(acos(x))=x
because
sin(a+b) = cos(a)sin(b)+sin(a)cos(b).
The first thing is
sin( acos(x) ) = sqrt(1-x*x)
Taylor series expansion for the sqrt reduces the problem to polynomial calculations.
To clarify, here's the expansion to n=2, n=3:
sin( acos(t1) + acos(t2) ) = sin(acos(t1))cos(acos(t2)) + sin(acos(t2))cos(acos(t1) = sqrt(1-t1*t1) * t2 + sqrt(1-t2*t2) * t1
cos( acos(t2) + acos(t3) ) = cos(acos(t2)) cos(acos(t3)) - sin(acos(t2))sin(acos(t3)) = t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)
sin( acos(t1) + acos(t2) + acos(t3)) =
sin(acos(t1))cos(acos(t2) + acos(t3)) + sin(acos(t2)+acos(t3) )cos(acos(t1)=
sqrt(1-t1*t1) * (t2*t3 - sqrt(1-t2*t2)*sqrt(1-t3*t3)) + (sqrt(1-t2*t2) * t3 + sqrt(1-t3*t3) * t2 ) * t1
and so on.
The sqrt() for x in (-1,1) can be computed using
x_0 is some approximation, say, zero
x_(n+1) = 0.5 * (x_n + S/x_n) where S is the argument.
EDIT: I mean the "Babylonian method", see Wikipedia's article for details. You will need not more than 5-6 iterations to achieve 1e-10 with x in (0,1).

As Jonas Wielicki mentions in the comments, there isn't much precision trade-offs you can make.
Your best bet is to try and use the processor intrinsics for the functions (if your compiler doesn't do this already) and using some math to reduce the amount of calculations necessary.
Also very important is to keep everything in a CPU-friendly format, make sure there are few cache misses, etc.
If you are calculating large amounts of functions like acos perhaps moving to the GPU is an option for you?

You can try to create lookup tables, and use them instead of standard c++ functions, and see if you see any performance boost.

Significant gains can be made by aligning memory and streaming in the data to your kernel. Most often this dwarfs the gains that can be made by recreating the math functions. Think of how you can improve memory access to/from your kernel operator.
Memory access can be improved by using buffering techniques. This depends on your hardware platform. If you are running this on a DSP, you could DMA your data onto an L2 cache and schedule the instructions so that multiplier units are fully occupied.
If you are on general purpose CPU, most you can do is to use aligned data, feed the cache lines by prefetching. If you have nested loops, then the inner most loop should go back and forth (i.e. iterate forward and then iterate backward) so that cache lines are utilised, etc.
You could also think of ways to parallelize the computation using multiple cores. If you can use a GPU this could significantly improve performance (albeit with a lesser precision).

In addition to what others have said, here are some techniques at speed optimization:
Profile
Find out where in the code most of the time is spent.
Only optimize that area to gain the mose benefit.
Unroll Loops
The processors don't like branches or jumps or changes in the execution path. In general, the processor has to reload the instruction pipeline which uses up time that can be spent on calculations. This includes function calls.
The technique is to place more "sets" of operations in your loop and reduce the number of iterations.
Declare Variables as Register
Variables that are used frequently should be declared as register. Although many members of SO have stated compilers ignore this suggestion, I have found out otherwise. Worst case, you wasted some time typing.
Keep Intense Calculations Short & Simple
Many processors have enough room in their instruction pipelines to hold small for loops. This reduces the amount of time spent reloading the instruction pipeline.
Distribute your big calculation loop into many small ones.
Perform Work on Small Sections of Arrays & Matrices
Many processors have a data cache, which is ultra fast memory very close to the processor. The processor likes to load the data cache once from off-processor memory. More loads require time that can be spent making calculations. Search the web for "Data Oriented Design Cache".
Think in Parallel Processor Terms
Change the design of your calculations so they can be easily adaptable to use with multiple processors. Many CPUs have multiple cores that can execute instructions in parallel. Some processors have enough intelligence to automatically delegate instructions to their multiple cores.
Some compilers can optimize code for parallel processing (look up the compiler options for your compiler). Designing your code for parallel processing will make this optimization easier for the compiler.
Analyze Assembly Listing of Functions
Print out the assembly language listing of your function.
Change the design of your function to match that of the assembly language or to help the compiler generate more optimal assembly language.
If you really need more efficiency, optimize the assembly language and put in as inline assembly code or as a separate module. I generally prefer the latter.
Examples
In your situation, take first 10 terms of the Taylor expansion, calculate them separately and place into individual variables:
double term1, term2, term3, term4;
double n, n1, n2, n3, n4;
n = 1.0;
for (i = 0; i < 100; ++i)
{
n1 = n + 2;
n2 = n + 4;
n3 = n + 6;
n4 = n + 8;
term1 = 4.0/n;
term2 = 4.0/n1;
term3 = 4.0/n2;
term4 = 4.0/n3;
Then sum up all of your terms:
result = term1 - term2 + term3 - term4;
// Or try sorting by operation, if possible:
// result = term1 + term3;
// result -= term2 + term4;
n = n4 + 2;
}

Lets consider two terms first:
cos(a+b) = cos(a)*cos(b) - sin(a)*sin(b)
or cos(a+b) = cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b))
Taking cos to the RHS
a+b = acos( cos(a)*cos(b) - sqrt(1-cos(a)*cos(a))*sqrt(1-cos(b)*cos(b)) ) ... 1
Here cos(a) = t_1 and cos(b) = t_2
a = acos(t_1) and b = acos(t_2)
By substituting in equation (1), we get
acos(t_1) + acos(t_2) = acos(t_1*t_2 - sqrt(1 - t_1*t_1) * sqrt(1 - t_2*t_2))
Here you can see that you have combined two acos into one. So you can pair up all the acos recursively and form a binary tree. At the end, you'll be left with an expression of the form sin(acos(x)) which equals sqrt(1 - x*x).
This will improve the time complexity.
However, I'm not sure about the complexity of calculating sqrt().

Using Perlin noise to create lightning?

Actually I am having several questions related to the subject given in the topic title.
I am already using Perlin functions to create lightning in my application, but I am not totally happy about my implementation.
The following questions are based on the initial and the improved Perlin noise implementations.
To simplify the issue, let's assume I am creating a simple 2D lightning by modulating the height of a horizontal line consisting of N nodes at these nodes using a 1D Perlin function.
As far as I have understood, two subsequent values passed to the Perlin function must differ by at least one, or the resulting two values will be identical. That is because with the simple Perlin implementation, the Random function works with an int argument, and in the improved implementation values are mapped to [0..255] and are then used as index into an array containing the values [0..255] in a random distribution. Is that right?
How do I achieve that the first and the last offset value (i.e. for nodes 0 and N-1) returned by the Perlin function is always 0 (zero)? Right now I am modulation a sine function (0 .. Pi) with my Perlin function to achieve that, but that's not really what I want. Just setting them to zero is not what I want, since I want a nice lightning path w/o jaggies at its ends.
How do I vary the Perlin function (so that I would get two different paths I could use as animation start and end frames for the lightning)? I could of course add a fixed random offset per path calculation to each node value, or use a differently setup permutation table for improved Perlin noise, but are there better options?

That depends on how you implement it and sample from it. Using multiple octaves helps counter integers quite a bit.
The octaves and additional interpolation/sampling done for each provides much of the noise in perlin noise. In theory, you should not need to use different integer positions; you should be able to sample at any point and it will be similar (but not always identical) to nearby values.
I would suggest using the perlin as a multiplier instead of simply additive, and use a curve over the course of the lightning. For example, having perlin in the range [-1.5, 1.5] and a normal curve over the lightning (0 at both ends, 1 in the center), lightning + (perlin * curve) will keep your ends points still. Depending on how you've implemented your perlin noise generator, you may need something like:
lightning.x += ((perlin(lightning.y, octaves) * 2.0) - 0.5) * curve(lightning.y);
if perlin returns [0,1] or
lightning.x += (perlin(lightning.y, octaves) / 128.0) * curve(lightning.y);
if it returns [0, 255]. Assuming lightning.x started with a given value, perhaps 0, that would give a somewhat jagged line that still met the original start and end points.
Add a dimension to the noise for every dimension you add to the lightning. If you're modifying the lightning in one dimension (horizontal jagged), you need 1D perlin noise. If you want to animate it, you need 2D. If you wanted lightning that was jagged on two axis and animated, you'd need 3D noise, and so on.

After reading peachykeen's answer and doing some (more) own research in the internet, I have found the following solution to work for me.
With my implementation of Perlin noise, using a value range of [0.0 .. 1.0] for the lightning path nodes work best, passing the value (double) M / (double) N for node M to the Perlin noise function.
To have a noise function F' return the same value for node 0 and node N-1, the following formula can be applied: F'(M) = ((M - N) * F(N) + N * F (N - M)) / M. In order to have the lightning path offsets begin and end with 0, you simply need to subtract F'(0) from all lightning path offsets after having computed the path.
To randomize the lightning path, before computing the offsets for each path node, a random offset R can be computed and added to the values passed to the noise function, so that a node's offset O = F'(N+R). To animate a lightning, two lightning paths need to be computed (start and end frame), and then each path vertex has to be lerped between its start and end position. Once the end frame has been reached, the end frame becomes the start frame and a new end frame is computed. For a 3D path, for each path node N two offset vectors can be computed that are perpendicular to the path at node N and each other, and can be scaled with two 1D Perlin noise values to lerp the node position from start to end frame position. That may be cheaper than doing 3D Perlin noise and works quite well in my application.
Here is my implementation of standard 1D Perlin noise as a reference (some stuff is virtual because I am using this as base for improved Perlin noise, allowing to use standard or improved Perlin noise in a strategy pattern application. The code has been simplified somewhat as well to make it more concise for publishing it here):
Header file:
#ifndef __PERLIN_H
#define __PERLIN_H
class CPerlin {
private:
int m_randomize;
protected:
double m_amplitude;
double m_persistence;
int m_octaves;
public:
virtual void Setup (double amplitude, double persistence, int octaves, int randomize = -1);
double ComputeNoise (double x);
protected:
double LinearInterpolate (double a, double b, double x);
double CosineInterpolate (double a, double b, double x);
double CubicInterpolate (double v0, double v1, double v2, double v3, double x);
double Noise (int v);
double SmoothedNoise (int x);
virtual double InterpolatedNoise (double x);
};
#endif //__PERLIN_H
Implementation:
#include <math.h>
#include <stdlib.h>
#include "perlin.h"
#define INTERPOLATION_METHOD 1
#ifndef Pi
# define Pi 3.141592653589793240
#endif
inline double CPerlin::Noise (int n) {
n = (n << 13) ^ n;
return 1.0 - ((n * (n * n * 15731 + 789221) + 1376312589) & 0x7fffffff) / 1073741824.0;
}
double CPerlin::LinearInterpolate (double a, double b, double x) {
return a * (1.0 - x) + b * x;
}
double CPerlin::CosineInterpolate (double a, double b, double x) {
double f = (1.0 - cos (x * Pi)) * 0.5;
return a * (1.0 - f) + b * f;
}
double CPerlin::CubicInterpolate (double v0, double v1, double v2, double v3, double x) {
double p = (v3 - v2) - (v0 - v1);
double x2 = x * x;
return v1 + (v2 - v0) * x + (v0 - v1 - p) * x2 + p * x2 * x;
}
double CPerlin::SmoothedNoise (int v) {
return Noise (v) / 2 + Noise (v-1) / 4 + Noise (v+1) / 4;
}
int FastFloor (double v) { return (int) ((v < 0) ? v - 1 : v; }
double CPerlin::InterpolatedNoise (double v) {
int i = FastFloor (v);
double v1 = SmoothedNoise (i);
double v2 = SmoothedNoise (i + 1);
#if INTERPOLATION_METHOD == 2
double v0 = SmoothedNoise (i - 1);
double v3 = SmoothedNoise (i + 2);
return CubicInterpolate (v0, v1, v2, v3, v - i);
#elif INTERPOLATION_METHOD == 1
return CosineInterpolate (v1, v2, v - i);
#else
return LinearInterpolate (v1, v2, v - i);
#endif
}
double CPerlin::ComputeNoise (double v) {
double total = 0, amplitude = m_amplitude, frequency = 1.0;
v += m_randomize;
for (int i = 0; i < m_octaves; i++) {
total += InterpolatedNoise (v * frequency) * amplitude;
frequency *= 2.0;
amplitude *= m_persistence;
}
return total;
}
void CPerlin::Setup (double amplitude, double persistence, int octaves, int randomize) {
m_amplitude = (amplitude > 0.0) ? amplitude : 1.0;
m_persistence = (persistence > 0.0) ? persistence : 2.0 / 3.0;
m_octaves = (octaves > 0) ? octaves : 6;
m_randomize = (randomize < 0) ? (rand () * rand ()) & 0xFFFF : randomize;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js