Looking for a way to speed up a function - C++

I'm trying to speed up a big block of code across many files and found out that one function uses about 70% of the total time. This is because this function is called 477+ million times.
The pointer array par can only be one of two presets, either
par[0] = 0.057;
par[1] = 2.87;
par[2] = -3.;
par[3] = -0.03;
par[4] = -3.05;
par[5] = -3.5;
OR
par[0] = 0.043;
par[1] = 2.92;
par[2] = -3.21;
par[3] = -0.065;
par[4] = -3.00;
par[5] = -2.65;
So I've tried plugging in numbers depending on which preset it is, but have failed to find any significant time savings.
The pow and exp functions seem to be called nearly every time; they account for about 40% and 20% of the total time respectively, so only 10% of the total time is spent in the parts of this function that aren't pow or exp. Speeding those up would probably help the most, but none of the exponents used in pow are integers except -4, and I don't know whether 1/(x*x*x*x) is faster than pow(x, -4).
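On the pow(x, -4) question: for a known integer exponent like -4, repeated multiplication is usually much cheaper than a general-purpose pow call. A minimal sketch (the helper name ipow4_recip is mine):

```cpp
#include <cmath>

// Hypothetical helper: x^-4 via two multiplies and one divide,
// instead of a general-purpose pow(x, -4.0) call.
inline double ipow4_recip(double x) {
    double x2 = x * x;       // x^2
    return 1.0 / (x2 * x2);  // 1 / x^4
}
```

The results agree with std::pow(x, -4.0) to within rounding error, so this is worth benchmarking against the library call.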
double Signal::Param_RE_Tterm_approx(double Tterm, double *par) {
    double value = 0.;
    // time after Che angle peak
    if (Tterm > 0.) {
        if ( fabs(Tterm/ *par) >= 1.e-2) {
            value += -1./(*par)*exp(-1.*Tterm/(*par));
        }
        else {
            value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
        }
        if ( fabs(Tterm* *(par+1)) >= 1.e-2) {
            value += *(par+2)* *(par+1)*pow( 1.+*(par+1)*Tterm, *(par+2)-1. );
        }
        else {
            value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
        }
    }
    // time before Che angle peak
    else {
        if ( fabs(Tterm/ *(par+3)) >= 1.e-2 ) {
            value += -1./ *(par+3) *exp(-1.*Tterm/ *(par+3));
        }
        else {
            value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
        }
        if ( fabs(Tterm* *(par+4)) >= 1.e-2 ) {
            value += *(par+5)* *(par+4) *pow( 1.+ *(par+4)*Tterm, *(par+5)-1. );
        }
        else {
            value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
        }
    }
    return value * 1.e9;
}

I first rewrote it to be a bit easier to follow:
#include <math.h>
double Param_RE_Tterm_approx(double Tterm, double const* par) {
    double value = 0.;
    if (Tterm > 0.) {
        // time after Che angle peak
        if ( fabs(Tterm / par[0]) >= 1.e-2 ) {
            value += -1./par[0]*exp(-1.*Tterm/par[0]);
        } else {
            value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
        }
        if ( fabs(Tterm * par[1]) >= 1.e-2 ) {
            value += par[2]*par[1]*pow( 1.+par[1]*Tterm, par[2]-1. );
        } else {
            value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
        }
    } else {
        // time before Che angle peak
        if ( fabs(Tterm / par[3]) >= 1.e-2 ) {
            value += -1./par[3]*exp(-1.*Tterm/par[3]);
        } else {
            value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
        }
        if ( fabs(Tterm * par[4]) >= 1.e-2 ) {
            value += par[5]*par[4]*pow( 1.+par[4]*Tterm, par[5]-1. );
        } else {
            value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
        }
    }
    return value * 1.e9;
}
We can then look at its structure.
There are two main branches -- Tterm positive (after the peak) and negative (before). These correspond to using par[0], par[1], par[2] or par[3], par[4], par[5] respectively.
Then in each case we add two terms to value. In both branches, for small arguments we use a polynomial, and for large arguments we use an exponential/power expression.
As a guess, this is because the polynomial is a decent approximation for the exponential for small values -- the error is acceptable. What you should do is confirm that guess -- take a look at the Taylor series expansion of the "big" power/exponent based equation, and see if it agrees with the polynomials somehow. Or check numerically.
If it is the case, this means that this equation has a known amount of error that is acceptable. Quite often there are faster versions of exp or pow that have a known amount of max error; consider using those.
If this isn't the case, there still could be an acceptable amount of error, but the Taylor series approximation can give you "in code" information about what is an acceptable amount of error.
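Checking numerically is cheap. A sketch (the helper name is mine) that measures the relative error of the cubic polynomial against the exact exp term near the 1e-2 crossover the code uses:

```cpp
#include <cmath>

// Relative error of the cubic Taylor polynomial for -1/p * exp(-t/p),
// evaluated near the branch point |t/p| = 1e-2 where the code switches forms.
inline double taylor_rel_error(double t, double p) {
    double exact = -1.0 / p * std::exp(-t / p);
    double r = t / p;
    double poly = -1.0 / p * (1.0 - r + r * r / 2.0 - r * r * r / 6.0);
    return std::fabs((poly - exact) / exact);
}
```

At r = t/p = 1e-2 the next Taylor term is about r^4/24 ≈ 4e-10, so the crossover keeps the approximation error far below typical measurement noise; at r = 1 the error is a few percent, which is why the code switches to the exact form there.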
A next step I'd take is to tear the 8 pieces of this equation apart. There is positive/negative, the first and second value+= in each branch, and then the polynomial/exponential case.
I'm guessing that exp taking about half the time of pow simply reflects pow being the more expensive call (each invocation of this function does at most one of each), but you might find out something interesting like "all of our time is actually in the Tterm > 0. case" or what have you.
Now examine call sites. Is there a pattern in the Tterm values you are passing to this function? I.e., do you tend to pass Tterms in roughly sorted order? If so, you can do the test for which branch to take outside of this function, and do it in batches.
Simply doing it in batches and compiling with optimization and inlining the bodies of the functions might make a surprising amount of difference; compilers are getting better at vectorizing work.
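To make the batching idea concrete, here is a sketch (the function name, and the assumption that the caller has already sorted every element into the "after peak, large Tterm" branch, are mine) of a hoisted loop the compiler or a vector math library can chew on:

```cpp
#include <cmath>
#include <cstddef>

// Sketch: evaluate the "after peak, large Tterm" case for a whole batch.
// Assumes the caller has already partitioned tterms so every element takes
// this branch; par[] is one of the two fixed presets.
void batch_after_peak(const double* tterms, double* out,
                      std::size_t n, const double* par) {
    const double inv_p0 = 1.0 / par[0];  // loop invariants hoisted once
    const double p1 = par[1], p2 = par[2];
    for (std::size_t i = 0; i < n; ++i) {
        double t = tterms[i];
        double v = -inv_p0 * std::exp(-t * inv_p0)
                 + p2 * p1 * std::pow(1.0 + p1 * t, p2 - 1.0);
        out[i] = v * 1.e9;
    }
}
```

With the branches gone and the divisions hoisted, the loop body is straight-line math, which is exactly what auto-vectorizers want.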
If that doesn't work, you can start threading things off. On a modern computer you can have 4-60 threads solving this problem independently, and this problem looks like you'd get nearly linear speedup. A basic threading library, like TBB, would be good for this kind of task.
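A minimal sketch of the threading idea using only std::thread (TBB would look similar but tidier; the helper name is mine):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: split a batch of independent evaluations across threads.
// `eval_range` stands in for whatever per-range function you settle on;
// each thread works on a disjoint [begin, end) slice, so no locking needed.
template <typename Fn>
void parallel_for_batches(std::size_t n, unsigned nthreads, Fn eval_range) {
    std::vector<std::thread> pool;
    std::size_t chunk = (n + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = begin + chunk < n ? begin + chunk : n;
        if (begin >= end) break;
        pool.emplace_back(eval_range, begin, end);
    }
    for (auto& th : pool) th.join();
}
```

Since every evaluation is independent, this kind of embarrassingly parallel split is where the nearly linear speedup comes from.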
For the next step up, if you are getting large batches of data and you need to do a lot of processing, you can stuff it onto a GPU and solve it there. Sadly, GPU<->RAM bandwidth is limited, so simply doing the math in this function on the GPU and reading/writing back and forth with RAM won't give you much if any performance gain. But if more work than just this can go on the GPU, it might be worth it.

The asker notes: "only 10% of the total time is used by the parts of this function that aren't pow or exp."
If your performance bottleneck is the exp() and pow() calls, consider using vector instructions in your calculations. All modern processors support at least the SSE2 instruction set, so this approach should give at least a ~2x speed-up, because your calculation is easy to vectorize.
I recommend this C++ vectorization library, which contains all the standard mathematical functions (such as exp and pow) and lets you write code in an OOP style without resorting to assembly. I have used it several times and it should work well for your problem.
If you have a GPU, you should also consider trying the CUDA framework because, again, your problem vectorizes perfectly. Moreover, if this function is called 477+ million times, a GPU will practically eliminate your problem...

(Partial optimization:)
The longest expression has common subexpressions, and its polynomial is evaluated the costly way.
Pre-define these (perhaps add them to par[]):
a = par[5]*par[4];
b = (par[5]-1.);
c = b*(par[5]-2.)/2.;
d = c*(par[5]-3.)/3.;
Then, for example, the longest expression becomes:
e = par[4]*Tterm;
value += a*(((d*e + c)*e + b)*e + 1.);
And simplify the rest.
If the expressions are curve-fitting approximations, why not do the same with
value += -1./(*par)*exp(-1.*Tterm/(*par));
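For the exp term the same constant-folding applies; a sketch (the struct and names are mine) that removes the division entirely by precomputing -1/par[0] once per preset:

```cpp
#include <cmath>

// Sketch: a constant derived once per preset, so the exp branch costs
// one exp and two multiplies, with no divisions in the hot path.
struct PresetConsts {
    double neg_inv_p0;  // -1 / par[0]
};

// Algebraically equal to -1./par[0] * exp(-Tterm/par[0]):
// with k = -1/par[0], that is k * exp(Tterm * k).
inline double exp_term(double tterm, const PresetConsts& k) {
    return k.neg_inv_p0 * std::exp(tterm * k.neg_inv_p0);
}
```

Since par can only be one of two presets, both PresetConsts values can be computed once at startup.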
You should also ask whether all 477M iterations are needed.

If you want to explore batching / more optimization opportunities for fusing in computations that depend on these values, try using Halide
I've rewritten your program in Halide here:
#include <Halide.h>
using namespace Halide;

class ParamReTtermApproxOpt : public Generator<ParamReTtermApproxOpt>
{
public:
    Input<Buffer<float>> tterm{"tterm", 1};
    Input<Buffer<float>> par{"par", 1};
    Input<int> ncpu{"ncpu"};
    Output<Buffer<float>> output{"output", 1};

    Var x;
    Func par_inv;

    void generate() {
        // precompute 1 / par[x]
        par_inv(x) = fast_inverse(par(x));
        // after che peak
        Expr after_che_peak = tterm(x) > 0;
        Expr first_term = -par_inv(0) * fast_exp(-tterm(x) * par_inv(0));
        Expr second_term = par(2) * par(1) * fast_pow(1 + par(1) * tterm(x), par(2) - 1);
        // before che peak
        Expr third_term = -par_inv(3) * fast_exp(-tterm(x) * par_inv(3));
        Expr fourth_term = par(5) * par(4) * fast_pow(1 + par(4) * tterm(x), par(5) - 1);
        // final value
        output(x) = 1.e9f * select(after_che_peak, first_term + second_term,
                                   third_term + fourth_term);
    }

    void schedule() {
        par_inv.bound(x, 0, 6);
        par_inv.compute_root();

        Var xo, xi;
        // break x into two loops, one for ncpu tasks
        output.split(x, xo, xi, output.extent() / ncpu)
              // mark the task loop parallel
              .parallel(xo)
              // vectorize each thread's computation for 8-wide vector lanes
              .vectorize(xi, 8);
        output.print_loop_nest();
    }
};
HALIDE_REGISTER_GENERATOR(ParamReTtermApproxOpt, param_re_tterm_approx_opt)
I can run 477,000,000 iterations in slightly over one second on my Surface Book (with ncpu=4). Batching is hugely important here since it enables vectorization.
Note that the equivalent program written using double arithmetic is much slower (20x) than float arithmetic. Though Halide doesn't supply fast_ versions for doubles, so this might not be quite apples-to-apples. Regardless, I would check whether you need the extra precision.

How is it possible for some code to take more time to run given the same inputs seemingly just because it's in a loop?

Prelude/Context: I've just started learning c++ and decided to write up some code that would apply a single qubit gate to a quantum register where the register is held in an array called amplitudes and the four elements of the single qubit gate are a,b,c,d. I've tried to write a version that avoids an if statement that appeared in my first pass and to my initial delight, it seemed to have a slight performance enhancement (~10%). If I change the number of qubits in the register or which qubit I target with the gate, I get a similar result. I then tried to make a loop that would perform timing comparisons for a various target qubits and something very strange (to me at least) happened. The alternative function I wrote that avoids the if statement doubled its execution time (from ~0.23 to 0.46 seconds) whereas the function with the if statement had its execution time unaffected (~0.25 seconds). This leads me to my question:
How can code that, when given the same inputs in either case, take longer to execute inside of a loop that iterates those inputs?
For example, if I run a test giving 25 qubits and target qubit 1, the "no if" function wins. Then, if I write a while loop to do a comparison at 25 qubits for each value of target starting at 1, the "no if" function takes double the time to execute even on the first iteration when it receives identical input to the prior case. Interestingly, if I just include the while loop and make it an infinite while loop by putting "True" in the while statement or by commenting out the increment statement target+=1, the function no longer takes double time. This phenomenon requires the loop and the increment from what I can tell.
Code below in case this is a simple coding error in a new language I'm less familiar with. I'm using Visual Studio 2017 Community Edition with all default settings, except that I'm using the "release" build for faster code execution. Commenting out the while statement and the corresponding closing curly brace makes the "no if" timing double.
#include "stdafx.h"
#include <iostream>
#include <time.h>
#include <complex>
void matmulpnoif(std::complex<float> arr[], std::complex<float> out[], int numqbits, std::complex<float> a,
    std::complex<float> b, std::complex<float> c, std::complex<float> d, int target)
{
    long length = 1 << (numqbits);
    long offset = 1 << (target - 1);
    long state = 0;
    while (state < length)
    {
        out[state] = arr[state] * a + arr[state + offset] * b;
        out[state + offset] = arr[state] * c + arr[state + offset] * d;
        state += 1 + offset * (((state % offset) + 1) / offset);
    }
}
void matmulpsingle(std::complex<float> arr[], std::complex<float> out[], int numqbits, std::complex<float> a,
    std::complex<float> b, std::complex<float> c, std::complex<float> d, int target)
{
    long length = 1 << (numqbits);
    int shift = target - 1;
    long offset = 1 << shift;
    for (long state = 0; state < length; ++state)
    {
        if ((state >> shift) & 1)
        {
            out[state] = arr[state - offset] * c + arr[state] * d;
        }
        else
        {
            out[state] = arr[state] * a + arr[state + offset] * b;
        }
    }
}
int main()
{
    using namespace std;
    int numqbits = 25;
    long arraylength = 1 << numqbits;
    complex<float>* amplitudes = new complex<float>[arraylength];
    for (long i = 0; i < arraylength; ++i)
    {
        amplitudes[i] = complex<float>(0., 0.);
    }
    amplitudes[0] = complex<float>(1., 0.);
    complex<float> a(0., 0.);
    complex<float> b(1., 0.);
    complex<float> c(0., 0.);
    complex<float> d(1., 0.);
    int target = 1;
    int repititions = 10;
    clock_t startTime;
    //while (target <= numqbits) {
    startTime = clock();
    for (int j = 0; j < repititions; ++j) {
        complex<float>* outputs = new complex<float>[arraylength];
        matmulpsingle(amplitudes, outputs, numqbits, a, b, c, d, target);
        delete[] outputs;
    }
    cout << float(clock() - startTime) / (float)(CLOCKS_PER_SEC*repititions) << " seconds." << endl;
    startTime = clock();
    for (int k = 0; k < repititions; ++k) {
        complex<float>* outputs = new complex<float>[arraylength];
        matmulpnoif(amplitudes, outputs, numqbits, a, b, c, d, target);
        delete[] outputs;
    }
    cout << float(clock() - startTime) / (float)(CLOCKS_PER_SEC*repititions) << " seconds." << endl;
    target += 1;
    //}
    delete[] amplitudes;
    return 0;
}
Unfortunately, I cannot yet post comments, so I'll post this here even though it may not be a complete answer.
In general, the question you pose is difficult. The compiler performs optimisations, and the two cases are different code so they get optimised differently.
On my machine, for instance (Linux, GCC 7.3.1), with only -O3 enabled, matmulpnoif is always faster (4.8s vs 2.4s or 4.8s vs 4.2s, depending on whether the loop is there or not; these times are not measured with clock()). If I had to guess what happens in this case, the compiler might realise that offset is always one and optimise the remainder operation away (division is by far the most expensive operation you have in there). However, it could be a combination of other things as well.
Another thing to note: clock() should NOT be used to measure wall-clock time. It counts processor clock ticks across all threads, so if you parallelise the code across 2 threads the reported number will be roughly twice the elapsed time (assuming your code doesn't wait anywhere - which does not appear to be the case on my machine). If you wish to measure time, I suggest you look at <chrono>; the high_resolution_clock should do the trick.
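For example, a minimal <chrono>-based interval timer (the helper name is mine; steady_clock is used because it is guaranteed monotonic):

```cpp
#include <chrono>

// Sketch: wall-clock timing with <chrono> instead of clock().
// steady_clock never jumps backwards, which is what you want for intervals.
template <typename Fn>
double seconds_elapsed(Fn&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Usage: double s = seconds_elapsed([&]{ matmulpnoif(/*...*/); }); — the result is in seconds regardless of thread count.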
Another side note: there is no need to keep allocating and deallocating the output array; you can simply reuse one and waste less time. But above all, if you're using C++ I suggest you put all of this in a class; as it is, you are passing many parameters to each function, which can make things both harder to read and slower if you pass a lot of data (as it gets copied).
And a second note: since you are using bit shifts, it might be safer to use unsigned variables, as the right shift >> does not have a strict definition of what it pads with for signed variables. At the very least it's something to keep in mind; it might be padding 1s on that side.
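A tiny illustration (note: before C++20 the behaviour of >> on negative signed values is implementation-defined; on typical two's-complement targets it sign-extends, i.e. pads with 1s):

```cpp
#include <cstdint>

// On common platforms, right-shifting a negative signed value pads with 1s
// (arithmetic shift); unsigned always pads with 0s (logical shift).
inline std::int32_t shift_signed(std::int32_t v, int by) { return v >> by; }
inline std::uint32_t shift_unsigned(std::uint32_t v, int by) { return v >> by; }
```

So -16 >> 2 typically yields -4, while the same bit pattern shifted as unsigned yields a large positive number.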

Would a pre-calculated variable faster than calculating it every time in a loop?

In a function that updates all particles I have the following code:
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= _decayRate * deltaTime;
    }
}
This decreases the lifetime of the particle based on the time that passed.
The product gets recalculated every iteration, so with 10000 particles that wouldn't be very efficient, because it doesn't need to be: _decayRate * deltaTime doesn't change during the loop.
So I came up with this:
float lifeMin = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= lifeMin;
    }
}
This calculates the product once and stores it in a variable that is read every iteration, so the CPU doesn't have to recalculate it each time, which should in theory improve performance.
Would it run faster than the old code? Or does the release compiler do optimizations like this?
I wrote a program that compares both methods:
#include <time.h>
#include <iostream>

const unsigned int MAX = 1000000000;

int main()
{
    float deltaTime = 20;
    float decayRate = 200;
    float foo = 2041.234f;
    unsigned int start = clock();
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= decayRate * deltaTime;
    }
    std::cout << "Method 1 took " << clock() - start << "ms\n";
    start = clock();
    float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= calced;
    }
    std::cout << "Method 2 took " << clock() - start << "ms\n";
    int n;
    std::cin >> n;
    return 0;
}
Result in debug mode:
Method 1 took 2470ms
Method 2 took 2410ms
Result in release mode:
Method 1 took 0ms
Method 2 took 0ms
But that doesn't work. I know it doesn't do exactly the same, but it gives an idea.
In debug mode, they take roughly the same time. Sometimes Method 1 is faster than Method 2(especially at fewer numbers), sometimes Method 2 is faster.
In release mode, it takes 0 ms. A little weird.
I tried measuring it in the game itself, but there aren't enough particles to get a clear result.
EDIT
I tried to disable optimizations, and let the variables be user inputs using std::cin.
Here are the results:
Method 1 took 2430ms
Method 2 took 2410ms
It will almost certainly make no difference whatsoever, at least if
you compile with optimization (and of course, if you're concerned with
performance, you are compiling with optimization). The optimization in
question is called loop-invariant code motion, and is universally
implemented (and has been for about 40 years).
On the other hand, it may make sense to use the separate variable
anyway, to make the code clearer. This depends on the application, but
in many cases, giving a name to the results of an expression can make
code clearer. (In other cases, of course, throwing in a lot of extra
variables can make it less clear. It all depends on the application.)
In any case, for such things, write the code as clearly as possible
first, and then, if (and only if) there is a performance problem,
profile to see where it is, and fix that.
EDIT:
Just to be perfectly clear: I'm talking about this sort of code optimization in general. In the exact case you show, since you don't use foo, the compiler will probably remove it (and the loops) completely.
In theory, yes. But your loop is extremely simple and thus likely to be heavily optimized.
Try the -O0 option to disable all compiler optimizations.
The 0 ms release-mode result is likely caused by the compiler computing the result at compile time.
I am pretty confident that any decent compiler will effectively reduce your loops to the following (especially once fast-math-style optimizations are allowed):
foo -= MAX * decayRate * deltaTime;
and
foo -= MAX * calced ;
You can make MAX depend on some kind of run-time input (e.g. a command-line parameter) to avoid that.

large loop for timing tests gets somehow optimized to nothing?

I am trying to test a series of libraries for matrix-vector computations. For that I just make a large loop and inside I call the routine I want to time. Very simple. However, I sometimes see that when I increase the level of optimization for the compiler, the time drops to zero no matter how large the loop is. See the example below, where I try to time a C macro that computes cross products. What is the compiler doing, and how can I avoid it while still allowing maximum optimization for floating-point arithmetic? Thanks in advance.
The example below was compiled using g++ 4.7.2 on a computer with an i5 intel processor.
Using optimization level 1 (-O1) it takes 0.35 seconds. For level two or higher it drops down to zero. Remember, I want to time this so I want the computations to actually happen even if, for this simple test, unnecessary.
#include <iostream>
#include <ctime>
using namespace std;

typedef double Vector[3];

#define VecCross(A,assign_op,B,dummy_op,C) \
( A[0] assign_op (B[1] * C[2]) - (B[2] * C[1]), \
  A[1] assign_op (B[2] * C[0]) - (B[0] * C[2]), \
  A[2] assign_op (B[0] * C[1]) - (B[1] * C[0]) \
)

double get_time(){
    return clock()/(double)CLOCKS_PER_SEC;
}

int main()
{
    unsigned long n = 1000000000u;
    double start;
    { //C macro cross product
        Vector u = {1,0,0};
        Vector v = {1,1,0};
        Vector w = {1.2,1.2,1.2};
        start = get_time();
        for(unsigned long i=0;i<n;i++){
            VecCross (w, =, u, X, v);
        }
        cout << "C macro cross product: " << get_time()-start << endl;
    }
    return 0;
}
Ask yourself, what does your program actually do, in terms of what is visible to the end-user?
It displays the result of a calculation: get_time()-start. The contents of your loop have no bearing on the outcome of that calculation, because you never actually use the variables being modified inside the loop.
Therefore, the compiler optimises out the entire loop since it is irrelevant.
One solution is to output the final state of the variables being modified in the loop, as part of your cout statement, thus forcing the compiler to compute the loop. However, a smart compiler could also figure out that the loop always calculates the same thing, and it can simply insert the result directly into your cout statement, because there's no need to actually calculate it at run-time. As a workaround to this, you could for example require that one of the inputs to the loop be provided at run-time (e.g. read it in from a file, command line argument, cin, etc.).
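A sketch of the simplest workaround (names are mine): write each result to a volatile sink, which counts as an observable side effect, so the compiler must keep the loop:

```cpp
#include <ctime>

// Sketch: a volatile sink forces the compiler to keep the stores,
// so the timed loop can't be removed entirely.
volatile double g_sink;

double time_loop(unsigned long n) {
    double acc = 0.0;
    std::clock_t start = std::clock();
    for (unsigned long i = 0; i < n; ++i) {
        acc += static_cast<double>(i) * 1e-9;
        g_sink = acc;  // observable side effect per iteration
    }
    return (std::clock() - start) / static_cast<double>(CLOCKS_PER_SEC);
}
```

Note the volatile store itself has a cost, so this slightly perturbs what you measure; reading the loop inputs at run time (as suggested above) perturbs it less.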
For more (and possibly better) solutions, check out this duplicate thread: Force compiler to not optimize side-effect-less statements

Bad optimization of std::fabs()?

Recently i was working with an application that had code similar to:
for (auto x = 0; x < width - 1 - left; ++x)
{
    // store / reset points
    temp = hPoint = 0;
    for (int channel = 0; channel < audioData.size(); channel++)
    {
        if (peakmode) /* fir rms of window size */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp += audioData[channel][x * sizeFactor + z + offset];
            }
            hPoint += temp / sizeFactor;
        }
        else /* highest sample in window */
        {
            for (int z = 0; z < sizeFactor; z++)
            {
                temp = audioData[channel][x * sizeFactor + z + offset];
                if (std::fabs(temp) > std::fabs(hPoint))
                    hPoint = temp;
            }
        }
        .. some other code
    }
    ... some more code
}
This is inside a graphical render loop, called some 50-100 times / sec with buffers up to 192kHz in multiple channels. So it's a lot of data running through the innermost loops, and profiling showed this was a hotspot.
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries. It looked something like this:
if ((const float &&)(*((int *)&temp) & ~0x80000000) > (const float &&)(*((int *)&hPoint) & ~0x80000000))
hPoint = temp;
This gave a 12x reduction in render time, while still producing the same, valid output. Note that everything in the audiodata is sanitized beforehand to not include nans/infs/denormals, and only have a range of [-1, 1].
Are there any corner cases where this optimization will give wrong results - or, why is the standard library function not implemented like this? I presume it has to do with handling of non-normal values?
e: the layout of the floating point model is conforming to ieee, and sizeof(float) == sizeof(int) == 4
Well, you say the floating-point model conforms to IEEE. Typically, with switches like -ffast-math the compiler is allowed to ignore IEEE corner cases like NaN, INF and denormals. If the compiler also uses intrinsics, it can probably emit the same code as your trick.
BTW, if you're going to assume IEEE format, there's no need for the cast back to float prior to the comparison. The IEEE format is nifty: for all positive finite values, a<b if and only if reinterpret_cast<int_type>(a) < reinterpret_cast<int_type>(b)
It occurred to me that one could cast the float to an integer and erase the sign bit, and cast it back using only temporaries.
No, you can't, because this violates the strict aliasing rule.
Are there any corner cases where this optimization will give wrong results
Technically, this code results in undefined behavior, so it always gives wrong "results". Not in the sense that the result of the absolute value will always be unexpected or incorrect, but in the sense that you can't possibly reason about what a program does if it has undefined behavior.
or, why is the standard library function not implemented like this?
Your suspicion is justified, handling denormals and other exceptional values is tricky, the stdlib function also needs to take those into account, and the other reason is still the undefined behavior.
One (non-)solution if you care about performance:
Instead of casting through pointers, you can use a union. Unfortunately, that is only well-defined in C, not in C++. It won't result in UB there, but it's still not portable in principle (although it will likely work on most, if not all, platforms with IEEE-754).
union {
    float f;
    unsigned u;
} pun = { .f = -3.14 };

pun.u &= ~0x80000000;
printf("abs(-pi) = %f\n", pun.f);
But, granted, this may or may not be faster than calling fabs(). Only one thing is sure: it won't be always correct.
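A C++-friendly alternative that avoids the UB entirely is std::memcpy (or std::bit_cast in C++20); compilers recognise the idiom and compile it down to the same bit operations:

```cpp
#include <cstdint>
#include <cstring>

// The same bit trick without UB: std::memcpy is the blessed way to move
// bits between float and an integer, and optimizers eliminate the copies.
inline float abs_bits(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);  // well-defined type pun
    u &= 0x7fffffffu;               // clear the sign bit
    std::memcpy(&x, &u, sizeof x);
    return x;
}
```

In C++20, return std::bit_cast<float>(std::bit_cast<std::uint32_t>(x) & 0x7fffffffu); does the same with less ceremony. This still assumes IEEE-754 binary32 with sizeof(float) == 4, as the question states.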
You would expect fabs() to be implemented in hardware. There was an 8087 instruction for it in 1980 after all. You're not going to beat the hardware.
How the standard library function implements it is .... implementation dependent. So you may find different implementation of the standard library with different performance.
I imagine that you could have problems on platforms where int is not 32 bits. You'd better use int32_t (<cstdint>).
Out of curiosity: was std::abs previously inlined? Or is the optimisation you observed mainly due to suppressing the function call?
Some observations on how refactoring may improve performance:
as mentioned, x * sizeFactor + offset can be factored out of the inner loops
peakmode is actually a switch changing the function's behaviour - make two functions rather than test the switch mid-loop. This has 2 benefits:
easier to maintain
fewer local variables and code paths to get in the way of optimisation.
The division of temp by sizeFactor can be deferred until outside the channel loop in the peakmode version.
abs(hPoint) can be pre-computed whenever hPoint is updated
if audioData is a vector of vectors you may get some performance benefit by taking a reference to audioData[channel] at the start of the body of the channel loop, reducing the array indexing within the z loop down to one dimension.
finally, apply whatever specific optimisations for the calculation of fabs you deem fit. Anything you do here will hurt portability so it's a last resort.
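A sketch of what the peak-mode ("rms") path might look like with those suggestions applied (the function name and signature are my own invention, adapted from the snippet above):

```cpp
#include <vector>

// Sketch of the peak-mode path with the suggestions applied:
// mode switch hoisted out of the loop, base index factored out,
// per-channel reference taken, division deferred to once per column.
double peak_mode_column(const std::vector<std::vector<float>>& audioData,
                        int x, int sizeFactor, int offset) {
    const int base = x * sizeFactor + offset;  // factored out of inner loop
    double hPoint = 0.0;
    for (const auto& chan : audioData) {       // one reference per channel
        double temp = 0.0;
        for (int z = 0; z < sizeFactor; ++z)
            temp += chan[base + z];
        hPoint += temp;                        // defer the division...
    }
    return hPoint / sizeFactor;                // ...to once per column
}
```

The highest-sample path would get its own function the same way, with std::fabs(hPoint) cached across iterations.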
In VS2008, using the following to track the absolute value of hpoint and hIsNeg to remember whether it is positive or negative is about twice as fast as using fabs():
int hIsNeg = 0;
...
//Inside loop, replacing
//  if (std::fabs(temp) > std::fabs(hPoint))
//      hPoint = temp;
if( temp < 0 )
{
    if( -temp > hpoint )
    {
        hpoint = -temp;
        hIsNeg = 1;
    }
}
else
{
    if( temp > hpoint )
    {
        hpoint = temp;
        hIsNeg = 0;
    }
}
...
//After loop
if( hIsNeg )
    hpoint = -hpoint;

Benchmarking math.h square root and Quake square root

Okay, so I was bored and wondered how fast the math.h square root was in comparison to the one with the magic number in it (made famous by Quake but made by SGI).
But this has ended up in a world of hurt for me.
I first tried this on the Mac, where math.h would win hands down every time, then on Windows, where the magic number always won, but I think this is all down to my own noobness.
Compiling on the Mac with "g++ -o sq_root sq_root_test.cpp" when the program ran it takes about 15 seconds to complete. But compiling in VS2005 on release takes a split second. (in fact I had to compile in debug just to get it to show some numbers)
My poor man's benchmarking? Is this really stupid? Because I get 0.01 for math.h and 0 for the magic number. (It can't be that fast, can it?)
I don't know if this matters but the Mac is Intel and the PC is AMD. Is the Mac using hardware for math.h sqroot?
I got the fast square root algorithm from http://en.wikipedia.org/wiki/Fast_inverse_square_root
//sq_root_test.cpp
#include <iostream>
#include <math.h>
#include <ctime>

float invSqrt(float x)
{
    union {
        float f;
        int i;
    } tmp;
    tmp.f = x;
    tmp.i = 0x5f3759df - (tmp.i >> 1);
    float y = tmp.f;
    return y * (1.5f - 0.5f * x * y * y);
}

int main() {
    std::clock_t start;// = std::clock();
    std::clock_t end;
    float rootMe;
    int iterations = 999999999;

    // ---
    rootMe = 2.0f;
    start = std::clock();
    std::cout << "Math.h SqRoot: ";
    for (int m = 0; m < iterations; m++) {
        (float)(1.0/sqrt(rootMe));
        rootMe++;
    }
    end = std::clock();
    std::cout << (difftime(end, start)) << std::endl;

    // ---
    std::cout << "Quake SqRoot: ";
    rootMe = 2.0f;
    start = std::clock();
    for (int q = 0; q < iterations; q++) {
        invSqrt(rootMe);
        rootMe++;
    }
    end = std::clock();
    std::cout << (difftime(end, start)) << std::endl;
}
There are several problems with your benchmarks. First, your benchmark includes potentially expensive datatype conversions (sqrt here takes and returns double, so the float argument and result get converted each iteration). If you want to know what a square root costs, you should benchmark square roots, not datatype conversions.
Second, your entire benchmark can be (and is) optimized out by the compiler because it has no observable side effects. You don't use the returned value (or store it in a volatile memory location), so the compiler sees that it can skip the whole thing.
A clue here is that you had to disable optimizations. That means your benchmarking code is broken. Never ever disable optimizations when benchmarking. You want to know which version runs fastest, so you should test it under the conditions it'd actually be used under. If you were to use square roots in performance-sensitive code, you'd enable optimizations, so how it behaves without optimizations is completely irrelevant.
Also, you're not benchmarking the cost of computing a square root, but of the inverse square root.
If you want to know which way of computing the square root is fastest, you have to move the 1.0/... division down to the Quake version. (And since division is a pretty expensive operation, this might make a big difference in your results)
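A sketch of a like-for-like comparison (keeping the question's union-based invSqrt, which is itself technically UB in C++): both sides now compute a square root rather than an inverse square root, and the results are accumulated and returned so the compiler cannot discard the loops:

```cpp
#include <cmath>

// The question's approximation, unchanged (union punning works in practice
// on IEEE-754 platforms, though it is formally UB in C++).
float invSqrtQuake(float x)
{
    union { float f; int i; } tmp;
    tmp.f = x;
    tmp.i = 0x5f3759df - (tmp.i >> 1);
    float y = tmp.f;
    return y * (1.5f - 0.5f * x * y * y);
}

// Move the division to the Quake side so both compute sqrt, not 1/sqrt.
float quakeSqrt(float x) { return 1.0f / invSqrtQuake(x); }

// Accumulate results so the loop has an observable effect; print or
// return the checksum. Swap in std::sqrt(x) to time the other side.
float checksum_loop(int iterations) {
    float acc = 0.0f, x = 2.0f;
    for (int i = 0; i < iterations; ++i) {
        acc += quakeSqrt(x);
        x += 1.0f;
    }
    return acc;
}
```

With the result consumed and optimizations on, the timings finally measure what each loop actually computes.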
Finally, it might be worth pointing out that Carmack's little trick was designed to be fast on 12-year-old computers. Once you fix your benchmark, you'll probably find that it's no longer an optimization, because today's CPUs are much faster at computing "real" square roots.