Calculating GFlops - c++

I'm wondering how I would calculate GFlops for a program of mine, for example a CUDA application.
Do I need to measure the execution time and the number of floating point operations in my code? If I had an operation like "logf", would it count for just one flop?

The number of ACTUAL floating point operations depends on exactly how the code is written, because compilers can optimize in both directions. They can merge common operations: c = (a * 4.0 + b * 4.0); can become c = (a + b) * 4.0, which is one operation fewer than what you wrote. But the compiler can also convert something into MORE operations:
c = a / b;
may turn into:
temp = 1 / b;
c = temp * a;
(This is because 1/x is "simpler" than y/x, and multiplication is faster than division.)
As mentioned in the comments, some floating point operations (log, sin, cos, etc.) take more than one, often more than ten, operations to produce the result.
Another factor to take into account is "loads" and "stores". These can be quite hard to predict, as they are highly dependent on the compiler's code generation, the number of registers available to the compiler at a given point, and so on. Whether loads and stores actually count or not depends on how you look at things, but they certainly count towards the total execution time. If there is a lot of data to work through, but each step is really simple (e.g. c = a + b where a, b and c are vectors), the time to fetch data from memory is significantly longer than the execution time of the add. On the other hand, c = log(a) + log(b); would almost certainly "hide" the time to load and store the results, because log itself takes a lot longer than the load or store operations.

Related

Why is vectorization not beneficial in this for loop?

I am trying to vectorize this for loop. After using the Rpass flag, I am getting the following remark for it:
int someOuterVariable = 0;
for (unsigned int i = 7; i != -1; i--)
{
array[someOuterVariable + i] -= 0.3 * anotherArray[i];
}
Remark:
The cost-model indicates that vectorization is not beneficial
the cost-model indicates that interleaving is not beneficial
I want to understand what this means. Does "interleaving is not beneficial" mean the array indexing is not proper?
It's hard to answer without more details about your types. But in general, starting a loop incurs some costs and vectorising also implies some costs (such as moving data to/from SIMD registers, ensuring proper alignment of data).
I'm guessing here that the compiler is telling you that the cost of vectorisation is bigger than simply running the 8 iterations without it, so it's not doing it.
Try increasing the number of iterations, or help the compiler, for example by letting it work out the alignment.
Typically, unless the items of the array have exactly the right alignment for a SIMD vector, accessing an array from an "unknown" offset (what you've called someOuterVariable) prevents the compiler from generating efficient vectorised code.
EDIT: About the "interleaving" question, it's hard to guess without knowing your tool. But in general, interleaving means mixing 2 streams of computation so that the compute units of the CPU are all busy. For example, if you have 2 ALUs in your CPU, and the program is doing:
c = a + b;
d = e * f;
The compiler can interleave the computations so that the addition and the multiplication happen at the same time (provided you have 2 ALUs available). Typically, this means that the multiplication, which takes a bit longer to compute (for example 6 cycles), will be started before the addition (for example 3 cycles). You'll then get the results of both operations after only 6 cycles, instead of 9 if the compiler serialized the computations. This is only possible if there are no dependencies between the computations (if d required c, it would not work). A compiler is very cautious about this and, in your example, will not apply this optimization if it can't prove that array and anotherArray don't alias.

Performance of duplicate computations

I wonder whether it's worth doing a computation once and storing the result, or whether it's faster to do the computation twice?
For example in this case:
float n1 = a - b;
float n2 = a + b;
float result = n1 * n2 / (n1 * n2);
Is it better to do:
float result = (a - b) * (a + b) / ((a - b) * (a + b));
?
I know that normally we store the result, but I wonder whether it's faster to redo the addition than to go to memory to store and retrieve the value.
It really depends: For trivial examples like yours, it does not matter. The compiler will generate the same code, since it finds the common sub-expressions and eliminates the duplicated calculations.
For more complicated examples, for example involving function calls, you are better off using the first variant, to "store" intermediate results. Do not worry about using simple variables for intermediate storage. These are usually all kept in CPU registers, and the compiler is quite good at keeping values in registers.
The danger is that with more complex calculations the compiler may fail to do the common sub-expression elimination. This is for example the case when your code contains function calls, which act like a compiler boundary.
Another topic is that with floating point, even simple operations like addition are not associative, i.e. (a+b)+c is different from a+(b+c), due to artifacts in the lowest bits. This often also prevents common subexpression elimination, since the compiler is not allowed to change the semantics of your code.
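A small sketch of that non-associativity, with values picked by me so the effect is large rather than hidden in the lowest bits. It assumes IEEE 754 doubles and no -ffast-math style flags; the function names are just for illustration.

```cpp
// Same three operands, different grouping.
double sum_left(double a, double b, double c) {
    return (a + b) + c;
}

double sum_right(double a, double b, double c) {
    return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, sum_left gives 1.0 but sum_right gives 0.0, because 1.0 is far below the spacing between doubles near 1e20 and is lost when added to b first.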
Dividing the expression into smaller expressions and giving them sensible names gives you several benefits:
It decreases cognitive load.
The longer expression becomes easier to understand and to verify as correct.
Each line of code can be shorter, which makes it easier to read and to adhere to coding standards.
In C++ a temporary variable can also be marked const, which also allows the compiler to better optimize the expressions.
But optimizations should be measured before they are discussed and used as arguments. Fast usually comes from the choice of data structures and used algorithms.
In general code should be written to be understood and be correct, and only then should it be optimized.
const float difference = a - b;
const float sum = a + b;
const float result = difference * sum / (difference * sum);

Which approximation algorithm is used for sin() by compilers? [duplicate]

I've been poring through .NET disassemblies and the GCC source code, but can't seem to find anywhere the actual implementation of sin() and other math functions... they always seem to be referencing something else.
Can anyone help me find them? I feel like it's unlikely that ALL hardware that C will run on supports trig functions in hardware, so there must be a software algorithm somewhere, right?
I'm aware of several ways that functions can be calculated, and have written my own routines to compute functions using taylor series for fun. I'm curious about how real, production languages do it, since all of my implementations are always several orders of magnitude slower, even though I think my algorithms are pretty clever (obviously they're not).
In GNU libm, the implementation of sin is system-dependent. Therefore you can find the implementation, for each platform, somewhere in the appropriate subdirectory of sysdeps.
One directory includes an implementation in C, contributed by IBM. Since October 2011, this is the code that actually runs when you call sin() on a typical x86-64 Linux system. It is apparently faster than the fsin assembly instruction. Source code: sysdeps/ieee754/dbl-64/s_sin.c, look for __sin (double x).
This code is very complex. No one software algorithm is as fast as possible and also accurate over the whole range of x values, so the library implements several different algorithms, and its first job is to look at x and decide which algorithm to use.
When x is very very close to 0, sin(x) == x is the right answer.
A bit further out, sin(x) uses the familiar Taylor series. However, this is only accurate near 0, so...
When the angle is more than about 7°, a different algorithm is used, computing Taylor-series approximations for both sin(x) and cos(x), then using values from a precomputed table to refine the approximation.
When |x| > 2, none of the above algorithms would work, so the code starts by computing some value closer to 0 that can be fed to sin or cos instead.
There's yet another branch to deal with x being a NaN or infinity.
This code uses some numerical hacks I've never seen before, though for all I know they might be well-known among floating-point experts. Sometimes a few lines of code would take several paragraphs to explain. For example, these two lines
double t = (x * hpinv + toint);
double xn = t - toint;
are used (sometimes) in reducing x to a value close to 0 that differs from x by a multiple of π/2, specifically xn × π/2. The way this is done without division or branching is rather clever. But there's no comment at all!
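Here is my reading of what those two lines do, as a standalone sketch. The constants are assumptions on my part (hpinv taken as roughly 2/π and toint as 1.5 × 2^52), not copied from the glibc source, and the function name is invented. It relies on the default round-to-nearest mode and on the compiler not simplifying the arithmetic, hence the volatile.

```cpp
// Rounds x * (2/pi) to the nearest integer without a division or a branch.
// Adding 1.5 * 2^52 pushes the fractional bits off the end of the mantissa,
// so the hardware rounding does the work; subtracting it again leaves the
// rounded integer. Only valid while |x * hpinv| is well below 2^51.
double nearest_quadrant_count(double x) {
    const double hpinv = 0.63661977236758134308;  // assumed: about 2/pi
    const double toint = 6755399441055744.0;      // assumed: 1.5 * 2^52
    volatile double t = x * hpinv + toint;  // volatile keeps the add from being folded away
    return t - toint;
}
```

For x = 10.0 the product is about 6.37, so the function returns 6.0, and x - 6 × π/2 is then the reduced argument near 0.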
Older 32-bit versions of GCC/glibc used the fsin instruction, which is surprisingly inaccurate for some inputs. There's a fascinating blog post illustrating this with just 2 lines of code.
fdlibm's implementation of sin in pure C is much simpler than glibc's and is nicely commented. Source code: fdlibm/s_sin.c and fdlibm/k_sin.c
Functions like sine and cosine are implemented in microcode inside microprocessors. Intel chips, for example, have assembly instructions for these. A C compiler will generate code that calls these assembly instructions. (By contrast, a Java compiler will not. Java evaluates trig functions in software rather than hardware, and so it runs much slower.)
Chips do not use Taylor series to compute trig functions, at least not entirely. First of all they use CORDIC, but they may also use a short Taylor series to polish up the result of CORDIC or for special cases such as computing sine with high relative accuracy for very small angles. For more explanation, see this StackOverflow answer.
OK kiddies, time for the pros....
This is one of my biggest complaints with inexperienced software engineers. They come in calculating transcendental functions from scratch (using Taylor's series) as if nobody had ever done these calculations before in their lives. Not true. This is a well defined problem and has been approached thousands of times by very clever software and hardware engineers and has a well defined solution.
Basically, most of the transcendental functions use Chebyshev Polynomials to calculate them. As to which polynomials are used depends on the circumstances. First, the bible on this matter is a book called "Computer Approximations" by Hart and Cheney. In that book, you can decide if you have a hardware adder, multiplier, divider, etc, and decide which operations are fastest. e.g. If you had a really fast divider, the fastest way to calculate sine might be P1(x)/P2(x) where P1, P2 are Chebyshev polynomials. Without the fast divider, it might be just P(x), where P has much more terms than P1 or P2....so it'd be slower. So, first step is to determine your hardware and what it can do. Then you choose the appropriate combination of Chebyshev polynomials (is usually of the form cos(ax) = aP(x) for cosine for example, again where P is a Chebyshev polynomial). Then you decide what decimal precision you want. e.g. if you want 7 digits precision, you look that up in the appropriate table in the book I mentioned, and it will give you (for precision = 7.33) a number N = 4 and a polynomial number 3502. N is the order of the polynomial (so it's p4.x^4 + p3.x^3 + p2.x^2 + p1.x + p0), because N=4. Then you look up the actual value of the p4,p3,p2,p1,p0 values in the back of the book under 3502 (they'll be in floating point). Then you implement your algorithm in software in the form:
(((p4.x + p3).x + p2).x + p1).x + p0
....and this is how you'd calculate cosine to 7 decimal places on that hardware.
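The nested form above is Horner's rule. A minimal generic sketch follows, with the coefficients passed in by the caller since the actual table values from the book are not reproduced here. The name horner and the lowest-degree-first ordering are my choices.

```cpp
#include <cstddef>
#include <vector>

// Evaluates p0 + p1*x + ... + pN*x^N as (((pN*x + pN-1)*x + ...)*x + p0).
// coeffs[k] is the coefficient of x^k, so coeffs = {p0, p1, ..., pN}.
double horner(const std::vector<double>& coeffs, double x) {
    double result = 0.0;
    for (std::size_t k = coeffs.size(); k-- > 0;) {
        result = result * x + coeffs[k];  // one multiply and one add per coefficient
    }
    return result;
}
```

For a degree 4 polynomial this costs 4 multiplications and 4 additions, with no powers computed explicitly.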
Note that most hardware implementations of transcendental operations in an FPU usually involve some microcode and operations like this (depends on the hardware).
Chebyshev polynomials are used for most transcendentals but not all. e.g. Square root is faster using a double iteration of the Newton-Raphson method, starting from a lookup table.
Again, that book "Computer Approximations" will tell you that.
If you plan on implementing these functions, I'd recommend to anyone that they get a copy of that book. It really is the bible for these kinds of algorithms.
Note that there are bunches of alternative means for calculating these values like CORDIC, etc, but these tend to be best for specific algorithms where you only need low precision. To guarantee the precision every time, the Chebyshev polynomials are the way to go. Like I said, well defined problem. Has been solved for 50 years now.....and that's how it's done.
Now, that being said, there are techniques whereby the Chebyshev polynomials can be used to get a single precision result with a low degree polynomial (like the example for cosine above). Then, there are other techniques to interpolate between values to increase the accuracy without having to go to a much larger polynomial, such as "Gal's Accurate Tables Method". This latter technique is what the post referring to the ACM literature is referring to. But ultimately, the Chebyshev Polynomials are what are used to get 90% of the way there.
Enjoy.
For sin specifically, using Taylor expansion would give you:
sin(x) := x - x^3/3! + x^5/5! - x^7/7! + ... (1)
you would keep adding terms until either the difference between them is lower than an accepted tolerance level or just for a finite amount of steps (faster, but less precise). An example would be something like:
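// Named sin_taylor here so it does not clash with sin() from <math.h>.
float sin_taylor(float x)
{
    float res = 0, pow = x, fact = 1;
    for (int i = 0; i < 5; ++i)
    {
        res += pow / fact;                          // add the current term
        pow *= -1 * x * x;                          // next odd power, sign flipped
        fact *= (2 * (i + 1)) * (2 * (i + 1) + 1);  // next odd factorial
    }
    return res;
}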
Note: (1) works because of the approximation sin(x)=x for small angles. For bigger angles you need to calculate more and more terms to get acceptable results.
You can use a while loop and continue until a certain accuracy is reached:
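#include <math.h>  // fabs

// Named sin_series here so it does not clash with sin() from <math.h>.
double sin_series(double x) {
    int i = 1;
    double cur = x;   // running sum, starts at the first term
    double acc = 1;   // most recent term added
    double fact = 1;
    double pow = x;
    while (fabs(acc) > .00000001 && i < 100) {
        fact *= ((2 * i) * (2 * i + 1));
        pow *= -1 * x * x;
        acc = pow / fact;
        cur += acc;
        i++;
    }
    return cur;
}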
Concerning trigonometric function like sin(), cos(),tan() there has been no mention, after 5 years, of an important aspect of high quality trig functions: Range reduction.
An early step in any of these functions is to reduce the angle, in radians, to a range of a 2*π interval. But π is irrational so simple reductions like x = remainder(x, 2*M_PI) introduce error as M_PI, or machine pi, is an approximation of π. So, how to do x = remainder(x, 2*π)?
Early libraries used extended precision or crafted programming to give quality results but still over a limited range of double. When a large value was requested like sin(pow(2,30)), the results were meaningless or 0.0 and maybe with an error flag set to something like TLOSS total loss of precision or PLOSS partial loss of precision.
Good range reduction of large values to an interval like -π to π is a challenging problem that rivals the challenges of the basic trig function, like sin(), itself.
A good report is Argument reduction for huge arguments: Good to the last bit (1992). It covers the issue well: it discusses the need and how things were on various platforms (SPARC, PC, HP, 30+ others) and provides a solution algorithm that gives quality results for all double values from -DBL_MAX to DBL_MAX.
If the original arguments are in degrees, yet may be of a large value, use fmod() first for improved precision. A good fmod() will introduce no error and so provide excellent range reduction.
// sin(degrees2radians(x))
sin(degrees2radians(fmod(x, 360.0))); // -360.0 < fmod(x,360) < +360.0
Various trig identities and remquo() offer even more improvement. Sample: sind()
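As a rough sketch of the degrees case, assuming a helper named sin_degrees (my name, not a standard function) and a literal for π since M_PI is not guaranteed by the standard:

```cpp
#include <cmath>

// Reduces in degrees first, where the period 360 is exactly representable,
// then converts the small remainder to radians.
double sin_degrees(double degrees) {
    const double pi = 3.14159265358979323846;
    const double reduced = std::fmod(degrees, 360.0);  // exact, result in (-360, 360)
    return std::sin(reduced * pi / 180.0);
}
```

Because fmod is exact, a huge input such as 360000030 degrees reduces to exactly 30 degrees before any approximation of π is involved.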
Yes, there are software algorithms for calculating sin too. Basically, calculating this kind of thing with a digital computer is usually done using numerical methods, like approximating the Taylor series representing the function.
Numerical methods can approximate functions to an arbitrary degree of accuracy, and since the accuracy you have in a floating point number is finite, they suit these tasks pretty well.
Use the Taylor series and try to find a relation between consecutive terms of the series so you don't calculate things again and again.
Here is an example for cosine:
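#include <math.h>  // fabs

double cosinus(double x, double prec)
{
    double t, s;
    int p;
    p = 0;
    s = 1.0;  // running sum, starts at the first term
    t = 1.0;  // current term
    while (fabs(t / s) > prec)
    {
        p++;
        // next term from the previous one: multiply by -x^2 / ((2p-1)(2p))
        t = (-t * x * x) / ((2 * p - 1) * (2 * p));
        s += t;
    }
    return s;
}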
Using this, we can get the new term of the sum from the previous one (we avoid recomputing the factorial and x^(2p)).
It is a complex question. Intel-like CPUs of the x86 family have a hardware implementation of the sin() function, but it is part of the x87 FPU and not used anymore in 64-bit mode (where SSE2 registers are used instead). In that mode, a software implementation is used.
There are several such implementations out there. One is in fdlibm and is used in Java. As far as I know, the glibc implementation contains parts of fdlibm, and other parts contributed by IBM.
Software implementations of transcendental functions such as sin() typically use approximations by polynomials, often obtained from Taylor series.
Chebyshev polynomials, as mentioned in another answer, are the polynomials where the largest difference between the function and the polynomial is as small as possible. That is an excellent start.
In some cases, the maximum error is not what you are interested in, but the maximum relative error. For example for the sine function, the error near x = 0 should be much smaller than for larger values; you want a small relative error. So you would calculate the Chebyshev polynomial for sin x / x, and multiply that polynomial by x.
Next you have to figure out how to evaluate the polynomial. You want to evaluate it in such a way that the intermediate values are small and therefore rounding errors are small. Otherwise the rounding errors might become a lot larger than errors in the polynomial. And with functions like the sine function, if you are careless then it may be possible that the result that you calculate for sin x is greater than the result for sin y even when x < y. So careful choice of the calculation order and calculation of upper bounds for the rounding error are needed.
For example, sin x = x - x^3/6 + x^5 / 120 - x^7 / 5040... If you calculate naively sin x = x * (1 - x^2/6 + x^4/120 - x^6/5040...), then that function in parentheses is decreasing, and it will happen that if y is the next larger number to x, then sometimes sin y will be smaller than sin x. Instead, calculate sin x = x - x^3 * (1/6 - x^2 / 120 + x^4/5040...) where this cannot happen.
When calculating Chebyshev polynomials, you usually need to round the coefficients to double precision, for example. But while a Chebyshev polynomial is optimal, the Chebyshev polynomial with coefficients rounded to double precision is not the optimal polynomial with double precision coefficients!
For example for sin (x), where you need coefficients for x, x^3, x^5, x^7 etc. you do the following: Calculate the best approximation of sin x with a polynomial (ax + bx^3 + cx^5 + dx^7) with higher than double precision, then round a to double precision, giving A. The difference between a and A would be quite large. Now calculate the best approximation of (sin x - Ax) with a polynomial (b x^3 + cx^5 + dx^7). You get different coefficients, because they adapt to the difference between a and A. Round b to double precision B. Then approximate (sin x - Ax - Bx^3) with a polynomial cx^5 + dx^7 and so on. You will get a polynomial that is almost as good as the original Chebyshev polynomial, but much better than Chebyshev rounded to double precision.
Next you should take into account the rounding errors in the choice of polynomial. You found a polynomial with minimum error in the polynomial ignoring rounding error, but you want to optimise polynomial plus rounding error. Once you have the Chebyshev polynomial, you can calculate bounds for the rounding error. Say f (x) is your function, P (x) is the polynomial, and E (x) is the rounding error. You don't want to optimise | f (x) - P (x) |, you want to optimise | f (x) - P (x) +/- E (x) |. You will get a slightly different polynomial that tries to keep the polynomial errors down where the rounding error is large, and relaxes the polynomial errors a bit where the rounding error is small.
All this will easily get you rounding errors of at most 0.55 times the last bit, where +,-,*,/ have rounding errors of at most 0.50 times the last bit.
The actual implementation of library functions is up to the specific compiler and/or library provider. Whether it's done in hardware or software, whether it's a Taylor expansion or not, etc., will vary.
I realize that's absolutely no help.
There's nothing like hitting the source and seeing how someone has actually done it in a library in common use; let's look at one C library implementation in particular. I chose uClibc.
Here's the sin function:
http://git.uclibc.org/uClibc/tree/libm/s_sin.c
which looks like it handles a few special cases, and then carries out some argument reduction to map the input to the range [-pi/4,pi/4], (splitting the argument into two parts, a big part and a tail) before calling
http://git.uclibc.org/uClibc/tree/libm/k_sin.c
which then operates on those two parts.
If there is no tail, an approximate answer is generated using a polynomial of degree 13.
If there is a tail, you get a small corrective addition based on the principle that sin(x+y) = sin(x) + sin'(x')y
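The principle behind that correction is the first-order expansion sin(x+y) ≈ sin(x) + cos(x)·y for a tiny y. The sketch below only demonstrates the idea using the standard library; it is not how the uClibc kernel is written, and the name sin_with_tail is made up.

```cpp
#include <cmath>

// head is the main part of the reduced argument, tail the small remainder
// that did not fit into head. First-order correction: sin'(head) = cos(head).
double sin_with_tail(double head, double tail) {
    return std::sin(head) + std::cos(head) * tail;
}
```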
They are typically implemented in software and will not use the corresponding hardware (that is, assembly) calls in most cases. However, as Jason pointed out, these are implementation specific.
Note that these software routines are not part of the compiler sources, but will rather be found in the corresponding library, such as the clib, or glibc for the GNU compiler. See http://www.gnu.org/software/libc/manual/html_mono/libc.html#Trig-Functions
If you want greater control, you should carefully evaluate what you need exactly. Some of the typical methods are interpolation of look-up tables, the assembly call (which is often slow), or other approximation schemes such as Newton-Raphson for square roots.
If you want an implementation in software, not hardware, the place to look for a definitive answer to this question is Chapter 5 of Numerical Recipes. My copy is in a box, so I can't give details, but the short version (if I remember this right) is that you take tan(theta/2) as your primitive operation and compute the others from there. The computation is done with a series approximation, but it's something that converges much more quickly than a Taylor series.
Sorry I can't remember more without getting my hands on the book.
Whenever such a function is evaluated, then at some level there is most likely either:
A table of values which is interpolated (for fast, inaccurate applications - e.g. computer graphics)
The evaluation of a series that converges to the desired value, probably not a Taylor series, more likely something based on a fancy quadrature like Clenshaw-Curtis.
If there is no hardware support then the compiler probably uses the latter method, emitting only assembler code (with no debug symbols), rather than using a c library --- making it tricky for you to track the actual code down in your debugger.
If you want to look at the actual GNU implementation of those functions in C, check out the latest trunk of glibc. See the GNU C Library.
As many people pointed out, it is implementation dependent. But as far as I understand your question, you were interested in a real software implementation of math functions, but just didn't manage to find one. If this is the case then here you are:
Download glibc source code from http://ftp.gnu.org/gnu/glibc/
Look at file dosincos.c located in unpacked glibc root\sysdeps\ieee754\dbl-64 folder
Similarly you can find implementations of the rest of the math library, just look for the file with appropriate name
You may also have a look at the files with the .tbl extension; their contents are nothing more than huge tables of precomputed values of different functions in binary form. That is why the implementation is so fast: instead of computing all the coefficients of whatever series they use, they just do a quick lookup, which is much faster. BTW, they do use Taylor series to calculate sine and cosine.
I hope this helps.
I'll try to answer for the case of sin() in a C program, compiled with GCC's C compiler on a current x86 processor (let's say an Intel Core 2 Duo).
In the C language the Standard C Library includes common math functions, not included in the language itself (e.g. pow, sin and cos for power, sine, and cosine respectively). The headers of which are included in math.h.
Now on a GNU/Linux system, these library functions are provided by glibc (GNU libc or GNU C Library). But the GCC compiler wants you to link to the math library (libm.so) using the -lm compiler flag to enable usage of these math functions. I'm not sure why it isn't part of the standard C library. These would be a software version of the floating point functions, or "soft-float".
Aside: The reason for having the math functions separate is historic, and was merely intended to reduce the size of executable programs in very old Unix systems, possibly before shared libraries were available, as far as I know.
Now the compiler may optimize the standard C library function sin() (provided by libm.so) by replacing it with a call to a native instruction for your CPU/FPU's built-in sin() function, which exists as an FPU instruction (FSIN for x86/x87) on newer processors like the Core 2 series (this is correct pretty much as far back as the i486DX). This would depend on optimization flags passed to the gcc compiler. If the compiler was told to write code that would execute on any i386 or newer processor, it would not make such an optimization. The -mcpu=486 flag would inform the compiler that it was safe to make such an optimization.
Now if the program executed the software version of the sin() function, it would do so based on a CORDIC (COordinate Rotation DIgital Computer) or BKM algorithm, or more likely a table or power-series calculation which is commonly used now to calculate such transcendental functions. [Src: http://en.wikipedia.org/wiki/Cordic#Application]
Any recent (since 2.9x approx.) version of gcc also offers a built-in version of sin, __builtin_sin(), that it will use to replace the standard call to the C library version, as an optimization.
I'm sure that is as clear as mud, but hopefully gives you more information than you were expecting, and lots of jumping off points to learn more yourself.
Don't use Taylor series. Chebyshev polynomials are both faster and more accurate, as pointed out by a couple of people above. Here is an implementation (originally from the ZX Spectrum ROM): https://albertveli.wordpress.com/2015/01/10/zx-sine/
Computing sine/cosine/tangent is actually very easy to do through code using the Taylor series. Writing one yourself takes like 5 seconds.
The whole process can be summed up with the Taylor series, which for sine is:
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
Here are some routines I wrote for C:
double _pow(double a, double b) {
    double c = 1;
    for (int i=0; i<b; i++)
        c *= a;
    return c;
}
double _fact(double x) {
    double ret = 1;
    for (int i=1; i<=x; i++)
        ret *= i;
    return ret;
}
double _sin(double x) {
    double y = x;
    double s = -1;
    for (int i=3; i<=100; i+=2) {
        y+=s*(_pow(x,i)/_fact(i));
        s *= -1;
    }
    return y;
}
double _cos(double x) {
    double y = 1;
    double s = -1;
    for (int i=2; i<=100; i+=2) {
        y+=s*(_pow(x,i)/_fact(i));
        s *= -1;
    }
    return y;
}
double _tan(double x) {
    return (_sin(x)/_cos(x));
}
Improved version of code from Blindy's answer
#define PI 3.14159265358979323846
#define EPSILON .0000000000001
// this is smallest effective threshold, at least on my OS (WSL ubuntu 18)
// possibly because factorial part turns 0 at some point
// and it happens faster than series element turns 0;
// validation was made against sin() from <math.h>
double ft_sin(double x)
{
    int k = 2;
    double acc = 1;
    double den = 1;
    // precision drops rapidly when x is not close to 0
    // so move x to 0 as close as possible;
    // sin has period 2 * PI, so shift by whole periods only
    // (shifting by PI alone would flip the sign of the result)
    while (x > PI)
        x -= 2 * PI;
    while (x < -PI)
        x += 2 * PI;
    if (x > PI / 2)
        return (ft_sin(PI - x));
    if (x < -PI / 2)
        return (ft_sin(-PI - x));
    // start the sum and the numerator from the reduced x, not the original
    double r = x;
    double num = x;
    // not using fabs for performance reasons
    while (acc > EPSILON || acc < -EPSILON)
    {
        num *= -x * x;
        den *= k * (k + 1);
        acc = num / den;
        r += acc;
        k += 2;
    }
    return (r);
}
The essence of how it does this lies in this excerpt from Applied Numerical Analysis by Gerald Wheatley:
When your software program asks the computer to get a value of
or , have you wondered how it can get the
values if the most powerful functions it can compute are polynomials?
It doesnt look these up in tables and interpolate! Rather, the
computer approximates every function other than polynomials from some
polynomial that is tailored to give the values very accurately.
A few points to mention on the above: some algorithms do in fact interpolate from a table, albeit only for the first few iterations. Also note how it mentions that computers utilise approximating polynomials without specifying which type of approximating polynomial. As others in the thread have pointed out, Chebyshev polynomials are more efficient than Taylor polynomials in this case.
if you want sin then
__asm__ __volatile__("fsin" : "=t"(vsin) : "0"(xrads));
if you want cos then
__asm__ __volatile__("fcos" : "=t"(vcos) : "0"(xrads));
if you want sqrt then
__asm__ __volatile__("fsqrt" : "=t"(vsqrt) : "0"(value));
so why use inaccurate code when the machine instructions will do?

Predefinition of often used values in computations - does it change anything?

I'm auto generating C code to compute large expressions and try to figure out with simple examples whether it makes sense to predefine certain subparts in separate variables.
As a simple example, say we compute something of the form:
#include <cmath>
double test(double x, double y) {
const double c[9][9] = { ... }; // constants properly initialized, irrelevant
double expr = c[0][0]*x*y
+ c[1][0]*pow(x,2)*y + ... + c[8][0]*pow(x,9)*y
+ c[1][1]*pow(x,2)*pow(y,2) + ... + c[8][1]*pow(x,9)*pow(y,2)
+ ...
with all c[i][j] properly initialized. In reality those expressions contain tens of millions of multiplications and additions.
A colleague now proposed -- to reduce the number of calls to pow() and to cache often needed values in the expressions -- to define every power of x and y in a separate variable, which is no big deal as the code is auto generated anyway, like this:
double xp2 = pow(x,2);
double xp3 = pow(x,3);
double xp4 = pow(x,4);
// ...
// same for pow(y,n)
I think, however, that this is unnecessary, as the compiler should take care of these optimizations.
Unfortunately, I have no experience with reading and interpreting assembly but I think I see that all the calls to pow() are optimized out, is this right? Also, does the compiler cache the values for pow(x,2), pow(x,3), etc?
Thanks in advance for your input!
Using pow with integer arguments... ouch ! Typical implementations of pow are tuned for the general case of floating point arguments, which is why it is usually way slower to write
pow(x, 2) ( = exp(2 * log(x)) )
than
x * x
What I state here is very compiler dependent though. On one hand, some compilers may not even know that pow(x, 2) will yield the same value for a given x (after all, the extern function pow could have side effects), so you don't have any guarantee that common subexpressions will be eliminated. The pow function, on some (many?) platforms/toolchains, is provided by a library the compiler has no control over.
On other implementations though, the compiler may turn those pow calls into multiplications, or at least into intrinsics, which may in turn specialize for integer exponents. Your mileage will vary.
The first thing I'd do is to replace calls to pow with multiplications. For larger exponents, you may also do, e.g.
double x2 = x * x;
double x3 = x * x2;
double x4 = x2 * x2;
Note that (credits to @Stephen Canon) doing repeated multiplications (with the above quick exponentiation scheme) will introduce roundoff error whose magnitude is proportional to the number of multiplications (i.e. O(log exponent)). This error is usually tolerable, whereas a good pow implementation is accurate to within about one unit of least precision.
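The quick exponentiation scheme above generalizes to any non-negative integer exponent. A minimal sketch follows; the name ipow is hypothetical (it is not a standard library function), and the roundoff trade-off just described still applies.

```cpp
// Hypothetical helper (not part of <cmath>): x raised to a non-negative
// integer power by repeated squaring, using O(log n) multiplications.
double ipow(double x, unsigned n) {
    double result = 1.0;
    while (n != 0) {
        if (n & 1u) {
            result *= x;  // this bit of the exponent is set
        }
        x *= x;           // x, x^2, x^4, x^8, ...
        n >>= 1;
    }
    return result;
}
```

In auto-generated code it is probably simpler to emit the products directly, as in the x2, x3, x4 lines above, and keep a helper like this only for exponents that are not known at generation time.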
The compiler may perform common subexpression elimination. Remember that it can't guarantee that all functions are re-entrant, but if pow is inlined, then it may well do this.
A good way to compute polynomials is Horner's rule, which doesn't require pow() or any extra memory.
Your expression is x*y times a polynomial in y each of whose coefficients is a polynomial in x.
Each of these coefficients can be calculated using Horner with 8 multiplies and additions, and the polynomial in y with 8 more multiplies and additions, for a total of 74 multiplies and 72 additions, whereas your sample code looks to me like more than 200 multiplications and more than a hundred calls to pow().
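As a rough sketch of that nested Horner evaluation: the code below assumes the expression is the sum over all i, j of c[i][j] * x^(i+1) * y^(j+1) with a full 9x9 coefficient grid. The question elides the exact term pattern, so that layout, and the names column_poly and horner2d, are assumptions made for illustration.

```cpp
#include <cstddef>

constexpr std::size_t N = 9;  // assumed size, matching c[9][9] in the question

// Coefficient of y^j: a polynomial in x, evaluated by Horner's rule.
double column_poly(const double (&c)[N][N], std::size_t j, double x) {
    double acc = c[N - 1][j];
    for (std::size_t i = N - 1; i-- > 0;) {
        acc = acc * x + c[i][j];
    }
    return acc;
}

// Evaluates sum over i, j of c[i][j] * x^(i+1) * y^(j+1) without calling pow().
// Assumption: every c[i][j] takes part in the sum.
double horner2d(const double (&c)[N][N], double x, double y) {
    double acc = column_poly(c, N - 1, x);
    for (std::size_t j = N - 1; j-- > 0;) {
        acc = acc * y + column_poly(c, j, x);
    }
    return x * y * acc;  // the common factor x*y is applied once at the end
}
```

A code generator could also unroll both loops and emit the nested expression directly, which keeps the generated function free of array indexing at run time.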
pow may be optimized away depending on the toolchain. The only way you can tell is to try it and see.
In the general case, unless the implementation of pow is visible to the compiler as a macro or inline, then the compiler can't cache the result as it doesn't know what side-effects the function may have.
Profile, find out where the bottlenecks are.
If the sub-expressions are used frequently, it may make sense to cache or store the intermediate values. However, accessing these values may take more time than letting the values sit in a data pipeline within the processor. Data fetches outside of the processor are much slower than fetching from its internal data cache.
Also try using algebra to simplify the mathematical expressions. Perhaps even linear algebra to find some more efficient matrix expressions.
You may want to isolate the calculations to expressions involving one variable. Compilers can optimize code better when only one variable is used or changing at a time. For example, substitute the y variable with expressions involving x, if possible. This would reduce to an expression only involving x.
Also search the web for "data driven design" or "data oriented design". These sites show how to optimize code for data centric applications.

A good way to do a fast divide in C++?

Sometimes I see and have used the following variation for a fast divide in C++ with floating point numbers.
// orig loop
double y = 44100.0;
for(int i=0; i<10000; ++i) {
double z = x / y;
}
// alternative
double y = 44100;
double y_div = 1.0 / y;
for(int i=0; i<10000; ++i) {
double z = x * y_div;
}
But someone hinted recently that this might not be the most accurate way.
Any thoughts?
On just about every CPU, a floating point divide is several times as expensive as a floating point multiply, so multiplying by the inverse of your divisor is a good optimization. The downside is that you may lose a very small amount of accuracy on certain processors. For example, on x86 processors using the x87 FPU in its default mode, 64-bit float operations are actually computed internally using 80 bits, and storing the result in a variable causes those extra precision bits to be rounded away according to your FPU rounding mode (which defaults to nearest). This only really matters if you are chaining many float operations and have to worry about error accumulation.
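To get a feel for how small the accuracy loss is, the two forms can be compared directly. This is a minimal sketch, assuming IEEE 754 doubles and no -ffast-math; the function name max_relative_difference is hypothetical.

```cpp
#include <cmath>

// Largest relative difference between x / y and x * (1.0 / y) over
// x = 1, 2, ..., n. Hypothetical helper for illustration only.
double max_relative_difference(double y, int n) {
    const double y_inv = 1.0 / y;  // rounded once here ...
    double worst = 0.0;
    for (int i = 1; i <= n; ++i) {
        const double x = static_cast<double>(i);
        const double quotient = x / y;     // a single rounding
        const double product = x * y_inv;  // ... and rounded again here
        const double rel = std::fabs(quotient - product) / std::fabs(quotient);
        if (rel > worst) {
            worst = rel;
        }
    }
    return worst;
}
```

When y is a power of two the reciprocal is exact and the two forms agree. For a divisor such as 44100.0 the results can differ, but only by a few units in the last place at most.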
Wikipedia agrees that this can be faster. The linked article also contains several other fast division algorithms that might be of interest.
I would guess that any industrial-strength modern compiler will make that optimization for you if it is going to profit you at all.
Your original
// original loop:
double y = 44100.0;
for(int i=0; i<10000; ++i) {
double z = x / y;
}
can easily be optimized to
// haha:
double y = 44100.0;
double z = x / y;
and the performance is pretty nice. ;-)
EDIT: People keep voting this down, so here's the not so funny version:
If there were a general way to make division faster for all cases, don't you think compiler writers might have happened upon it by now? Of course they would have done. Also, some of the people doing FPU circuits aren't exactly stupid, either.
So the only way you're going to get better performance is to know what specific situation you have at hand and write optimal code for that. Most likely this is a complete waste of your time, because your program is slow for some other reason such as performing math on loop invariants. Go find a better algorithm instead.
In your example using gcc the division with the options -O3 -ffast-math yields the same code as the multiplication without -ffast-math. (In a testing environment with enough volatiles around that the loop is still there.)
So if you really want to optimise those divisions away and don’t care about the consequences, that’s the way to go. Multiplication seems to be roughly 15 times faster, btw.
Multiplication is faster than division, so the second method is faster. It might be slightly less accurate, but unless you are doing hard-core numerics the level of accuracy should be more than enough.
When processing audio, I prefer to use fixed point math instead. I suppose this depends on the level of precision you need. But, let's assume that 16.16 fixed point integers will do (meaning high 16 bits is whole number, low 16 is the fraction). Now, all calculation can be done as simple integer math:
unsigned int y = 44100u << 16; // unsigned literal: the shifted value does not fit in a signed int
unsigned int z = x / (y >> 16); // divisor must be the whole number portion
Or with macros to help:
#define FP_INT(x) ((x) << 16)
#define FP_MUL(x, y) ((x) * ((y) >> 16))
#define FP_DIV(x, y) ((x) / ((y) >> 16))
unsigned int y = FP_INT(44100u);
unsigned int z = FP_DIV(x, y); // same division as in the snippet above
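The snippets above discard the fractional bits of the second operand. A minimal sketch of a 16.16 multiply and divide that keeps them is shown here, assuming unsigned values and a 64-bit intermediate; the names fixed16, to_fixed, fixed_mul and fixed_div are hypothetical.

```cpp
#include <cstdint>

// Unsigned 16.16 fixed point: high 16 bits whole part, low 16 bits fraction.
using fixed16 = std::uint32_t;

constexpr fixed16 to_fixed(std::uint32_t whole) {
    return whole << 16;  // caller must keep whole below 65536
}

constexpr fixed16 fixed_mul(fixed16 a, fixed16 b) {
    // Widen first so the intermediate product cannot overflow 32 bits.
    return static_cast<fixed16>((static_cast<std::uint64_t>(a) * b) >> 16);
}

constexpr fixed16 fixed_div(fixed16 a, fixed16 b) {
    // Pre-shift the dividend so the quotient keeps its 16 fractional bits.
    return static_cast<fixed16>((static_cast<std::uint64_t>(a) << 16) / b);
}
```

For example, fixed_div(to_fixed(6), to_fixed(4)) yields 0x18000, which is 1.5 in 16.16 format. Results that do not fit in 16 whole bits are silently truncated, so range checks are left to the caller.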
Audio, hunh? It's not just 44,100 divisions per second when you have, say, five tracks of audio running at once. Even a simple fader consumes cycles, after all. And that's just for a fairly bare-bones, minimal example -- what if you want to have, say, an eq and a compressor? Maybe a little reverb? Your total math budget, so to speak, gets eaten up quickly. It does make sense to wring out a little extra performance in those cases.
Profilers are good. Profilers are your friend. Profilers deserve blowjobs and pudding. But you already know where the main bottleneck is in audio work -- it's in the loop that processes samples, and the faster you can make that, the happier your users will be. Use everything you can! Multiply by reciprocals, shift bits whenever possible (exp(x+y) = exp(x)*exp(y), after all), use lookup tables, refer to variables by reference instead of values (less pushing/popping on the stack), refactor terms, and so forth. (If you're good, you'll laugh at these elementary optimizations.)
I presume from the original post that x is not a constant shown there but probably data from an array so x[i] is likely to be the source of the data and similarly for the output, it will be stored somewhere in memory.
I suggest that if the loop count really is 10,000 as in the original post, it will make little difference which you use, as the whole loop won't even take a fraction of a millisecond on a modern CPU. If the loop count really is very much higher, perhaps 1,000,000 or more, then I would expect the cost of memory access to make the faster operation completely irrelevant, as it will always be waiting for the data anyway.
I suggest trying both with your code and testing if it actually makes any significant difference in run time and if it doesn't then just write the straightforward division if that's what the algorithm needs.
Here's the problem with doing it with a reciprocal: you still have to do the division to get 1/Y before you can actually divide by Y. Unless you're only ever dividing by Y, in which case I suppose this may be useful. Otherwise this is not very practical, since division is done in binary with similar algorithms.
Are you looping 10,000 times simply to make the code take long enough to measure the time easily? Or do you have 10000 numbers to divide by the same number? If the former, put the "y_div = 1.0 / y;" inside the loop, because it's part of the operation.
If the latter, yes, floating point multiplication is generally faster than division. Don't change your code from the obvious to the arcane based on guesses, though. Benchmark first to find slow spots, and then optimize those (and take measurements before and after to make sure your idea actually causes an improvement).
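A before-and-after measurement could look something like the sketch below. It assumes the samples live in a std::vector<double>; the kernel names and the seconds_for helper are hypothetical, and a serious benchmark would repeat each run and guard against the optimizer removing the work.

```cpp
#include <chrono>
#include <vector>

// Hypothetical kernels: the two variants from the question, applied in place.
void scale_by_division(std::vector<double>& data, double y) {
    for (double& v : data) {
        v = v / y;
    }
}

void scale_by_reciprocal(std::vector<double>& data, double y) {
    const double y_inv = 1.0 / y;  // one division, hoisted out of the loop
    for (double& v : data) {
        v = v * y_inv;
    }
}

// Returns the wall-clock seconds taken by one call to kernel(data, y).
template <typename Kernel>
double seconds_for(Kernel kernel, std::vector<double>& data, double y) {
    const auto start = std::chrono::steady_clock::now();
    kernel(data, y);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling seconds_for(scale_by_division, samples, 44100.0) and seconds_for(scale_by_reciprocal, samples, 44100.0) on copies of the same realistic buffer, built with your production compiler flags, gives two numbers to compare. Taking the minimum or median over several runs makes the comparison less noisy.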
On old CPUs like the 80286, floating point maths was abysmally slow and we employed lots of trickiness to speed things up.
On modern CPUs floating point maths is blindingly fast and optimising compilers can generally do wonders with fine-tuning things.
It is almost never worth the effort to employ little micro-optimisations like that.
Try to make your code simple and idiot-proof. Only if you find a real bottleneck (using a profiler) should you think about optimisations in your floating point calculations.