I am trying to test a series of libraries for matrix-vector computations. For that I just make a large loop and inside it I call the routine I want to time. Very simple. However, I sometimes see that when I increase the compiler's optimization level, the time drops to zero no matter how large the loop is. See the example below, where I try to time a C macro that computes cross products. What is the compiler doing? How can I avoid it while still allowing maximum optimization for floating-point arithmetic? Thank you in advance.
The example below was compiled using g++ 4.7.2 on a computer with an i5 intel processor.
Using optimization level 1 (-O1) it takes 0.35 seconds. At level two (-O2) or higher it drops to zero. Remember, I want to time this, so I want the computations to actually happen, even if they are unnecessary for this simple test.
#include <iostream>
#include <ctime>   // for clock() and CLOCKS_PER_SEC
using namespace std;

typedef double Vector[3];

#define VecCross(A,assign_op,B,dummy_op,C) \
( A[0] assign_op (B[1] * C[2]) - (B[2] * C[1]), \
  A[1] assign_op (B[2] * C[0]) - (B[0] * C[2]), \
  A[2] assign_op (B[0] * C[1]) - (B[1] * C[0]) \
)

double get_time(){
    return clock()/(double)CLOCKS_PER_SEC;
}

int main()
{
    unsigned long n = 1000000000u;
    double start;

    {   //C macro cross product
        Vector u = {1,0,0};
        Vector v = {1,1,0};
        Vector w = {1.2,1.2,1.2};

        start = get_time();
        for(unsigned long i=0;i<n;i++){
            VecCross (w, =, u, X, v);
        }
        cout << "C macro cross product: " << get_time()-start << endl;
    }
    return 0;
}
Ask yourself, what does your program actually do, in terms of what is visible to the end-user?
It displays the result of a calculation: get_time()-start. The contents of your loop have no bearing on the outcome of that calculation, because you never actually use the variables being modified inside the loop.
Therefore, the compiler optimises out the entire loop since it is irrelevant.
One solution is to output the final state of the variables being modified in the loop, as part of your cout statement, thus forcing the compiler to compute the loop. However, a smart compiler could also figure out that the loop always calculates the same thing, and it can simply insert the result directly into your cout statement, because there's no need to actually calculate it at run-time. As a workaround to this, you could for example require that one of the inputs to the loop be provided at run-time (e.g. read it in from a file, command line argument, cin, etc.).
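A minimal sketch of that combination, reusing the macro and get_time() from the question (the argc-based seed and the extra print of w are the only additions, and note that a sufficiently clever compiler could still hoist the loop-invariant computation out of the loop):

int main(int argc, char** argv)
{
    unsigned long n = 1000000000u;
    // argc is not known at compile time, so the result cannot be constant-folded
    Vector u = {(double)argc, 0, 0};
    Vector v = {1, 1, 0};
    Vector w = {1.2, 1.2, 1.2};

    double start = get_time();
    for (unsigned long i = 0; i < n; i++){
        VecCross (w, =, u, X, v);
    }
    // printing w makes the loop's result observable
    cout << "C macro cross product: " << get_time()-start
         << "  w = " << w[0] << " " << w[1] << " " << w[2] << endl;
    return 0;
}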
For more (and possibly better) solutions, check out this duplicate thread: Force compiler to not optimize side-effect-less statements
Related
I'm trying to speed up a big block of code across many files and found out that one function uses about 70% of the total time. This is because this function is called 477+ million times.
The pointer array par can only be one of two presets, either
par[0] = 0.057;
par[1] = 2.87;
par[2] = -3.;
par[3] = -0.03;
par[4] = -3.05;
par[5] = -3.5;
OR
par[0] = 0.043;
par[1] = 2.92;
par[2] = -3.21;
par[3]= -0.065;
par[4] = -3.00;
par[5] = -2.65;
So I've tried plugging in numbers depending on which preset it is but have failed to find any significant time saves.
The pow and exp functions seem to be called on nearly every call, and they take up about 40 and 20 percent of the total time respectively, so only 10% of the total time is used by the parts of this function that aren't pow or exp. Finding ways to speed those up would probably help the most, but none of the exponents used in pow are integers except -4, and I don't know if 1/(x*x*x*x) is faster than pow(x, -4).
double Signal::Param_RE_Tterm_approx(double Tterm, double *par) {
double value = 0.;
// time after Che angle peak
if (Tterm > 0.) {
if ( fabs(Tterm/ *par) >= 1.e-2) {
value += -1./(*par)*exp(-1.*Tterm/(*par));
}
else {
value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
}
if ( fabs(Tterm* *(par+1)) >= 1.e-2) {
value += *(par+2)* *(par+1)*pow( 1.+*(par+1)*Tterm, *(par+2)-1. );
}
else {
value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
}
}
// time before Che angle peak
else {
if ( fabs(Tterm/ *(par+3)) >= 1.e-2 ) {
value += -1./ *(par+3) *exp(-1.*Tterm/ *(par+3));
}
else {
value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
}
if ( fabs(Tterm* *(par+4)) >= 1.e-2 ) {
value += *(par+5)* *(par+4) *pow( 1.+ *(par+4)*Tterm, *(par+5)-1. );
}
else {
value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
}
}
return value * 1.e9;
}
I first rewrote it to be a bit easier to follow:
#include <math.h>
double Param_RE_Tterm_approx(double Tterm, double const* par) {
double value = 0.;
if (Tterm > 0.) {
// time after Che angle peak
if ( fabs(Tterm/ par[0]) >= 1.e-2) {
value += -1./(par[0])*exp(-1.*Tterm/(par[0]));
} else {
value += -1./par[0]*(1. - Tterm/par[0] + Tterm*Tterm/(par[0]*par[0]*2.) - Tterm*Tterm*Tterm/(par[0]*par[0]*par[0]*6.) );
}
if ( fabs(Tterm* par[1]) >= 1.e-2) {
value += par[2]* par[1]*pow( 1.+par[1]*Tterm, par[2]-1. );
} else {
value += par[2]*par[1]*( 1.+(par[2]-1.)*par[1]*Tterm + (par[2]-1.)*(par[2]-1.-1.)/2.*par[1]*par[1]*Tterm*Tterm + (par[2]-1.)*(par[2]-1.-1.)*(par[2]-1.-2.)/6.*par[1]*par[1]*par[1]*Tterm*Tterm*Tterm );
}
} else {
// time before Che angle peak
if ( fabs(Tterm/ par[3]) >= 1.e-2 ) {
value += -1./ par[3] *exp(-1.*Tterm/ par[3]);
} else {
value += -1./par[3]*(1. - Tterm/par[3] + Tterm*Tterm/(par[3]*par[3]*2.) - Tterm*Tterm*Tterm/(par[3]*par[3]*par[3]*6.) );
}
if ( fabs(Tterm* par[4]) >= 1.e-2 ) {
value += par[5]* par[4] *pow( 1.+ par[4]*Tterm, par[5]-1. );
} else {
value += par[5]*par[4]*( 1.+(par[5]-1.)*par[4]*Tterm + (par[5]-1.)*(par[5]-1.-1.)/2.*par[4]*par[4]*Tterm*Tterm + (par[5]-1.)*(par[5]-1.-1.)*(par[5]-1.-2.)/6.*par[4]*par[4]*par[4]*Tterm*Tterm*Tterm );
}
}
return value * 1.e9;
}
We can then look at its structure.
There are two main branches -- Tterm negative (before) and positive (after). These correspond to using 0,1,2 or 3,4,5 in the par array.
Then in each branch we do two things to add to value. In both branches, for small arguments we use a polynomial, and for big arguments we use an exponential/power expression.
As a guess, this is because the polynomial is a decent approximation for the exponential for small values -- the error is acceptable. What you should do is confirm that guess -- take a look at the Taylor series expansion of the "big" power/exponent based equation, and see if it agrees with the polynomials somehow. Or check numerically.
If it is the case, this means that this equation has a known amount of error that is acceptable. Quite often there are faster versions of exp or pow that have a known amount of max error; consider using those.
If this isn't the case, there still could be an acceptable amount of error, but the Taylor series approximation can give you "in code" information about what is an acceptable amount of error.
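For example, a quick standalone check of the first term against its series around the 1e-2 switchover could look like this (a sketch; 0.057 is just par[0] from the first preset):

#include <cmath>
#include <cstdio>

int main()
{
    const double p0 = 0.057;   // par[0] from the first preset
    for (double t = 1e-5; t <= 1e-3; t *= 2.0) {
        double exact  = -1.0 / p0 * std::exp(-t / p0);
        double series = -1.0 / p0 * (1.0 - t / p0 + t * t / (p0 * p0 * 2.0)
                                     - t * t * t / (p0 * p0 * p0 * 6.0));
        std::printf("t=%g exact=%g series=%g rel.err=%g\n",
                    t, exact, series, std::fabs((series - exact) / exact));
    }
    return 0;
}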
A next step I'd take is to tear the 8 pieces of this equation apart. There is positive/negative, the first and second value+= in each branch, and then the polynomial/exponential case.
I'm guessing the fact that exp is taking ~1/3 the time of pow is because you have 3 calls to pow for every call to exp in your function, but you might find out something interesting like "all of our time is actually in the Tterm > 0. case" or what have you.
Now examine call sites. Is there a pattern in the Tterm you are passing this function? Ie, do you tend to pass Tterms in roughly sorted order? If so, you can do the test for which function to call outside of calling this function, and do it in batches.
Simply doing it in batches and compiling with optimization and inlining the bodies of the functions might make a surprising amount of difference; compilers are getting better at vectorizing work.
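A batched wrapper could be as simple as the sketch below (the name and signature are illustrative; it assumes the rewritten free function from above):

#include <cstddef>

double Param_RE_Tterm_approx(double Tterm, double const* par);  // the function from above

// Process many Tterm values in one call so the compiler can inline the body
// and has a chance to vectorize the hot loop.
void Param_RE_Tterm_approx_batch(const double* Tterm, double* out,
                                 std::size_t n, const double* par)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = Param_RE_Tterm_approx(Tterm[i], par);
}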
If that doesn't work, you can start threading things off. On a modern computer you can have 4-60 threads solving this problem independently, and this problem looks like you'd get nearly linear speedup. A basic threading library, like TBB, would be good for this kind of task.
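With TBB, the batch above could be split across cores roughly like this (again just a sketch):

#include <cstddef>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

double Param_RE_Tterm_approx(double Tterm, double const* par);  // the function from above

void Param_RE_Tterm_approx_batch_par(const double* Tterm, double* out,
                                     std::size_t n, const double* par)
{
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [&](const tbb::blocked_range<std::size_t>& r) {
            // each task processes an independent slice of the batch
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                out[i] = Param_RE_Tterm_approx(Tterm[i], par);
        });
}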
For the next step up, if you are getting large batches of data and you need to do a lot of processing, you can stuff it onto a GPU and solve it there. Sadly, GPU<->RAM communication is relatively slow, so simply doing the math in this function on the GPU and reading/writing back and forth with RAM won't give you much, if any, performance. But if more work than just this can go on the GPU, it might be worth it.
Only 10% of the total time is used by the parts of this function that aren't pow or exp.
If your function's performance bottleneck is exp() and pow() execution, consider using vector instructions in your calculations. All modern processors support at least the SSE2 instruction set, so this approach will definitely give at least a ~2x speed-up, because your calculation could easily be vectorized.
I recommend using this C++ vectorization library, which contains all the standard mathematical functions (such as exp and pow) and lets you write code in an OOP style without using assembly language. I have used it several times and it should work well for your problem.
If you have a GPU, you should also consider trying the CUDA framework, because, again, your problem could be perfectly vectorized. Moreover, if this function is called 477+ million times, a GPU will practically eliminate your problem...
(Partial optimization:)
The longest expression has:
- common subexpressions
- a polynomial evaluated the costly way
Pre-define these (perhaps add them to par[]):
a = par[5]*par[4];
b = (par[5]-1.);
c = b*(par[5]-2.)/2.;
d = c*(par[5]-3.)/3.;
Then, for example, the longest expression becomes:
e = par[4]*Tterm;
value += a*(((d*e + c)*e + b)*e + 1.);
And simplify the rest.
If the expressions are curve-fitting approximations, why not do the same with
value += -1./(*par)*exp(-1.*Tterm/(*par));
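For example (a sketch; inv_p0 is a made-up name for a precomputed 1/par[0]):

const double inv_p0 = 1.0 / par[0];       // compute once per preset
value += -inv_p0 * exp(-Tterm * inv_p0);  // no divisions left in the hot path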
You should also ask whether all 477M iterations are needed.
If you want to explore batching / more optimization opportunities for fusing in computations that depend on these values, try using Halide.
I've rewritten your program in Halide here:
#include <Halide.h>
using namespace Halide;
class ParamReTtermApproxOpt : public Generator<ParamReTtermApproxOpt>
{
public:
Input<Buffer<float>> tterm{"tterm", 1};
Input<Buffer<float>> par{"par", 1};
Input<int> ncpu{"ncpu"};
Output<Buffer<float>> output{"output", 1};
Var x;
Func par_inv;
void generate() {
// precompute 1 / par[x]
par_inv(x) = fast_inverse(par(x));
// after che peak
Expr after_che_peak = tterm(x) > 0;
Expr first_term = -par_inv(0) * fast_exp(-tterm(x) * par_inv(0));
Expr second_term = par(2) * par(1) * fast_pow(1 + par(1) * tterm(x), par(2) - 1);
// before che peak
Expr third_term = -par_inv(3) * fast_exp(-tterm(x) * par_inv(3));
Expr fourth_term = par(5) * par(4) * fast_pow(1 + par(4) * tterm(x), par(5) - 1);
// final value
output(x) = 1.e9f * select(after_che_peak, first_term + second_term,
third_term + fourth_term);
}
void schedule() {
par_inv.bound(x, 0, 6);
par_inv.compute_root();
Var xo, xi;
// break x into two loops, one for ncpu tasks
output.split(x, xo, xi, output.extent() / ncpu)
// mark the task loop parallel
.parallel(xo)
// vectorize each thread's computation for 8-wide vector lanes
.vectorize(xi, 8);
output.print_loop_nest();
}
};
HALIDE_REGISTER_GENERATOR(ParamReTtermApproxOpt, param_re_tterm_approx_opt)
I can run 477,000,000 iterations in slightly over one second on my Surface Book (with ncpu=4). Batching is hugely important here since it enables vectorization.
Note that the equivalent program written using double arithmetic is much slower (20x) than float arithmetic. Halide doesn't supply fast_ versions for doubles, though, so this might not be quite apples-to-apples. Regardless, I would check whether you need the extra precision.
Editor's clarification: When this was originally posted, there were two issues:
Test performance drops by a factor of three if seemingly inconsequential statement added
Time taken to complete the test appears to vary randomly
The second issue has been solved: the randomness only occurs when running under the debugger.
The remainder of this question should be understood as being about the first bullet point above, and in the context of running in VC++ 2010 Express's Release Mode with optimizations "Maximize Speed" and "favor fast code".
There are still some comments in the comment section talking about the second point, but they can now be disregarded.
I have a simulation where, if I add a simple if statement into the while loop that runs the actual simulation, the performance drops by about a factor of three (and I run a lot of calculations in the while loop, n-body gravity for the solar system among other things), even though the if statement is almost never executed:
if (time - cb_last_orbital_update > 5000000)
{
cb_last_orbital_update = time;
}
with time and cb_last_orbital_update both of type double and defined at the beginning of the main function, where this if statement is too. Usually there are computations I want to run there as well, but it makes no difference if I delete them. The if statement as it is above has the same effect on the performance.
The variable time is the simulation time; it increases in 0.001 steps in the beginning, so it takes a really long time until the if statement is executed for the first time (I also included printing a message to see if it is being executed, but it is not, or at least only when it's supposed to). Regardless, the performance drops by a factor of 3 even in the first minutes of the simulation, when it hasn't been executed once yet. If I comment out the line
cb_last_orbital_update = time;
then it runs faster again, so it's not the check for
time - cb_last_orbital_update > 5000000
either, it's definitely the simple act of writing current simulation time into this variable.
Also, if I write the current time into another variable instead of cb_last_orbital_update, the performance does not drop. So this might be an issue with assigning a new value to a variable that is used to check if the "if" should be executed? These are all shots in the dark though.
Disclaimer: I am pretty new to programming, and sorry for all that text.
I am using Visual C++ 2010 Express, deactivating the stdafx.h precompiled header function didn't make a difference either.
EDIT: Basic structure of the program. Note that nowhere besides at the end of the while loop (time += time_interval;) is time changed. Also, cb_last_orbital_update has only 3 occurrences: Declaration / initialization, plus the two times in the if statement that is causing the problem.
int main(void)
{
...
double time = 0;
double time_interval = 0.001;
double cb_last_orbital_update = 0;
F_Rocket_Preset(time, time_interval, ...);
while(conditions)
{
Rocket[active].Stage[Rocket[active].r_stage].F_Update_Stage_Performance(time, time_interval, ...);
Rocket[active].F_Calculate_Aerodynamic_Variables(time);
Rocket[active].F_Calculate_Gravitational_Forces(cb_mu, cb_pos_d, time);
Rocket[active].F_Update_Rotation(time, time_interval, ...);
Rocket[active].F_Update_Position_Velocity(time_interval, time, ...);
Rocket[active].F_Calculate_Orbital_Elements(cb_mu);
F_Update_Celestial_Bodies(time, time_interval, ...);
if (time - cb_last_orbital_update > 5000000.0)
{
cb_last_orbital_update = time;
}
Rocket[active].F_Check_Apoapsis(time, time_interval);
Rocket[active].F_Status_Check(time, ...);
Rocket[active].F_Update_Mass (time_interval, time);
Rocket[active].F_Staging_Check (time, time_interval);
time += time_interval;
if (time > 3.1536E8)
{
std::cout << "\n\nBreak main loop! Sim Time: " << time << std::endl;
break;
}
}
...
}
EDIT 2:
Here is the difference in the assembly code. On the left is the fast code with the line
cb_last_orbital_update = time;
commented out, and on the right the slow code with the line.
EDIT 4:
So, I found a workaround that seems to work just fine so far:
int cb_orbit_update_counter = 1; // before while loop
if(time - cb_orbit_update_counter * 5E6 > 0)
{
cb_orbit_update_counter++;
}
EDIT 5:
While that workaround does work, it only works in combination with using __declspec(noinline). I just removed those from the function declarations again to see if that changes anything, and it does.
EDIT 6: Sorry, this is getting confusing. I tracked down the culprit for the lower performance when removing __declspec(noinline) to this function, which is executed inside the if:
__declspec(noinline) std::string F_Get_Body_Name(int r_body)
{
    switch (r_body)
    {
    case 0:  return ("the Sun");
    case 1:  return ("Mercury");
    case 2:  return ("Venus");
    case 3:  return ("Earth");
    case 4:  return ("Mars");
    case 5:  return ("Jupiter");
    case 6:  return ("Saturn");
    case 7:  return ("Uranus");
    case 8:  return ("Neptune");
    case 9:  return ("Pluto");
    case 10: return ("Ceres");
    case 11: return ("the Moon");
    default: return ("unnamed body");
    }
}
The if also now does more than just increase the counter:
if(time - cb_orbit_update_counter * 1E7 > 0)
{
F_Update_Orbital_Elements_Of_Celestial_Bodies(args);
std::cout << F_Get_Body_Name(3) << " SMA: " << cb_sma[3] << "\tPos Earth: " << cb_pos_d[3][0] << " / " << cb_pos_d[3][1] << " / " << cb_pos_d[3][2] <<
"\tAlt: " << sqrt(pow(cb_pos_d[3][0] - cb_pos_d[0][0],2) + pow(cb_pos_d[3][1] - cb_pos_d[0][1],2) + pow(cb_pos_d[3][2] - cb_pos_d[0][2],2)) << std::endl;
std::cout << "Time: " << time << "\tcb_o_h[3]: " << cb_o_h[3] << std::endl;
cb_orbit_update_counter++;
}
If I remove __declspec(noinline) from the function F_Get_Body_Name alone, the code gets slower. Similarly, if I remove the execution of this function or add __declspec(noinline) again, the code runs faster. All other functions still have __declspec(noinline).
EDIT 7:
So I changed the switch function to
const std::string cb_names[] = {"the Sun","Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune","Pluto","Ceres","the Moon","unnamed body"}; // global definition
const int cb_number = 12; // global definition

std::string F_Get_Body_Name(int r_body)
{
    if (r_body >= 0 && r_body < cb_number)
    {
        return (cb_names[r_body]);
    }
    else
    {
        return (cb_names[cb_number]);
    }
}
and also made another part of the code slimmer. The program now runs fast without any __declspec(noinline). So, as ElderBug suggested, was it an issue with the CPU instruction cache / the code getting too big?
I'd put my money on Intel's branch predictor. http://en.wikipedia.org/wiki/Branch_predictor
The processor assumes (time - cb_last_orbital_update > 5000000) to be false most of the time and loads up the execution pipeline accordingly.
Once the condition (time - cb_last_orbital_update > 5000000) comes true, the misprediction delay hits you. You may lose 10 to 20 cycles.
if (time - cb_last_orbital_update > 5000000)
{
cb_last_orbital_update = time;
}
Something is happening that you don't expect.
One candidate is some uninitialised variables hanging around somewhere, which have different values depending on the exact code that you are running. For example, you might have uninitialised memory that is sometimes a denormalised floating point number, and sometimes it's not.
I think it should be clear that your code doesn't do what you expect it to do. So try debugging your code, compile with all warnings enabled, make sure you use the same compiler options (optimised vs. non-optimised can easily be a factor 10). Check that you get the same results.
Especially when you say "it runs faster again (this doesn't always work though, but i can't see a pattern). Also worked with changing 5000000 to 5E6 once. It only runs fast once though, recompiling causes the performance to drop again without changing anything. One time it ran slower only after recompiling twice." it looks quite likely that you are using different compiler options.
I will try another guess. This is hypothetical, and would be mostly due to the compiler.
My guess is that you use a lot of floating point calculations, and the introduction and use of double values in your main makes the compiler run out of XMM registers (the floating point SSE registers). This forces the compiler to use memory instead of registers, and induces a lot of swapping between memory and registers, thus greatly reducing the performance. This would be happening mainly because of the inlining of the computation functions, because function calls preserve registers.
The solution would be to add __declspec(noinline) to ALL your computation functions declarations.
I suggest using the Microsoft Profile Guided Optimizer -- if the compiler is making the wrong assumption for this particular branch it will help, and it will in all likelihood improve speed for the rest of the code as well.
Workaround, try 2:
The code is now looking like this:
int cb_orbit_update_counter = 1; // before while loop
if(time - cb_orbit_update_counter * 5E6 > 0)
{
cb_orbit_update_counter++;
}
So far it runs fast, plus the code is being executed when it should as far as I can tell. Again only a workaround, but if this proves to work all around then I'm satisfied.
After some more testing, seems good.
My guess is that this is because the variable cb_last_orbital_update is otherwise read-only, so when you assign to it inside the if, it destroys some optimizations that the compiler has for read-only variables (e.g. perhaps it's now stored in memory instead of a register).
Something to try (although this might still not work) is to make a third variable that is initialized via cb_last_orbital_update and time depending on whether the condition is true, and using that one instead. Presumably, the compiler would now treat that variable as a constant, but I'm not sure.
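Something like this, for instance (purely illustrative; last_update is a made-up name):

// the original variable stays read-only; only the new one is written
double last_update = (time - cb_last_orbital_update > 5000000.0)
                         ? time
                         : cb_last_orbital_update;
// ...then refer to last_update instead of cb_last_orbital_update from here on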
In a function that updates all particles I have the following code:
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= _decayRate * deltaTime;
    }
}
This decreases the lifetime of the particle based on the time that passed.
It gets recalculated every iteration, so if I have 10000 particles that isn't very efficient, because it doesn't need to be (the value doesn't change anyway).
So I came up with this:
float lifeMin = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
// check if active
if (_particles[i].lifeTime > 0.0f)
{
_particles[i].lifeTime -= lifeMin;
}
}
This calculates it once and stores it in a variable that is read every iteration, so the CPU doesn't have to calculate it every loop, which would theoretically increase performance.
Would it run faster than the old code? Or does the release compiler do optimizations like this?
I wrote a program that compares both methods:
#include <time.h>
#include <iostream>

const unsigned int MAX = 1000000000;

int main()
{
    float deltaTime = 20;
    float decayRate = 200;
    float foo = 2041.234f;

    unsigned int start = clock();
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= decayRate * deltaTime;
    }
    std::cout << "Method 1 took " << clock() - start << "ms\n";

    start = clock();
    float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= calced;
    }
    std::cout << "Method 2 took " << clock() - start << "ms\n";

    int n;
    std::cin >> n;
    return 0;
}
Result in debug mode:
Method 1 took 2470ms
Method 2 took 2410ms
Result in release mode:
Method 1 took 0ms
Method 2 took 0ms
But that doesn't really work as a measurement. I know it doesn't do exactly the same thing, but it gives an idea.
In debug mode, they take roughly the same time. Sometimes Method 1 is faster than Method 2 (especially at smaller iteration counts), sometimes Method 2 is faster.
In release mode, it takes 0 ms. A little weird.
I tried measuring it in the game itself, but there aren't enough particles to get a clear result.
EDIT
I tried to disable optimizations, and let the variables be user inputs using std::cin.
Here are the results:
Method 1 took 2430ms
Method 2 took 2410ms
It will almost certainly make no difference whatsoever, at least if you compile with optimization (and of course, if you're concerned with performance, you are compiling with optimization). The optimization in question is called loop invariant code motion, and is universally implemented (and has been for about 40 years).
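In effect the optimizer already turns the first form into the second (illustrative only, using the loop from the question):

// what you wrote
for (int i = 0; i < _maxParticles; i++)
    if (_particles[i].lifeTime > 0.0f)
        _particles[i].lifeTime -= _decayRate * deltaTime;

// what the compiler effectively generates after hoisting the invariant product
const float dec = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
    if (_particles[i].lifeTime > 0.0f)
        _particles[i].lifeTime -= dec;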
On the other hand, it may make sense to use the separate variable anyway, to make the code clearer. This depends on the application, but in many cases, giving a name to the result of an expression can make code clearer. (In other cases, of course, throwing in a lot of extra variables can make it less clear. It all depends on the application.)
In any case, for such things, write the code as clearly as possible first, and then, if (and only if) there is a performance problem, profile to see where it is, and fix that.
EDIT:
Just to be perfectly clear: I'm talking about this sort of code optimization in general. In the exact case you show, since you don't use foo, the compiler will probably remove it (and the loops) completely.
In theory, yes. But your loop is extremely simple and thus likely to be heavily optimized.
Try the -O0 option to disable all compiler optimizations.
The release runtime might be caused by the compiler statically computing the result.
I am pretty confident that any decent compiler will replace your loops with the following code:
foo -= MAX * decayRate * deltaTime;
and
foo -= MAX * calced ;
You can make MAX depend on some kind of input (e.g. a command line parameter) to avoid that.
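For instance, a sketch along those lines (it also prints foo, so the loop's result is actually used):

#include <iostream>

int main()
{
    unsigned int max;
    std::cin >> max;                      // iteration count unknown at compile time

    float deltaTime = 20;
    float decayRate = 200;
    float foo = 2041.234f;

    const float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < max; i++)
        foo -= calced;

    std::cout << foo << "\n";             // using foo keeps the loop from being discarded
    return 0;
}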
I'm writing a class on Windows using Visual Studio; one of its public functions has a big for loop that looks like the one below:
void brain_network_opencl::block_filter_fcd_all(int m)
{
    const int m_block_len = m * block_len;
    time_t start, end;
    start = clock();
    for (int j = 0; j < shift_2d_gpu[1]; j++)       // local work size / number of rows per block
    {
        for (int i = 0; i < masksize; i++)          // number of extracted voxels
        {
            if (j + m_block_len != i)
            {
                //if (floor(dst_ptr_gpu[i + j * masksize] * power_up) > threadhold_fcd)
                if ((int)(dst_ptr_gpu[i + j * masksize] * power_up) > threadhold_fcd)
                {
                    org_row = mask_ind[j + m_block_len];
                    org_col = mask_ind[i];
                    nodes.insert(org_row);
                    conns.insert(make_pair(org_row, org_col));
                }
            }
        }
    }
    end = clock();
    cout << end - start << "ms" << " for block " << m << endl;
}
where nodes is std::set<int>, conns is std::multimap<int, int> and mask_ind is std::vector<int>; they are declared as private member variables, as are masksize and shift_2d_gpu.
Most of the time is spent in floor and .insert.
The problem is that the same code (with all the variables) placed in a main function costs only about 1/5 to 1x the time it takes when called from here. And if I replace (int) with floor in both the function and main(), it costs much more in this function.
What causes this problem and do I have to write it all inside a main()?
By the way, does it have something to do with the overloads?
floor shows +3 overloads and .insert shows +5 overloads
updates
I copied the code of this function into the main function of another new console project.
It's still much slower than my first version (whose code is also in main)!!!
Now I'm confused...
Are there any settings that make floor and .insert faster?
updates 2014/03/31
It's because of the settings in Project Properties -> Configuration Properties -> C/C++ -> General -> Debug Information Format: this value is set to Program Database for Edit And Continue (/ZI) by default, and it is incompatible with a lot of optimizations according to MSDN. If this value is set to Program Database (/Zi), the time cost of floor is no longer 10 times that of (int).
(I looked into the disassembly and found that the generated code (call floor -> jmp floor -> different code) differs when the setting is altered; that is why floor and .insert spent much more time than they should.)
As Gassa has pointed out, to optimize the tight loop use a custom floor function.
set<int> isn't cache friendly, but to replace it with a cache-friendly structure you might need to alter the algorithm. Still, unordered_set<int>, with a decent space reserved to it, should be a bit better, having less cache misses per insert than a binary tree.
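For instance (a sketch; expected_nodes is a hypothetical size estimate you would have to supply):

#include <unordered_set>

const std::size_t expected_nodes = 1u << 20;  // hypothetical estimate of the insert count
std::unordered_set<int> nodes;                // instead of std::set<int>
nodes.reserve(expected_nodes);
// ...
nodes.insert(org_row);                        // average O(1), fewer cache misses per insert than a tree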
P.S. Non-virtual overloads in C++ are resolved at compile time and have no effect on performance
Okay, so I was bored and wondered how fast the math.h square root was in comparison to the one with the magic number in it (made famous by Quake but made by SGI).
But this has ended up in a world of hurt for me.
I first tried this on the Mac, where the math.h version would win hands down every time, then on Windows, where the magic number always won, but I think this is all down to my own noobness.
Compiling on the Mac with "g++ -o sq_root sq_root_test.cpp", the program takes about 15 seconds to complete. But compiling in VS2005 in Release it takes a split second (in fact I had to compile in Debug just to get it to show some numbers).
My poor man's benchmarking: is this really stupid? Because I get 0.01 for math.h and 0 for the magic number (it can't be that fast, can it?).
I don't know if this matters but the Mac is Intel and the PC is AMD. Is the Mac using hardware for math.h sqroot?
I got the fast square root algorithm from http://en.wikipedia.org/wiki/Fast_inverse_square_root
//sq_root_test.cpp
#include <iostream>
#include <math.h>
#include <ctime>

float invSqrt(float x)
{
    union {
        float f;
        int i;
    } tmp;
    tmp.f = x;
    tmp.i = 0x5f3759df - (tmp.i >> 1);
    float y = tmp.f;
    return y * (1.5f - 0.5f * x * y * y);
}

int main() {
    std::clock_t start;// = std::clock();
    std::clock_t end;
    float rootMe;
    int iterations = 999999999;

    // ---
    rootMe = 2.0f;
    start = std::clock();
    std::cout << "Math.h SqRoot: ";
    for (int m = 0; m < iterations; m++) {
        (float)(1.0/sqrt(rootMe));
        rootMe++;
    }
    end = std::clock();
    std::cout << (difftime(end, start)) << std::endl;

    // ---
    std::cout << "Quake SqRoot: ";
    rootMe = 2.0f;
    start = std::clock();
    for (int q = 0; q < iterations; q++) {
        invSqrt(rootMe);
        rootMe++;
    }
    end = std::clock();
    std::cout << (difftime(end, start)) << std::endl;
}
There are several problems with your benchmarks. First, your benchmark includes a potentially expensive cast from int to float. If you want to know what a square root costs, you should benchmark square roots, not datatype conversions.
Second, your entire benchmark can be (and is) optimized out by the compiler because it has no observable side effects. You don't use the returned value (or store it in a volatile memory location), so the compiler sees that it can skip the whole thing.
A clue here is that you had to disable optimizations. That means your benchmarking code is broken. Never ever disable optimizations when benchmarking. You want to know which version runs fastest, so you should test it under the conditions it'd actually be used under. If you were to use square roots in performance-sensitive code, you'd enable optimizations, so how it behaves without optimizations is completely irrelevant.
Also, you're not benchmarking the cost of computing a square root, but of the inverse square root.
If you want to know which way of computing the square root is fastest, you have to move the 1.0/... division down to the Quake version. (And since division is a pretty expensive operation, this might make a big difference in your results)
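A fixed-up comparison might look roughly like this (a sketch; it reuses iterations and invSqrt from the question, keeps results in a volatile sink so nothing is optimized away, and omits the timing calls):

#include <cmath>

volatile float sink = 0.0f;               // forces every result to be kept

// library square root
float x = 2.0f;
for (int m = 0; m < iterations; m++) {
    sink = sink + std::sqrt(x);
    x += 1.0f;
}

// Quake trick turned into a square root by moving the division here
x = 2.0f;
for (int q = 0; q < iterations; q++) {
    sink = sink + 1.0f / invSqrt(x);
    x += 1.0f;
}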
Finally, it might be worth pointing out that Carmack's little trick was designed to be fast on 12 year old computers. Once you fix your benchmark, you'll probably find that it's no longer an optimization, because today's CPUs are much faster at computing "real" square roots.