Performance difference in two almost identical loops - C++

I have two almost identical loops with a remarkable difference in performance. Both were tested with MSVC 2010, on a system with a ~2.4 GHz CPU and 8 GB RAM.
The loop below takes around 2500 ms to execute:
for (double count = 0; count < ((2.9*4/555+3/9)*109070123123.8); count++)
    ;
And this loop executes in less than 1 ms:
for (double count = ((2.9*4/555+3/9)*109070123123.8); count > 0; --count)
    ;
What is making such a huge difference here? One uses post-increment and the other pre-increment; can that cause such a huge difference?

You're compiling without optimizations, so the comparison is futile. (If you did have optimizations on, that code would just be cut out completely).
Without optimization, the computation is likely executed at each iteration in the first loop, whereas the second loop only does the computation once, when it first initializes count.
Try changing the first loop to
auto max = ((2.9*4/555+3/9)*109070123123.8);
for (double count = 0; count < max; count++)
;
and then stop profiling debug builds.

In the first loop, count < ((2.9*4/555+3/9)*109070123123.8) is computed every time round the loop, whereas in the second, ((2.9*4/555+3/9)*109070123123.8) is calculated once and count is merely decremented each time round the loop.

Why is this code considered optimized?

I am working on optimizing some code and came across this. Could someone tell me why this piece of code is considered more 'optimized'
for (i = 0; i < 1000; i += 2) {
    float var = numberOfEggs*arrayX[i] + arrayY[i];
    arrayY[i+1] = var;
    arrayY[i+2] = numberOfEggs*arrayX[i+1] + var;
}
than this version?
for (long i = 0; i < 1000; ++i)
    arrayY[i+1] = numberOfEggs*arrayX[i] + arrayY[i];
Any help is appreciated, thank you!
The first example is performing two assignments per iteration. You can tell by the increment statement.
This is called loop unrolling. By performing two assignments per iteration, you are removing half of the branches.
Most processors don't like branch instructions. The processor needs to determine whether or not to reload the instruction cache (branch prediction). There are at least two branches per iteration. The first is for the comparison, the second is to loop back to the comparison.
To experiment, try using 4 assignments per iteration, and profile.

sqrt time complexity comparison

Is this loop --
for(int i = 0 ; i < sqrt(number) ; i++)
{
//some operations
}
faster than this --
int length = sqrt(number);
for(int i = 0 ; i < length ; i++)
{
//some operations
}
I was getting TLE (Time Limit Exceeded) from an online judge, but when I replaced the sqrt in the loop with length, the solution was accepted.
Can you please point out the time complexity of the loop with sqrt, considering number to be <= 1000000000?
The time complexity itself isn't different between the two loops (unless the complexity of sqrt itself is dependent on the number) but what is different is how many times you're computing the square root.
Without optimisations like the compiler automatically moving loop-invariant code outside the loop (assuming that is even allowed in this case, since the compiler would have to check a lot of things to ensure they cannot affect the result or side effects of the sqrt call), the following code will calculate the square root about a thousand times (once per iteration):
number = 1000000;
for(int i = 0 ; i < sqrt(number) ; i++) { ... }
However, this code will only calculate it once:
number = 1000000;
root = sqrt(number);
for(int i = 0 ; i < root ; i++) { ... }
The problem is that the whole expression i < sqrt(number) must be evaluated repeatedly in the original code, while sqrt is evaluated only once in the modified code.
Well, recent compilers are usually able to optimize the loop so that sqrt is evaluated only once, before the loop, but do you want to rely on them?
The first version forces the compiler to generate code that executes sqrt(number) every time the condition is tested (as many times as the for is looped).
The second version only calculates the length once (single call to sqrt).

call a function and loops in parallel

I don't have any experience with OpenMP, so I want to know how to do the following:
for (int i = 1; i <= NumImages; i++) {
    // call a function
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        for (int l = 0; l < ElNum; l++) {
            // do 2 summing-up calculations inside a while loop
        } // end l loop
    } // end k loop
} // end i loop
Now, I have 40 cores at my disposal.
NumImages will be from 50 to 150, most often 150.
SumNumber will be around 200.
ElNum will be around 5000.
So, is the best way of dealing with this to assign every thread a function call and also execute the l loop in parallel?
And if so, will it be like:
#pragma omp parallel for num_threads(40)
for (int i = 1; i <= NumImages; i++) {
    myfunction(...);
    for (int k = 0; k < SumNumber; k++) {
        #pragma omp for
        for (int l = 0; l < ElNum; l++) {
And does the above mean (for NumImages = 150) that myfunction will be executed 40 times in parallel, together with the l loop, and then, when the l loop and the k loop finish, the next 40 threads will call the function again, and then the next 40, so 3*40 = 120, and then the remaining 30?
Generally the best way is the one that splits the work evenly, to maintain efficiency (no cores are left waiting). E.g., in your case static scheduling is probably not a good idea, because 40 does not divide 150 evenly; for the last batch of iterations you would lose 25% of the computing power. So it might turn out to be better to put the parallel clause before the second loop. It all depends on the mode you choose and on how the work is really distributed within the loops. E.g., if myfunction does 99% of the work, it's a bad idea; if 99% of the work is within the two inner loops, it might be good.
Not really. There are 3 scheduling modes, but none of them works in a way that blocks other threads. There is a pool of tasks (iterations) that is distributed among the threads. The scheduling mode describes the strategy of assigning tasks to threads. When one thread finishes, it just gets the next task, with no waiting. The strategies are described in more detail here: http://en.wikipedia.org/wiki/OpenMP#Scheduling_clauses (I am not sure a blatant copy-paste from the wiki is a good idea, so I'll leave a link. It's good material.)
Maybe what is not written there is that the modes are presented in order of the amount of overhead they introduce: static is fastest, then dynamic, then guided. My advice on when to use which (not the exact best, but a good rule of thumb IMO):
static if you know the tasks will be divided evenly among the threads and take the same amount of time
dynamic if you know the tasks will not be divided evenly or their execution times are uneven
guided for rather long tasks about which you pretty much cannot tell anything in advance
If your tasks are rather small, you can see overhead even for static scheduling (e.g. "Why is my OpenMP C++ code slower than a serial code?"), but I think in your case dynamic should be fine and is the best choice.

Why is my for loop of cilk_spawn doing better than my cilk_for loop?

I have
cilk_for (int i = 0; i < 100; i++)
x = fib(35);
the above takes 6.151 seconds
and
for (int i = 0; i < 100; i++)
x = cilk_spawn fib(35);
takes 5.703 seconds
The fib(x) is the horrible recursive Fibonacci number function. If I dial down the fib function cilk_for does better than cilk_spawn, but it seems to me that regardless of the time it takes to do fib(x) cilk_for should do better than cilk_spawn.
What don't I understand?
Per comments, the issue was a missing cilk_sync. I'll expand on that to point out exactly how the ratio of time can be predicted with surprising accuracy.
On a system with P hardware threads (typically 8 on an i7), the for/cilk_spawn code will execute as follows:
The initial thread will execute the iteration for i=0, and leave a continuation that is stolen by some other thread.
Each thief will steal an iteration and leave a continuation for the next iteration.
When each thief finishes an iteration, it goes back to step 2, unless there are no more iterations to steal.
Thus the threads will execute the loop hand-over-hand, and the loop exits at a point where P-1 threads are still working on iterations. So the loop can be expected to finish after evaluating only 100-(P-1) iterations' worth of serial time.
So for 8 hardware threads, the for/cilk_spawn with missing cilk_sync should take about 93/100 of the time for the cilk_for, quite close to the observed ratio of about 5.703/6.151 = 0.927.
In contrast, in a "child steal" system such as TBB or PPL task_group, the loop will race to completion, generating 100 tasks, and then keep going until a call to task_group::wait. In that case, forgetting the synchronization would have led to a much more dramatic ratio of times.

Benchmark code - dividing by the number of iterations or not?

I had an interesting discussion with my friend about benchmarking C/C++ code (or code in general). We wrote a simple function which uses getrusage to measure the CPU time taken by a given piece of code (i.e., how much CPU time it took to run a specific function). Let me give you an example:
const int iterations = 409600;
double s = measureCPU();
for( j = 0; j < iterations; j++ )
function(args);
double e = measureCPU();
std::cout << (e-s)/iterations << " s \n";
We argued: should we divide (e-s) by the number of iterations or not? I mean, when we don't divide, the result is in an acceptable form (e.g. 3.0 s), but when we do divide it, we get results like 2.34385e-07 s ...
So here are my questions:
should we divide (e-s) by the number of iterations, and if so, why?
how can we print 2.34385e-07 s in a more human-readable form? (let's say, it took 0.00000003 s)
should we first make one function call, and only after that measure the cpu time for the iterations, something like this:
// first function call, don't bother with it at all
function(args);
// real benchmarking
const int iterations = 409600;
double s = measureCPU();
for( j = 0; j < iterations; j++ )
function(args);
double e = measureCPU();
std::cout << (e-s)/iterations << " s \n";
If you divide the time by the number of iterations, you get an iteration-count-independent comparison of the run time of one function call; the more iterations, the more precise the result. EDIT: it's an average run time over n iterations.
You can multiply the divided time by 1e6 to get microseconds per iteration (I assume that measureCPU returns seconds):
std::cout << 1e6*(e-s)/iterations << " us \n";
As #ogni42 stated, you are getting overhead from the for loop in your measured time, so you could try unrolling the loop a bit to lower the measurement error: do 8 to 16 calls per iteration, and try different call counts to see how the measured time changes:
for( j = 0; j < iterations; j++ ) {
function(args);
function(args);
function(args);
function(args);
...
}
What you basically get is a lower-is-better number. If you wanted higher-is-better scoring, you could measure different variations of function and then take the time of the fastest one. That one would score 10 points:
score_for_actual_function = 10.0 * fastest_time / time_of_actual_function
This scoring is kind of time-independent, so you can compare different function variations directly, and a function can score less than one point... and beware of division by zero :)