Is looping faster than traversing one by one - C++

Let us consider the following code snippet in C++ to print the first 10 positive integers:
for (int i = 1; i < 11; i++)
{
    cout << i;
}
Will this be faster or slower than sequentially printing each integer one by one, as follows:
int x = 1;
cout << x;
x++;
cout << x;
And so on...
Is there any reason why it should be faster or slower? Does it vary from one language to another?

This question is similar to this one; I've copied an excerpt of my answer to that question below. (The numbers are different, 11 vs. 50, but the analysis is the same.)
What you're considering doing is a manual form of loop unrolling. Loop unrolling is an optimization that compilers sometimes use for reducing the overhead involved in a loop. Compilers can do it only if the number of iterations of the loop can be known at compile time (i.e. the number of iterations is a constant, even if the constant involves computation based on other constants).
In some cases, the compiler may determine that it is worthwhile to unroll the loop, but often it won't unroll it completely. For instance, in your example, the compiler may determine that it would be a speed advantage to unroll the loop from 50 iterations down to only 10 iterations with 5 copies of the loop body. The loop variable would still be there, but instead of doing 50 comparisons of the loop counter, now the code only has to do the comparison 10 times.
It's a tradeoff, because the 5 copies of the loop body eat up 5 times as much space in the cache, which means that loading those extra copies of the same instructions forces the cache to evict (throw out) that many instructions that are already in the cache and which you might have wanted to stay in the cache. Also, loading those 4 extra copies of the loop body instructions from main memory takes much, much longer than simply grabbing the already-loaded instructions from the cache in the case where the loop isn't unrolled at all.
So all in all, it's often more advantageous to just use only one copy of the loop body and go ahead and leave the loop logic in place. (I.e. don't do any loop unrolling at all.)
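To make that concrete, here is a minimal sketch of what such partial unrolling might look like at the source level (doSomething and the factor of 5 are just illustrative):
// Original: 50 iterations, 50 compare-and-branch operations.
for (int i = 0; i < 50; i++)
{
    doSomething(i);
}
// Partially unrolled by 5: only 10 compare-and-branch operations,
// but 5 copies of the loop body taking up instruction-cache space.
for (int i = 0; i < 50; i += 5)
{
    doSomething(i);
    doSomething(i + 1);
    doSomething(i + 2);
    doSomething(i + 3);
    doSomething(i + 4);
}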

In a loop, the actual machine-level instructions are the same, and therefore live at the same addresses. With explicit repeated statements, the instructions have different addresses. So it is possible that for the loop, the CPU's instruction cache will provide a performance boost that might not happen in the latter case.
For a really small range (10) the difference will most likely be negligible. For a loop of significant length it could show up more clearly.

Related

OpenMP first kernel much slower than the second kernel

I have a huge 98306-by-98306 2D array. I created a kernel function that counts the total number of elements below a certain threshold:
#pragma omp parallel for reduction(+:num_below_threshold)
for (int row = 0; row < 98306; row++) {
    for (int col = 0; col < 98306; col++) {
        size_t index = get_corresponding_index(row, col);
        if (array[index] < threshold)
            num_below_threshold++;
    }
}
For benchmarking purposes I measured the execution time of the kernel with the number of threads set to 1. I noticed that the first time the kernel executes, it takes around 11 seconds. The next call to the kernel, executing on the same array with one thread, only takes around 3 seconds. I thought it might be a problem related to the cache, but it doesn't seem to be. What are the possible reasons for this?
This array is initialized as:
float *array = malloc(sizeof(float) * 98306 * 98306);
for (int i = 0; i < 98306 * 98306; i++) {
    array[i] = rand() % 10;
}
The same kernel is applied to this array twice, and the second execution is much faster than the first. I thought of lazy allocation on Linux, but that shouldn't be a problem because of the initialization loop. Any explanations will be helpful. Thanks!
Since you don't provide any Minimal, Complete and Verifiable Example, I'll have to make some wild guesses here, but I'm pretty confident I have the gist of the issue.
First, you have to notice that 98,306 x 98,306 is 9,664,069,636, which is way larger than the maximum value a signed 32-bit integer can store (2,147,483,647). Therefore, the upper limit of your initialization for loop, after overflowing, could become 1,074,135,044 (as on my machines, although it is undefined behavior, so strictly speaking anything could happen), which is roughly 9 times smaller than what you expected.
So now, after the initialization loop, only 11% of the memory you thought you allocated has actually been allocated and touched by the operating system. However, your first reduction loop does a good job of going over the various elements of the array, and since for about 89% of them it's the first time, the OS does the actual memory allocation there and then, which takes a significant amount of time.
And now, for your second reduction loop, all memory has been properly allocated and touched, which makes it much faster.
So that's what I believe happened. That said, many other parameters can come into play here, such as:
Swapping: the array you try to allocate represents about 36GB of memory. If your machine doesn't have that much memory available, then your code might swap, which will potentially make a big mess of whatever performance measurement you can come up with
NUMA effect: if your machine has multiple NUMA nodes, then thread pinning and memory affinity, when not managed properly, can have a large impact on performance between loop occurrences
Compiler optimization: you didn't mention which compiler you used and which level of optimization you requested. Depending on that, you'd be amazed at how much your code could be shortened. For example, the compiler could remove the second loop entirely, since it does the same thing as the first and the result will be the same... and many other interesting and unexpected things which render your benchmark meaningless.
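As a side note, a minimal sketch of the fix for the overflow (keeping the dimensions as they are) is to do the element-count arithmetic in a 64-bit type such as size_t:
// Compute the element count in size_t so that 98306 * 98306 does not
// overflow a 32-bit int.
size_t num_elements = (size_t)98306 * 98306;
float *array = (float *)malloc(sizeof(float) * num_elements);
for (size_t i = 0; i < num_elements; i++) {
    array[i] = rand() % 10;
}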

Should I use loops for small iterations? [duplicate]

I've been trying to optimize some extremely performance-critical code (a quicksort algorithm that's being called millions and millions of times inside a Monte Carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while(true) {
    if(myArray[++index1] < pivot) break;
    if(myArray[++index1] < pivot) break;
    // More unrolling
}
while(true) {
    if(pivot < myArray[--index2]) break;
    if(pivot < myArray[--index2]) break;
    // More unrolling
}
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?
Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i=0; i<n; i++)
{
    sum += data[i];
}
Here each link in the dependency chain is very short, but there is only one chain: each addition depends on the previous value of sum. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.
On the other hand this code:
for (int i=0; i<n-3; i+=4) // note the n-3 bound for starting i + 0..3
{
    sum1 += data[i+0];
    sum2 += data[i+1];
    sum3 += data[i+2];
    sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n%4 != 0, handle the final 0..3 elements with a rolled-up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stall. An out-of-order CPU can execute these in parallel.
(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register renaming helps CPUs find that parallelism, and at the details of an FP dot product on modern x86-64 CPUs with their throughput vs. latency characteristics for pipelined floating-point SIMD FMA ALUs. Hiding the latency of FP addition or FMA is a major benefit of multiple accumulators, since FP latencies are longer than integer ones while SIMD throughput is often similar.)
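For completeness, here is a minimal sketch of the tail handling mentioned in the comment above, for when n is not a multiple of 4:
int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
int i;
for (i = 0; i < n - 3; i += 4)
{
    sum1 += data[i+0];
    sum2 += data[i+1];
    sum3 += data[i+2];
    sum4 += data[i+3];
}
int sum = sum1 + sum2 + sum3 + sum4;
for (; i < n; i++)   // rolled-up loop for the final 0..3 elements
    sum += data[i];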
Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:
for (int i=0; i<200; i++) {
    doStuff();
}
write:
for (int i=0; i<50; i++) {
    doStuff();
    doStuff();
    doStuff();
    doStuff();
}
Even then it almost certainly won't matter, but now you are doing 50 comparisons instead of 200 (imagine the comparison is more complex).
Manual loop unrolling, however, is largely an artifact of history. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it for you to whatever is best.
Basically there's increasingly less need to second-guess your compiler.
Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.
It would be worthwhile finding out how much optimization your compiler does for you.
I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.
As far as I understand it, modern compilers already unroll loops where appropriate. An example is gcc: according to the manual, when passed the optimisation flag -funroll-loops, it will:
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
So, in practice it's likely that your compiler will do the trivial cases for you. It's up to you, therefore, to make sure that as many of your loops as possible are easy for the compiler to determine the iteration count of.
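A minimal sketch of the distinction the manual is drawing (in, out and n are assumed to be in scope):
// Trip count is a compile-time constant: trivial to unroll.
for (int i = 0; i < 16; i++)
    out[i] = in[i] * 2;
// Trip count is known upon entry to the loop: still unrollable.
for (int i = 0; i < n; i++)
    out[i] = in[i] * 2;
// Trip count is data-dependent and unknown even on entry: hard to unroll.
for (int i = 0; in[i] != 0; i++)
    out[i] = in[i] * 2;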
Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.
Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?
All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.
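A hedged sketch of that ratio (expensiveWork is a hypothetical function):
// Overhead-dominated: one add versus an increment, a compare and a jump
// per iteration, so unrolling can shave a noticeable percentage.
for (int i = 0; i < n; i++)
    total += data[i];
// Body-dominated: the call dwarfs the loop overhead, so unrolling
// buys almost nothing percentage-wise.
for (int i = 0; i < n; i++)
    total += expensiveWork(data[i]);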
Here's an example of how to get maximum performance.
Loop unrolling can be helpful in specific cases. The only gain isn't skipping some tests!
It can, for instance, allow scalar replacement and the efficient insertion of software prefetching. You would actually be surprised how useful aggressive unrolling can be (you can easily get a 10% speedup on most loops, even with -O3).
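For instance, here is a hedged sketch of scalar replacement enabled by unrolling (data, out and n >= 1 are assumed):
// Rolled form: data[i + 1] is reloaded as data[i] on the next iteration.
for (int i = 0; i + 1 < n; i++)
    out[i] = data[i] + data[i + 1];
// Unrolled by two with scalar replacement: each element is loaded from
// memory once and then reused from a register.
int i = 0;
float prev = data[0];
for (; i + 2 < n; i += 2)
{
    float a = data[i + 1];
    float b = data[i + 2];
    out[i]     = prev + a;   // reuses prev instead of reloading data[i]
    out[i + 1] = a + b;
    prev = b;                // carry data[i + 2] into the next iteration
}
for (; i + 1 < n; i++)       // rolled-up remainder
    out[i] = data[i] + data[i + 1];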
As was said before, though, it depends a lot on the loop and the compiler, and experimenting is necessary. It's hard to make a rule (otherwise the compiler's heuristic for unrolling would be perfect).
Loop unrolling depends entirely on your problem size, and on your algorithm being able to reduce the work into smaller groups. What you did above does not look like that. I am not sure if a Monte Carlo simulation can even be unrolled.
A good scenario for loop unrolling would be rotating an image, since there you can process separate groups of work. To get this to work you would have to reduce the number of iterations.
Loop unrolling is still useful if there are a lot of local variables both in and around the loop, to make more use of the registers instead of reserving one for the loop index.
In your example, you use a small number of local variables, so you are not overusing the registers.
Comparison (against the loop end) is also a major drawback if the comparison is heavy (i.e. not a simple test instruction), especially if it depends on an external function.
Loop unrolling also helps the CPU's branch prediction, though the branches occur anyway.

Access speed: local variable vs. array

Given this example code:
struct myStruct1 { int one, two; } first;
struct myStruct2 { myStruct1 n; int c; } second[255];
// this is a member of a structure in an array of structures!
register int i = second[2].n.one;
const int u = second[3].n.one;
while (1)
{
    // do something with second[1].n.one
    // do something with i
    // do something with u
}
Which one is faster?
Is it correct that a local copy of an array index can be copied into a register?
Will it be even faster if the copy is done inside the loop?
Which one is faster?
The only way to know is to measure or profile. You can look at the assembly code to get a hint at which one is faster, but the truth is in the profiling.
Is it correct that a local copy of an array index can be copied into a register?
A register can hold many things. The use of registers is controlled by the compiler and the quantity of registers that the processor has available.
The compiler may put values into registers or place them on the stack. Eventually, values need to go into registers. Some processors have the ability to copy memory from one location to another without using registers. Whether or not the compiler uses these features depends on the compiler and the optimization level.
Will it be even faster if the copy is done inside the loop?
Unnecessary code in a loop slows down the loop. Compilers may factor out code that isn't changing inside the loop.
Some processors can contain all of the instructions for a loop in their instruction cache; others cannot. Again, all this depends on the processor and the compiler optimization settings.
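A minimal sketch of that factoring-out, reusing the question's names:
// The loop-invariant load of second[1].n.one can be hoisted (by the
// compiler or by hand) into a local, which will likely live in a register.
int local = second[1].n.one;
while (1)
{
    // do something with local instead of re-reading second[1].n.one
    // do something with i
    // do something with u
}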
Micro-optimizations
Your questions fall under the category of micro-optimizations. In general, this group of optimizations gains speed in terms of a few processor instructions. Unless you iterate more than 1.0E+09 times, the optimizations won't gain you significant savings. With today's processors, we're talking an average gain of 100 nanoseconds per instruction (or worst case 1 millisecond). Unless you have profiled, you don't want to waste your development effort on these optimizations.
Design, & Coding optimizations
Here is a list of optimizations that will gain more performance benefits than micro-optimizations:
Removing unwanted requirements.
Removing unused modules.
Sharing common modules.
Using efficient data structures.
Removing unnecessary work.
Performing tasks in the background.
Double buffering.
Input (blocks), Process (blocks), Output (blocks).
Reducing function calls and comparisons.
Reducing code by simplifying it using algebra or Karnaugh maps.

Which algorithm brings the best performance? [duplicate]

This question already has answers here:
Why are elementwise additions much faster in separate loops than in a combined loop?
What is the overhead in splitting a for-loop into multiple for-loops, if the total work inside is the same?
I have a piece of code that is really dirty and I want to optimize it a little bit. Does it make any difference if I use one of the following structures, or are they identical from a performance point of view in C++?
for(unsigned int i = 1; i < entity.size(); ++i) begin
    if
        if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
    if
        if ... else ...
for end
for(unsigned int i = 1; i < entity.size(); ++i) begin
    if
        if ... else ...
for end
....
or
for(unsigned int i = 1; i < entity.size(); ++i) begin
    if
        if ... else ...
    if
        if ... else ...
    if
        if ... else ...
    ....
for end
Thanks in Advance!
Both are O(n). As we do not know the guts of the various for loops it is impossible to say.
BTW - Mark it as pseudo code and not C++
The 2nd one may spend less time incrementing/testing i and conditionally branching (assuming the compiler's optimiser doesn't reduce the first to the equivalent of the second anyway), but given loop unrolling, the time taken for the i loop may be insignificant compared to the time spent within the loop body anyway.
Countering that, it's easily possible that the choice of separate versus combined loops will affect the ratio of cache hits, and that could significantly impact either version: it really depends on the code. For example, if each of the three if/else statements accessed different arrays at index i, then they'll be competing for CPU cache and could slow each other down. On the other hand, if they accessed the same array at index i, doing different steps in some calculation, then it's probably better to do all three steps while those memory pages are still in cache.
There are potential impacts other than caches - register allocation, the speed of I/O devices (e.g. if each loop operates on lines/records from a distinct file on different physical drives, it's very probably faster to process some of each file in a loop, rather than sequentially process each file), etc.
If you care, benchmark your actual application with representative data.
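To make the cache argument concrete, here is a hedged sketch (f, g, h and the arrays a, b, c are hypothetical):
// Separate loops: entity is streamed through the cache three times,
// and each pass competes with its own output array.
for (unsigned int i = 1; i < entity.size(); ++i) a[i] = f(entity[i]);
for (unsigned int i = 1; i < entity.size(); ++i) b[i] = g(entity[i]);
for (unsigned int i = 1; i < entity.size(); ++i) c[i] = h(entity[i]);
// Combined loop: entity[i] is still hot in cache for all three steps,
// but a, b and c now compete with each other for cache lines.
for (unsigned int i = 1; i < entity.size(); ++i)
{
    a[i] = f(entity[i]);
    b[i] = g(entity[i]);
    c[i] = h(entity[i]);
}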
Just from the structure of the loop it is not possible to say which approach will be faster.
Algorithmically, both have the same complexity, O(n). However, they might have different performance numbers depending on the kind of operation you are performing on the elements and the size of the container.
The size of the container may have an impact on locality and hence on performance. Generally speaking, you want to chew on the data as much as you can once you get it into the cache, so I would prefer the second approach. To get a clear picture you should actually measure the performance of both approaches.
The second is only slightly more efficient than the first. For each extra loop you save:
Initialization of the loop index
Calling size()
Comparing the loop index with size()
Incrementing the loop index
These are very minor optimizations. Do it if it doesn't impact readability.
I would expect the second approach to be at least marginally more optimal in most cases as it can leverage the locality of reference with respect to access to elements of the entity collection/set. Note that in the first approach, each for loop would need to start accessing elements from the beginning; depending on the size of the cache, the size of the list and the extent to which compiler can infer and optimize, this may lead to cache misses when a new for loop attempts to read an element even though that element would have been read already by a preceding loop.

Comparing forward and reverse loops for an int where one limit is 0

Consider the example with for loop:
for(int i = 0; i <= NUM; i++); // forward
for(int i = NUM; i >= 0; i--); // reverse
I tested these loops with gcc (linux-64). Without any optimization flags the forward loop was faster, and with optimization at O3/O4 the reverse loop was faster.
Somewhere I heard that due to better cache replacement techniques, the forward loop is faster.
Personally I think the reverse loop should be faster (whether NUM is a constant or a variable), because any microprocessor will have a single instruction for comparison with 0 (e.g. JLZ (jump if less than zero) or equivalent).
Is there any deterministic answer to this?
No, there is absolutely no deterministic answer for this. You're looking at two different levels of abstraction.
C++ has absolutely nothing to say about what happens under the covers, performance-wise. It specifies a virtual machine which executes C++ code and, while it covers functionality, it does not cover performance of the underlying environment (a).
Which of those is faster will depend on a variety of factors. You may find yourself running on a CPU which makes no distinction between comparing with an arbitrary value and comparing with zero.
You may find an architecture where incrementing a register is ten times faster than decrementing one, bizarre though that may seem.
You may even find a brain-dead architecture that has no decrement, add or subtract instructions at all, where you have to emulate decrement by calling increment 2^n - 1 times (where n is the word size in bits).
Bottom line: you can't presume to know what's going on under the hood unless you want to look at a very specific CPU, compiler, etc.
You should optimise your code for readability first. If you need to process things in an increasing manner, use the first option. If a decreasing manner, use the latter. If either way seems equally natural, then choose the fastest one, discovered by benchmarking or analysis of the underlying architecture and assembler code. But only do this if you have a specific performance problem, otherwise you're wasting effort.
In any case, since you're almost certainly going to be using i for something, it's likely that whatever tiny increase in performance you get by going the fastest way will be more than swamped by the fact that you now have to calculate NUM-i inside the loop (unless, of course, the compiler is smarter than the developer which, based on what I've seen from gcc, is quite possible).
(a) It does specify certain performance-related things such as the time complexity of some things in the containers library, but not specifically the thing you're asking about, whether forward loops or reverse ones are faster.
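To illustrate the NUM - i point, here is a minimal sketch (process stands in for a hypothetical loop body that needs an increasing index):
// Forward: the index is used directly.
for (int i = 0; i <= NUM; i++)
    process(i);
// Reverse: the compare-with-zero may be cheaper, but the body now pays
// for an extra subtraction on every iteration.
for (int i = NUM; i >= 0; i--)
    process(NUM - i);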
The cache replacement techniques only come into effect if there is a conflict. Perhaps NUM isn't big enough for it to have an effect, or perhaps the mapping of virtual to physical memory happens to be favorable for the cache replacement algorithm.
Trying to potentially save a single machine instruction is showing lack of trust for the compiler. If it was that easy, surely the optimizer would know that!
Maybe incrementing a loop variable is so much more common that CPU's branch prediction works better on those.
With compiler optimization, your loop might just be unrolled (given that I correctly assume your NUM is a #define constant) and therefore faster.
Although it doesn't really answer your question, here's a thought. How about this loop:
int i = NUM + 1;
while ( i --> 0 )   // parses as (i--) > 0: it looks as if i "goes to" zero (like in calculus)!
{
}