In order to increase the performance of our applications, we have to consider loop optimisation techniques during the development phase.
I'd like to show you some different ways to iterate over a simple std::vector<uint32_t> v:
Unoptimized loop with index:
uint64_t sum = 0;
for (unsigned int i = 0; i < v.size(); i++)
sum += v[i];
Unoptimized loop with iterator:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it;
for (it = v.begin(); it != v.end(); it++)
sum += *it;
Cached std::vector::end iterator:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it, end(v.end());
for (it = v.begin(); it != end; it++)
sum += *it;
Pre-increment iterators:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it, end(v.end());
for (it = v.begin(); it != end; ++it)
sum += *it;
Range-based loop:
uint64_t sum = 0;
for (auto const &x : v)
sum += x;
There are also other ways to build a loop in C++, for instance using std::for_each, BOOST_FOREACH, etc.
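For completeness, a std::for_each version might look like this (just a sketch, assuming C++11 lambdas and the same vector v as above):
uint64_t sum = 0;
std::for_each(v.begin(), v.end(), [&sum](uint32_t x) { sum += x; });  // needs <algorithm>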
In your opinion, which is the best approach to increase the performance and why?
Furthermore, in performance-critical applications it could be useful to unroll the loops: again, which approach would you suggest?
There's no hard and fast rule, since it depends on the
implementation. If the measures I did some years back are
typical, however: about the only thing which makes a difference
is caching the end iterator. Pre- or post-fix makes no
difference, regardless of the container and iterator type.
At the time, I didn't measure indexing (because I was comparing
iterators of different types of container as well, and not all
support indexing). But I would guess that if you use indexes,
you should cache the results of v.size() as well.
Of course, these measures were for one compiler (g++) on one
system, with a specific hardware. The only way you can know for
your environment is to measure yourself.
RE your note: are you sure you have full optimization turned on? My measures showed no difference between 3 and 4, and I doubt that compilers optimize less today.
It's very important for the optimizations here that the
functions are actually inlined. If they're not,
post-incrementation does require some extra copying, and
typically will require an extra function call (to the copy
constructor of the iterator) as well. Once the functions are
inlined, however, the compiler can easily see that all this is
unessential, and (at least when I tried it) generate exactly
the same code in both cases. (I'd use pre-incrementation
anyway. Not because it makes a difference, but because if you
don't, some idiots will come along claiming it will, despite
your measures. Or maybe they're not idiots, but are just using
a particularly stupid compiler.)
To tell the truth, when I did the measurements, I was surprised
that caching the end iterator made a difference, even for
vector, whereas there was no difference between pre- and
post-incrementation, even for a reverse iterator into a map.
After all, end() was inlined as well; in fact, every single
function used in my tests was inlined.
As to unrolling the loops: I'd probably do something like this:
std::vector<uint32_t>::const_iterator current = v.begin();
std::vector<uint32_t>::const_iterator end = v.end();
switch ( (end - current) % 4 ) {
case 3:
    sum += *current++;  // fall through
case 2:
    sum += *current++;  // fall through
case 1:
    sum += *current++;  // fall through
case 0:
    break;
}
while ( current != end ) {
    sum += current[0] + current[1] + current[2] + current[3];
    current += 4;
}
(This is a factor of 4. You can easily increase it if
necessary.)
I'm going on the assumption that you are well aware of the evils of premature micro-optimization, and that you have identified hotspots in your code by profiling and all the rest. I'm also going on the assumption that you're only concerned about performance with respect to speed. That is, you don't care deeply about the size of the resulting code or memory use.
The code snippets you have provided will yield largely the same results, with the exception of the cached end() iterator. Aside from caching and inlining as much as you can, there is not much you can do to tweak the structure of the loops above to realize significant performance gains.
Writing performant code in critical paths relies first and foremost on selecting the best algorithm for the job. If you have a performance problem, look first and hard at the algorithm. The compiler will generally do a much better job at micro-optimizing the code you wrote than you could ever hope to.
All that being said, there are a few things you can do to give your compiler a little help.
Cache everything you can
Keep small allocations to a minimum, especially within a loop
Make as many things const as you can. This gives the compiler additional opportunities to micro-optimize.
Learn your toolchain well and leverage that knowledge
Learn your architecture well and leverage that knowledge
Learn to read assembly code and examine the assembly output from your compiler
Learning your toolchain and architecture are going to yield the most benefits. For example, GCC has many options you can enable to increase performance, including loop unrolling. See here. When iterating datasets, it is often beneficial to keep each item aligned to the size of a cache line. In modern architecture this often means 64 bytes, but learn your architecture.
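For example, to keep each item on its own cache line you might use alignas (a sketch; the 64-byte line size is an assumption about the target):
// Each Item starts on (and fills) its own 64-byte cache line, assuming
// 64-byte lines; this also avoids false sharing between adjacent items.
struct alignas(64) Item {
    int value;
    // the remaining bytes are padding up to 64
};

static_assert(alignof(Item) == 64, "Item should be cache-line aligned");
static_assert(sizeof(Item) == 64, "Item should occupy one full cache line");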
Here is an excellent guide to writing performant C++ in an Intel environment.
Once you have learned your architecture and toolchain, you might find that the algorithm you originally selected is not optimal in your real world. Be open to change in the face of new data.
It's very likely that modern compilers will produce the same assembly for the approaches you give above. You should look at the actual assembly (after enabling optimizations) to see.
When you're down to worrying about the speed of your loops, you should really think about whether your algorithm is truly optimal. If you're convinced it is, then you need to think about (and make use of) the underlying implementation of the data structures. std::vector uses an array underneath, and, depending on the compiler and the other code in the function, pointer aliasing may prevent the compiler from fully optimizing your code.
There's a fair amount of information out there on pointer aliasing (including What is the strict aliasing rule?), but Mike Acton has some wonderful information about pointer aliasing.
The restrict keyword (see What does the restrict keyword mean in C++? or, again, Mike Acton), available through compiler extensions for many years and codified in C99 (currently only available as a compiler extension in C++), is meant to deal with this. The way to use this in your code is far more C-like, but may allow the compiler to better optimize your loop, at least for the examples you've given:
uint64_t sum = 0;
uint32_t *restrict velt = &v[0];
uint32_t *restrict vend = velt + v.size();
while (velt < vend) {
    sum += *velt;
    velt++;
}
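For what it's worth, in C++ the usual spelling is the __restrict extension (a sketch; __restrict is non-standard but supported by GCC, Clang, and MSVC, and sum_restrict is just an illustrative name):
#include <cstdint>
#include <vector>

// Same loop, written with the __restrict extension so the compiler may
// assume the pointer does not alias anything else in the function.
uint64_t sum_restrict(const std::vector<uint32_t>& v) {
    uint64_t sum = 0;
    const uint32_t* __restrict velt = v.data();
    const uint32_t* const vend = velt + v.size();
    while (velt < vend) {
        sum += *velt;
        ++velt;
    }
    return sum;
}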
However, to see whether this provides a difference, you really need to profile different approaches for your actual, real-life problem, and possibly look at the underlying assembly produced. If you're summing simple data types, this may help you. If you're doing anything more complicated, including calling a function that cannot be inlined in the loop, it's unlikely to make any difference at all.
If you're using clang, then pass it these flags:
-Rpass-missed=loop-vectorize
-Rpass-analysis=loop-vectorize
In Visual C++ add this to the build:
/Qvec-report:2
These flags will tell you if a loop fails to vectorise (and give you an often cryptic message explaining why).
In general though, prefer options 4 and 5 (or std::for_each). Whilst clang and gcc will typically do a decent job in most cases, Visual C++ sadly tends to err on the side of caution. If the scope of the variable is unknown (e.g. a reference or pointer passed into a function, or the this pointer), then vectorisation often fails (containers in the local scope will almost always vectorise).
#include <vector>
#include <cmath>
// fails to vectorise in Visual C++ with /O2
void func1(std::vector<float>& v)
{
    for (size_t i = 0; i < v.size(); ++i)
    {
        v[i] = std::sqrt(v[i]);
    }
}
// this will vectorise with sqrtps
void func2(std::vector<float>& v)
{
    for (std::vector<float>::iterator it = v.begin(), e = v.end(); it != e; ++it)
    {
        *it = std::sqrt(*it);
    }
}
Clang and gcc aren't immune to these issues either. If you always take a copy of begin/end, then it cannot be a problem.
Here's another classic that affects many compilers sadly (clang 3.5.0 fails this trivial test, but it's fixed in clang 4.0). It crops up a LOT!
struct Foo
{
    void func3();
    void func4();
    std::vector<float> v;
    float f;
};
// will not vectorise
void Foo::func3()
{
    // this->v.end() !!
    for (std::vector<float>::iterator it = v.begin(); it != v.end(); ++it)
    {
        *it *= f; // this->f !!
    }
}
void Foo::func4()
{
    // you need to take a local copy of v.end(), and 'f'.
    const float temp = f;
    for (std::vector<float>::iterator it = v.begin(), e = v.end(); it != e; ++it)
    {
        *it *= temp;
    }
}
In the end, if it's something you care about, use the vectorisation reports from the compiler to fix up your code. As mentioned above, this is basically an issue of pointer aliasing. You can use the restrict keyword to help fix some of these issues (but I've found that applying restrict to 'this' is often not that useful).
Use range based for by default as it will give the compiler the most direct information to optimize (compiler knows it can cache the end iterator for example). Then profile and only optimize further if you identify a significant bottleneck. There will be very few real world situations where these different loop variants make a meaningful performance difference. Compilers are pretty good at loop optimization and it is far more likely that you should focus your optimization effort elsewhere (like choosing a better algorithm or focusing on optimizing the loop body).
Related
I almost never see a for loop like this:
for (int i = 0; 5 != i; ++i)
{}
Is there a technical reason to use > or < instead of != when incrementing by 1 in a for loop? Or this is more of a convention?
while (time != 6:30pm) {
Work();
}
It is 6:31pm... Damn, now my next chance to go home is tomorrow! :)
This is to show that the stronger restriction mitigates risks and is probably more intuitive to understand.
There is no technical reason. But there is mitigation of risk, maintainability and better understanding of code.
< or > are stronger restrictions than != and fulfill the exact same purpose in most cases (I'd even say in all practical cases).
There is a duplicate question here, and one interesting answer.
Yes there is a reason. If you write a (plain old index based) for loop like this
for (int i = a; i < b; ++i){}
then it works as expected for any values of a and b (i.e. zero iterations when a > b, instead of an infinite loop if you had used i != b).
On the other hand, for iterators you'd write
for (auto it = begin; it != end; ++it)
because any iterator must implement an operator!=, but not every iterator can provide an operator<.
Also range-based for loops
for (auto e : v)
are not just fancy sugar; they measurably reduce the chance of writing incorrect code.
You can have something like
for(int i = 0; i<5; ++i){
...
if(...) i++;
...
}
If your loop variable is written by the inner code, the i != 5 check might never become false and the loop won't terminate. Checking with < is the safer choice.
Edit about readability.
The < form is far more frequently used. Therefore, it is very fast to read, as there is nothing special to understand (brain load is reduced because the task is common). So it helps readers when you make use of these habits.
And last but not least, this is called defensive programming, meaning to always take the strongest case to avoid current and future errors influencing the program.
The only case where defensive programming is not needed is where states have been proven by pre- and post-conditions (but then, proving this is the most defensive of all programming).
I would argue that an expression like
for ( int i = 0 ; i < 100 ; ++i )
{
...
}
is more expressive of intent than is
for ( int i = 0 ; i != 100 ; ++i )
{
...
}
The former clearly calls out that the condition is a test for an exclusive upper bound on a range; the latter is a binary test of an exit condition. And if the body of the loop is non-trivial, it may not be apparent that the index is only modified in the for statement itself.
Iterators are an important case when you most often use the != notation:
for(auto it = vector.begin(); it != vector.end(); ++it) {
// do stuff
}
Granted: in practice I would write the same relying on a range-for:
for(auto & item : vector) {
// do stuff
}
but the point remains: one normally compares iterators using == or !=.
The loop condition is an enforced loop invariant.
Suppose you don't look at the body of the loop:
for (int i = 0; i != 5; ++i)
{
// ?
}
in this case, you know at the start of the loop iteration that i does not equal 5.
for (int i = 0; i < 5; ++i)
{
// ?
}
in this case, you know at the start of the loop iteration that i is less than 5.
The second is much, much more information than the first, no? Now, the programmer intent is (almost certainly) the same, but if you are looking for bugs, having confidence from reading a line of code is a good thing. And the second enforces that invariant, which means some bugs that would bite you in the first case just cannot happen (or don't cause memory corruption, say) in the second case.
You know more about the state of the program, from reading less code, with < than with !=. And on modern CPUs, the two comparisons take the same amount of time, so there is no performance difference.
If your i was not manipulated in the loop body, and it was always increased by 1, and it started less than 5, there would be no difference. But in order to know if it was manipulated, you'd have to confirm each of these facts.
Some of these facts are relatively easy to check, but some you can get wrong. Checking the entire body of the loop is, however, a pain.
In C++ you can write an indexes type such that:
for( const int i : indexes(0, 5) )
{
// ?
}
does the same thing as either of the two above for loops, even down to the compiler optimizing it down to the same code. Here, however, you know that i cannot be manipulated in the body of the loop, as it is declared const, without the code corrupting memory.
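A minimal sketch of such an indexes helper could look like this (the name and shape are assumptions on my part; since C++20, std::views::iota serves the same purpose):
// Minimal integer range usable with range-based for:
//   for (const int i : indexes(0, 5)) { ... }
class indexes {
public:
    class iterator {
    public:
        explicit iterator(int value) : value_(value) {}
        int operator*() const { return value_; }
        iterator& operator++() { ++value_; return *this; }
        bool operator!=(const iterator& other) const { return value_ != other.value_; }
    private:
        int value_;
    };

    indexes(int first, int last) : first_(first), last_(last) {}
    iterator begin() const { return iterator(first_); }
    iterator end() const { return iterator(last_); }

private:
    int first_;
    int last_;
};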
The more information you can get out of a line of code without having to understand the context, the easier it is to track down what is going wrong. < in the case of integer loops gives you more information about the state of the code at that line than != does.
As already said by Ian Newson, you can't reliably loop over a floating variable and exit with !=. For instance,
for (double x=0; x!=1; x+=0.1) {}
will actually loop forever, because 0.1 can't exactly be represented in floating point, hence the counter narrowly misses 1. With < it terminates.
(Note however that it's essentially unspecified whether you get 0.9999... as the last accepted number – which kind of violates the less-than assumption – or already exit at 1.0000000000000001.)
Yes; OpenMP doesn't parallelize loops with the != condition.
It may happen that the variable i is set to some large value and if you just use the != operator you will end up in an endless loop.
As you can see from the other numerous answers, there are reasons to use < instead of != which will help in edge cases, initial conditions, unintended loop counter modification, etc...
Honestly though, I don't think you can stress the importance of convention enough. For this example it will be easy enough for other programmers to see what you are trying to do, but it will cause a double-take. One of the jobs while programming is making it as readable and familiar to everyone as possible, so inevitably when someone has to update/change your code, it doesn't take a lot of effort to figure out what you were doing in different code blocks. If I saw someone use !=, I'd assume there was a reason they used it instead of < and if it was a large loop I'd look through the whole thing trying to figure out what you did that made that necessary... and that's wasted time.
I take the adjective "technical" to mean language behavior/quirks and compiler side effects such as performance of generated code.
To this end, the answer is: no(*). The (*) is "please consult your processor manual". If you are working with some edge-case RISC or FPGA system, you may need to check what instructions are generated and what they cost. But if you're using pretty much any conventional modern architecture, then there is no significant processor level difference in cost between lt, eq, ne and gt.
If you are using an edge case you could find that != requires three operations (cmp, not, beq) vs two (cmp, blt xtr myo). Again, RTM in that case.
For the most part, the reasons are defensive/hardening, especially when working with pointers or complex loops. Consider
// highly contrived example
size_t count_chars(char c, const char* str, size_t len) {
    size_t count = 0;
    bool quoted = false;
    const char* p = str;
    while (p != str + len) {
        if (*p == '"') {
            quoted = !quoted;
            ++p;
        }
        if (*(p++) == c && !quoted)
            ++count;
    }
    return count;
}
A less contrived example would be where you are using return values to perform increments, accepting data from a user:
#include <iostream>
int main() {
size_t len = 5, step;
for (size_t i = 0; i != len; ) {
std::cout << "i = " << i << ", step? " << std::flush;
std::cin >> step;
i += step; // here for emphasis, it could go in the for(;;)
}
}
Try this and input the values 1, 2, 10, 999.
You could prevent this:
#include <iostream>
int main() {
size_t len = 5, step;
for (size_t i = 0; i != len; ) {
std::cout << "i = " << i << ", step? " << std::flush;
std::cin >> step;
if (step + i > len)
std::cout << "too much.\n";
else
i += step;
}
}
But what you probably wanted was
#include <iostream>
int main() {
size_t len = 5, step;
for (size_t i = 0; i < len; ) {
std::cout << "i = " << i << ", step? " << std::flush;
std::cin >> step;
i += step;
}
}
There is also something of a convention bias towards <, because ordering in standard containers often relies on operator<; for instance, the ordered STL containers determine equivalence by saying
if (lhs < rhs) // T.operator <
lessthan
else if (rhs < lhs) // T.operator < again
greaterthan
else
equal
If lhs and rhs are of a user-defined class, writing this code as
if (lhs < rhs) // requires T.operator<
lessthan
else if (lhs > rhs) // requires T.operator>
greaterthan
else
equal
requires the implementor to provide two comparison functions. So < has become the favored operator.
There are several ways to write any kind of code (usually); there just happen to be two ways in this case (three if you count <= and >=).
In this case, people prefer > and < to make sure that even if something unexpected happens in the loop (like a bug), it won't loop infinitely (BAD). Consider the following code, for example.
for (int i = 1; i != 3; i++) {
//More Code
i = 5; //OOPS! MISTAKE!
//More Code
}
If we used (i < 3), we would be safe from an infinite loop because it places a stronger restriction.
It's really your choice whether you want a mistake in your program to shut the whole thing down or to keep functioning with the bug there.
Hope this helped!
The most common reason to use < is convention. More programmers think of loops like this as "while the index is in range" rather than "until the index reaches the end." There's value in sticking to convention when you can.
On the other hand, many answers here are claiming that using the < form helps avoid bugs. I'd argue that in many cases this just helps hide bugs. If the loop index is supposed to reach the end value, and, instead, it actually goes beyond it, then there's something happening you didn't expect which may cause a malfunction (or be a side effect of another bug). The < will likely delay discovery of the bug. The != is more likely to lead to a stall, hang, or even a crash, which will help you spot the bug sooner. The sooner a bug is found, the cheaper it is to fix.
Note that this convention is peculiar to array and vector indexing. When traversing nearly any other type of data structure, you'd use an iterator (or pointer) and check directly for an end value. In those cases you have to be sure the iterator will reach and not overshoot the actual end value.
For example, if you're stepping through a plain C string, it's generally more common to write:
for (char *p = foo; *p != '\0'; ++p) {
// do something with *p
}
than
int length = strlen(foo);
for (int i = 0; i < length; ++i) {
// do something with foo[i]
}
For one thing, if the string is very long, the second form will be slower because the strlen is another pass through the string.
With a C++ std::string, you'd use a range-based for loop, a standard algorithm, or iterators, even though the length is readily available. If you're using iterators, the convention is to use != rather than <, as in:
for (auto it = foo.begin(); it != foo.end(); ++it) { ... }
Similarly, iterating a tree or a list or a deque usually involves watching for a null pointer or other sentinel rather than checking if an index remains within a range.
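For instance, walking a singly linked list is naturally a != (or ==) test against a null sentinel (a sketch with a hypothetical Node type):
struct Node {
    int value;
    Node* next;
};

// Traversal stops when the sentinel (null pointer) is reached; there is
// no meaningful "less than" to compare a node pointer against here.
int sum_list(const Node* head) {
    int sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next)
        sum += p->value;
    return sum;
}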
One reason not to use this construct is floating point numbers. != is a very dangerous comparison to use with floats as it'll rarely evaluate to true even if the numbers look the same. < or > removes this risk.
There are two related reasons for following this practice that both have to do with the fact that a programming language is, after all, a language that will be read by humans (among others).
(1) A bit of redundancy. In natural language we usually provide more information than is strictly necessary, much like an error correcting code. Here the extra information is that the loop variable i (see how I used redundancy here? If you didn't know what 'loop variable' means, or if you forgot the name of the variable, after reading "loop variable i" you have the full information) is less than 5 during the loop, not just different from 5. Redundancy enhances readability.
(2) Convention. Languages have specific standard ways of expressing certain situations. If you don't follow the established way of saying something, you will still be understood, but the effort for the recipient of your message is greater because certain optimisations won't work. Example:
Don't talk around the hot mash. Just illuminate the difficulty!
The first sentence is a literal translation of a German idiom. The second is a common English idiom with the main words replaced by synonyms. The result is comprehensible but takes a lot longer to understand than this:
Don't beat around the bush. Just explain the problem!
This is true even in case the synonyms used in the first version happen to fit the situation better than the conventional words in the English idiom. Similar forces are in effect when programmers read code. This is also why 5 != i and 5 > i are weird ways of putting it unless you are working in an environment in which it is standard to swap the more normal i != 5 and i < 5 in this way. Such dialect communities do exist, probably because consistency makes it easier to remember to write 5 == i instead of the natural but error prone i == 5.
Using relational comparisons in such cases is more of a popular habit than anything else. It gained its popularity back in the times when such conceptual considerations as iterator categories and their comparability were not considered high priority.
I'd say that one should prefer to use equality comparisons instead of relational comparisons whenever possible, since equality comparisons impose less requirements on the values being compared. Being EqualityComparable is a lesser requirement than being LessThanComparable.
Another example that demonstrates the wider applicability of equality comparison in such contexts is the popular conundrum with implementing unsigned iteration down to 0. It can be done as
for (unsigned i = 42; i != -1; --i)
...
Note that the above is equally applicable to both signed and unsigned iteration, while the relational version breaks down with unsigned types.
Besides the examples where the loop variable is (unintentionally) changed inside the body, there are other reasons to use the less-than or greater-than operators:
Negations make code harder to understand
< or > is only one char, but != two
In addition to the various people who have mentioned that it mitigates risk, it also reduces the number of function overloads necessary to interact with various standard library components. As an example, if you want your type to be storable in a std::set, or used as a key for std::map, or used with some of the searching and sorting algorithms, the standard library usually uses std::less to compare objects as most algorithms only need a strict weak ordering. Thus it becomes a good habit to use the < comparisons instead of != comparisons (where it makes sense, of course).
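As an illustration (a sketch with a made-up Point type), a single operator< is all std::set needs, both to order elements and to decide equivalence:
#include <set>

struct Point {
    int x, y;
};

// One operator< is enough: std::set orders with it and treats two points
// as equivalent when !(a < b) && !(b < a).
bool operator<(const Point& lhs, const Point& rhs) {
    if (lhs.x != rhs.x) return lhs.x < rhs.x;
    return lhs.y < rhs.y;
}

int main() {
    std::set<Point> points;        // uses std::less<Point>, i.e. operator<
    points.insert(Point{1, 2});
    points.insert(Point{1, 2});    // equivalent to the first, not inserted
    return static_cast<int>(points.size()); // 1
}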
There is no problem from a syntax perspective, but the logic behind that expression 5!=i is not sound.
In my opinion, using != to set the bounds of a for loop is not logically sound because a for loop either increments or decrements the iteration index, so setting the loop to iterate until the iteration index becomes out of bounds (!= to something) is not a proper implementation.
It will work, but it is prone to misbehavior, since boundary handling is lost when using != for an incremental problem (where you know from the start whether it increments or decrements); that's why <, >, <=, and >= are used instead of !=.
Recently, I've been thinking about all the ways that one could iterate through an array and wondered which of these is the most (and least) efficient. I've written a hypothetical problem and five possible solutions.
Problem
Given an int array arr with len number of elements, what would be the most efficient way of assigning an arbitrary number 42 to every element?
Solution 0: The Obvious
for (unsigned i = 0; i < len; ++i)
arr[i] = 42;
Solution 1: The Obvious in Reverse
for (unsigned i = len - 1; i >= 0; --i) // beware: i >= 0 is always true for an unsigned i, so as written this never terminates
arr[i] = 42;
Solution 2: Address and Iterator
for (unsigned i = 0; i < len; ++i)
{
    *arr = 42;
    ++arr;
}
Solution 3: Address and Iterator in Reverse
for (unsigned i = len; i; --i)
{
    *arr = 42;
    ++arr;
}
Solution 4: Address Madness
int* end = arr + len;
for (; arr < end; ++arr)
*arr = 42;
Conjecture
The obvious solutions are almost always used, but I wonder whether the subscript operator could result in a multiplication instruction, as if it had been written like *(arr + i * sizeof(int)) = 42.
The reverse solutions try to take advantage of how comparing i to 0 instead of len might mitigate a subtraction operation. Because of this, I prefer Solution 3 over Solution 2. Also, I've read that arrays are optimized to be accessed forwards because of how they're stored in the cache, which could present an issue with Solution 1.
I don't see why Solution 4 would be any less efficient than Solution 2. Solution 2 increments the address and the iterator, while Solution 4 only increments the address.
In the end, I'm not sure which of these solutions I prefer. I think the answer also varies with the target architecture and optimization settings of your compiler.
Which of these do you prefer, if any?
Just use std::fill.
std::fill(arr, arr + len, 42);
Out of your proposed solutions, on a good compiler, none should be faster than the others.
The ISO standard doesn't mandate the efficiency of the different ways of doing things in code (other than certain big-O type stuff for some collection algorithms), it simply mandates how it functions.
Unless your arrays are billions of elements in size, or you're wanting to set them millions of times per minute, it generally won't make the slightest difference which method you use.
If you really want to know (and I still maintain it's almost certainly unnecessary), you should benchmark the various methods in the target environment. Measure, don't guess!
As to which I prefer, my first inclination is to optimise for readability. Only if there's a specific performance problem do I then consider other possibilities. That would be simply something like:
for (size_t idx = 0; idx < len; idx++)
arr[idx] = 42;
I don't think performance is an issue here - these are micro-optimizations that are hardly ever necessary, if they make any difference at all (I could imagine the compiler producing identical assembly for most of them).
Go with the solution that is most readable; the standard library provides you with std::fill, or for more complex assignments
for(unsigned k = 0; k < len; ++k)
{
// whatever
}
so it is obvious to other people looking at your code what you are doing. With C++11 you could also
for(auto & elem : arr)
{
// whatever
}
just don't try to obfuscate your code without any necessity.
For nearly all meaningful cases, the compiler will optimize all of the suggested ones to the same thing, and it's very unlikely to make any difference.
There used to be a trick where you could avoid the automatic prefetching of data if you ran the loop backwards, which under some bizarre set of circumstances actually made it more efficient. I can't recall the exact circumstances, but I expect modern processors will identify backwards loops as well as forwards loops for automatic prefetching anyway.
If it's REALLY important for your application to do this over a large number of elements, then looking at blocked access and using non-temporal storage will be the most efficient. But before you do that, make sure you have identified the filling of the array as an important performance point, and then make measurements for the current code and the improved code.
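As a rough illustration of what non-temporal stores look like (a sketch only, assuming an x86/x86-64 target with SSE2; the function name is made up, and you should benchmark before adopting anything like this):
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Fill an int array with 42 using streaming (non-temporal) stores, which
// bypass the cache; this only pays off for large arrays that won't be
// read again soon. The unaligned prefix and the tail use plain stores.
void fill42_nontemporal(int* arr, std::size_t len) {
    const __m128i value = _mm_set1_epi32(42);
    std::size_t i = 0;
    while (i < len && (reinterpret_cast<std::uintptr_t>(arr + i) % 16) != 0)
        arr[i++] = 42;                       // align up to a 16-byte boundary
    for (; i + 4 <= len; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(arr + i), value);
    for (; i < len; ++i)
        arr[i] = 42;                         // leftover elements
    _mm_sfence();                            // order the streaming stores
}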
I may come back with some actual benchmarks to prove that "it makes little difference" in a bit, but I've got an errand to run before it gets too late in the day...
I have these 2 functions:
template<int N>
void fun()
{
for(int i = 0; i < N; ++i)
{
std::cout<<i<<" ";
}
}
void gun(int N)
{
for(int i = 0; i < N; ++i)
{
std::cout<<i<<" ";
}
}
May I assume that in the first version the compiler will optimize the loop for every small N(by small I mean N = {1, 2, 3, 4})?
May I assume that in the first version the compiler will optimize the loop for every small N
That is a typical optimization, although "assume" is a strong word. If an optimization is imperative, you will eventually be disappointed by relying on a merely potential optimization.
Your second version may experience the same optimization if the compiler is able to inline the function.
You never have any guarantees as to what the optimizer will do, but given a suitable optimization level, you can usually rely on it making better choices than you would if optimizing manually.
If you really want to know what code is produced, you can always take a look at the resulting assembly.
If the compiler can inline either of the functions, it will also unroll the loop if it thinks that's the right thing to do. When and how a compiler decides there is a benefit in unrolling a loop is quite a complex matter, and depends highly on other factors, such as the number of available registers and what happens inside the loop. (I doubt the example given above, for example, would gain much from removing the five or so instructions involved in the loop control, given that cout ... will probably consume several thousand times as much time - whether the compiler can figure that out or not is another matter, but it isn't entirely unknown for compilers to have SOME understanding of whether a function is small or not.)
On the other hand, if the code looks something like this:
int arr[5]; // Global array (size matches the N = 5 assumed below).

template<int N>
int fun()
{
    int sum = 0;
    for(int i = 0; i < N; ++i)
    {
        sum += arr[i];
    }
    return sum;
}
Then I would expect the compiler to unroll the loop to be something like this:
int *tmp = arr;
sum += *tmp++;
sum += *tmp++;
sum += *tmp++;
sum += *tmp++;
sum += *tmp++;
Assuming N = 5.
And this applies to any function that is "visible" to the compiler and where N is known at compile time. So, assuming gun isn't in a different source file, I would expect it to be inlined and unrolled exactly the same as fun (which, being a template function, HAS to be visible in this translation unit).
It depends on your optimization level and flags. There's a big difference between -O0 -g (no optimization, debugging enabled), -O3 (aggresively optimize for speed), and -Os (optimize for space).
These days loop unrolling isn't necessarily a win, even when optimizing for speed. Too much code can cause an instruction cache miss, which will greatly outweigh the speedup from unrolling a simple loop. And the cost of the conditional branch in a loop like this is almost negligible, since branch prediction will correctly anticipate all but the last iteration.
If you want to be a little more explicit, you can use Duff's Device which uses switch case fallthrough to unroll loops. I can't speak to how well it works in practice, though. I would imagine, however, that if you can hint to the compiler to unroll it instead, that would be faster.
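For reference, a Duff's-device-style unrolled sum might look roughly like this (a sketch only; sum_duff is just an illustrative name, and as noted, a plain loop plus compiler optimization is usually just as good, so measure first):
#include <cstddef>

// Unrolls the loop eight-fold using switch/case fall-through.
long long sum_duff(const int* arr, std::size_t count) {
    long long sum = 0;
    if (count == 0) return sum;
    std::size_t rounds = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { sum += *arr++;   // every case falls through
    case 7:      sum += *arr++;
    case 6:      sum += *arr++;
    case 5:      sum += *arr++;
    case 4:      sum += *arr++;
    case 3:      sum += *arr++;
    case 2:      sum += *arr++;
    case 1:      sum += *arr++;
            } while (--rounds > 0);
    }
    return sum;
}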
Compilers are also pretty smart, and while they're not infallible their optimization choices are generally better than our own intuition.
Solving the following exercise:
Write three different versions of a program to print the elements of
ia. One version should use a range for to manage the iteration, the
other two should use an ordinary for loop in one case using subscripts
and in the other using pointers. In all three programs write all the
types directly. That is, do not use a type alias, auto, or decltype to
simplify the code.[C++ Primer]
a question came up: which of these methods for accessing the array is best in terms of speed, and why?
My Solutions:
Foreach Loop:
int ia[3][4]={{1,2,3,4},{5,6,7,8},{9,10,11,12}};
for (int (&i)[4]:ia) //1st method using for each loop
for(int j:i)
cout<<j<<" ";
Nested for loops:
for (int i=0;i<3;i++) //2nd method normal for loop
for(int j=0;j<4;j++)
cout<<ia[i][j]<<" ";
Using pointers:
int (*i)[4]=ia;
for(int t=0;t<3;i++,t++){     //3rd method. using pointers.
    for(int x=0;x<4;x++)
        cout<<(*i)[x]<<" ";
}
Using auto:
for(auto &i:ia) //4th one using auto but I think it is similar to 1st.
for(auto j:i)
cout<<j<<" ";
Benchmark result using clock()
1st: 3.6 (6,4,4,3,2,3)
2nd: 3.3 (6,3,4,2,3,2)
3rd: 3.1 (4,2,4,2,3,4)
4th: 3.6 (4,2,4,5,3,4)
Simulating each method 1000 times:
1st: 2.29375 2nd: 2.17592 3rd: 2.14383 4th: 2.33333
Process returned 0 (0x0) execution time : 13.568 s
Compiler used: MinGW 3.2 with the c++11 flag enabled. IDE: Code::Blocks
I have some observations and points to make and I hope you get your answer from this.
The fourth version, as you mention yourself, is basically the same as the first version. auto can be thought of as only a coding shortcut (this is of course not strictly true, as using auto can result in getting different types than you'd expected and therefore result in different runtime behavior. But most of the time this is true.)
Your solution using pointers is probably not what people mean when they say that they are using pointers! One solution might be something like this:
for (int i = 0, *p = &(ia[0][0]); i < 3 * 4; ++i, ++p)
cout << *p << " ";
or to use two nested loops (which is probably pointless):
for (int i = 0, *p = &(ia[0][0]); i < 3; ++i)
for (int j = 0; j < 4; ++j, ++p)
cout << *p << " ";
From now on, I'm assuming this is the pointer solution you've written.
In such a trivial case as this, the part that will absolutely dominate your running time is the cout. The time spent in bookkeeping and checks for the loop(s) will be completely negligible comparing to doing I/O. Therefore, it won't matter which loop technique you use.
Modern compilers are great at optimizing such ubiquitous tasks and access patterns (iterating over an array.) Therefore, chances are that all these methods will generate exactly the same code (with the possible exception of the pointer version, which I will talk about later.)
The performance of most code like this will depend more on the memory access pattern than on how exactly the compiler generates the branch instructions (and the rest of the operations). This is because if a required memory block is not in the CPU cache, it's going to take roughly several hundred CPU cycles (this is just a ballpark number) to fetch those bytes from RAM. Since all the examples access memory in exactly the same order, their behavior with respect to memory and cache will be the same, and they will have roughly the same running time.
As a side note, the way these examples access memory is the best way for it to be accessed: linearly, consecutively, and from start to finish. Again, there are problems with the cout in there, which can be a very complicated operation and can even call into the OS on every invocation, which might result in, among other things, an almost complete eviction of everything useful from the CPU cache.
On 32-bit systems and programs, the size of an int and a pointer are usually equal (both are 32 bits!) Which means that it doesn't matter much whether you pass around and use index values or pointers into arrays. On 64-bit systems however, a pointer is 64 bits but an int will still usually be 32 bits. This suggests that it is usually better to use indexes into arrays instead of pointers (or even iterators) on 64-bit systems and programs.
In this particular example, this is not significant at all though.
Your code is very specific and simple, but in the general case it is almost always better to give the compiler as much information about your code as possible. This means that you must use the narrowest, most specific device available to you to do a job. This in turn means that a generic for loop (i.e. for (int i = 0; i < n; ++i)) is worse for the compiler than a range-based for loop (i.e. for (auto i : v)), because in the latter case the compiler simply knows that you are going to iterate over the whole range and not go outside of it or break out of the loop, while in the generic for loop case, especially if your code is more complex, the compiler cannot be sure of this and has to insert extra checks and tests to make sure the code executes as the C++ standard says it should.
In many (most?) cases, although you might think performance matters, it does not. And most of the time you rewrite something to gain performance, you don't gain much. And most of the time the performance gain you get is not worth the loss in readability and maintainability that you sustain. So, design your code and data structures right (and keep performance in mind) but avoid this kind of "micro-optimization" because it's almost always not worth it and even harms the quality of the code too.
Generally, performance in terms of speed is very hard to reason about. Ideally you have to measure the time with real data on real hardware in real working conditions using sound scientific measuring and statistical methods. Even measuring the time it takes a piece of code to run is not at all trivial. Measuring performance is hard, and reasoning about it is harder, but these days it is the only way of recognizing bottlenecks and optimizing the code.
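If you do want a rough number, a minimal timing harness with std::chrono might look like this (a sketch only; a serious benchmark would repeat the measurement many times and keep the compiler from discarding the work):
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    std::vector<uint32_t> v(1000 * 1000, 1);

    const auto start = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (uint32_t x : v)
        sum += x;
    const auto stop = std::chrono::steady_clock::now();

    const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    // Printing sum also keeps the loop from being optimized away entirely.
    std::cout << "sum = " << sum << ", took " << us << " us\n";
}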
I hope I have answered your question.
EDIT: I wrote a very simple benchmark for what you are trying to do. The code is here. It's written for Windows and should be compilable on Visual Studio 2012 (because of the range-based for loops.) And here are the timing results:
Simple iteration (nested loops): min:0.002140, avg:0.002160, max:0.002739
Simple iteration (one loop): min:0.002140, avg:0.002160, max:0.002625
Pointer iteration (one loop): min:0.002140, avg:0.002160, max:0.003149
Range-based for (nested loops): min:0.002140, avg:0.002159, max:0.002862
Range(const ref)(nested loops): min:0.002140, avg:0.002155, max:0.002906
The relevant numbers are the "min" times (over 2000 runs of each test, for 1000x1000 arrays.) As you see, there is absolutely no difference between the tests. Note that you should turn on compiler optimizations or test 2 will be a disaster and cases 4 and 5 will be a little worse than 1 and 3.
And here are the code for the tests:
// 1. Simple iteration (nested loops)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows; ++i)
for (unsigned j = 0; j < gc_Cols; ++j)
sum += g_Data[i][j];
// 2. Simple iteration (one loop)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
sum += g_Data[i / gc_Cols][i % gc_Cols];
// 3. Pointer iteration (one loop)
unsigned sum = 0;
unsigned * p = &(g_Data[0][0]);
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
sum += *p++;
// 4. Range-based for (nested loops)
unsigned sum = 0;
for (auto & i : g_Data)
for (auto j : i)
sum += j;
// 5. Range(const ref)(nested loops)
unsigned sum = 0;
for (auto const & i : g_Data)
for (auto const & j : i)
sum += j;
There are many factors affecting it:
It depends on the compiler
It depends on the compiler flags used
It depends on the computer used
There is only one way to know the exact answer: measuring the time when dealing with huge arrays (maybe filled from a random number generator), which is the same method you have already used, except that the array size should be at least 1000x1000.
When thinking about this question I started to wonder whether std::copy() and/or std::fill are specialized (I really mean optimized) for std::vector<bool>.
Is this required by the C++ standard or, perhaps, is it a common approach among C++ standard library vendors?
Simply speaking, I want to know whether the following code:
std::vector<bool> v(10, false);
std::fill(v.begin(), v.end(), true);
is in any way better/different than that:
std::vector<bool> v(10, false);
for (auto it = v.begin(); it != v.end(); ++it) *it = true;
To be very strict: can, let's say, std::fill<std::vector<bool>::iterator>() go into the internal representation of std::vector<bool> and set entire bytes instead of single bits? I assume making std::fill a friend of std::vector<bool> is not a big problem for a library vendor?
[UPDATE]
Next related question: can I (or anybody else :) specialize such algorithms for, let's say, std::vector<bool>, if they are not already specialized? Is this allowed by the C++ standard? I know this would be non-portable - but just for one selected std C++ library? Assuming I (or anybody else) find a way to get at std::vector<bool>'s private parts.
The standard library is largely a header-only library shipped with your compiler, so you can look into those headers yourself. GCC's vector<bool> implementation is in stl_bvector.h. It will probably be the same file for other compilers too. And yes, there is a specialized fill (look near __fill_bvector).
Optimizations are nowhere mandated in the standard. It is assumed to be a "quality of implementation" issue if an optimization could applied. The asymptotic complexity of most algorithms is, however, restricted.
Optimizations are allowed as long as a correct program behaves according to what the standard mandates. The examples you ask about, i.e., optimizations involving standard algorithms using iterators on std::vector<bool>, can achieve their objective pretty much in any way the implementation sees fit because there is no way to monitor how they are implemented. This said, I doubt very much that there is any standard library implementation optimizing operations on std::vector<bool>. Most people seem to think that this specialization is an abomination in the first place and that it should go away.
A user is only allowed to create specializations of library types if the specialization involves at least one user-defined type. I don't think a user is allowed to provide any function in namespace std at all: there isn't any need, because all such functions would involve a user-defined type and would thus be found in the user's namespace. Formulated differently: I think you are out of luck with respect to getting algorithms optimized for std::vector<bool> for the time being. You might consider contributing optimized versions to the open source implementations (e.g., libstdc++ and libc++), however.
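For example, this kind of specialization is allowed because it involves a user-defined type (a sketch; Employee is made up):
#include <cstddef>
#include <functional>
#include <string>

struct Employee {
    std::string name;
    int id;
};

// Specializing a standard library template for a user-defined type is permitted.
namespace std {
template <>
struct hash<Employee> {
    size_t operator()(const Employee& e) const noexcept {
        return hash<string>()(e.name) ^ (hash<int>()(e.id) << 1);
    }
};
}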
There is no specialization for it, but you can still use it (even though it's slow).
But here is a trick I found which enables std::fill on std::vector<bool>, using proxy class std::_Vbase.
(WARNING: I've tested it only for MSVC2013, so it may not work on other compilers.)
int num_bits = 100000;
std::vector<bool> bit_set(num_bits , true);
int bitsize_elem = sizeof(std::_Vbase) * 8; // 1byte = 8bits
int num_elems = static_cast<int>(std::ceil(num_bits / static_cast<double>(bitsize_elem)));
Here, since you need all the bits of an element if you use any bit of it, the number of elements must be rounded up.
Using this information, we will build a vector of pointers that pointing the original elements underlying the bits.
std::vector<std::_Vbase*> elem_ptrs(num_elems, nullptr);
std::vector<bool>::iterator bitset_iter = bit_set.begin();
for (int i = 0; i < num_elems; ++i)
{
std::_Vbase* elem_ptr = const_cast<std::_Vbase*>((*bitset_iter)._Myptr);
elem_ptrs[i] = elem_ptr;
std::advance(bitset_iter, bitsize_elem);
}
(*bitset_iter)._Myptr : By dereferencing the iterator of std::vector<bool>, you can access the proxy class reference and its member _Myptr.
Since the return type of std::vector<bool>::iterator::operator*() is const std::_Vbase*, remove the constness of it by const_cast.
Now we get the pointer which is pointing the original element underlying those bits, std::_Vbase* elem_ptr.
elem_ptrs[i] = elem_ptr : Record this pointer,...
std::advance(bitset_iter, bitsize_elem) : ...and continue our journey to find the next element, by jumping bits held by the previous element.
std::fill(elem_ptrs[0], elem_ptrs[0] + num_elems, 0); // fill every bits "false"
std::fill(elem_ptrs[0], elem_ptrs[0] + num_elems, -1); // fill every bits "true"
Now we can use std::fill on the underlying elements through the recorded pointers, rather than on individual bits.
Perhaps some may feel uncomfortable using the proxy class externally and even removing its constness.
But if you don't care about that and want something fast, this is the fastest way.
I did some comparisons below. (made new project, nothing changed config, release, x64)
int it_max = 10; // do it 10 times ...
int num_bits = std::numeric_limits<int>::max(); // 2147483647
std::vector<bool> bit_set(num_bits, true);
for (int it_count = 0; it_count < it_max; ++it_count)
{
std::fill(elem_ptrs[0], elem_ptrs[0] + num_elems, 0);
} // Elapse Time : 0.397sec
for (int it_count = 0; it_count < it_max; ++it_count)
{
std::fill(bit_set.begin(), bit_set.end(), false);
} // Elapse Time : 18.734sec
for (int it_count = 0; it_count < it_max; ++it_count)
{
for (int i = 0; i < num_bits; ++i)
{
bit_set[i] = false;
}
} // Elapse Time : 21.498sec
for (int it_count = 0; it_count < it_max; ++it_count)
{
bit_set.assign(num_bits, false);
} // Elapse Time : 21.779sec
for (int it_count = 0; it_count < it_max; ++it_count)
{
bit_set.swap(std::vector<bool>(num_bits, false)); // You can not use elem_ptrs anymore
} // Elapse Time : 1.3sec
There is one caveat. When you swap() the original vector with another one, then the vector of pointers becomes useless!
23.2.5 Class vector from the C++ International Standard goes as far as to tell us
To optimize space allocation, a specialization of vector for bool elements is provided:
after which the bitset specialization is described. That's as far as the standard goes regarding vector<bool>: vendors need to implement it using a bitset to optimize for space, and optimizing for space here comes at the cost of not optimizing for speed.
It's easier to get a book from the library than it is to find a book if it were between all the library books stapled closely together in containers....
Take your example: you're trying to do a std::fill or std::copy from begin to end. But that's not always the case - sometimes the range doesn't simply map to entire bytes. That's a bit of a problem in terms of speed optimization. It's easy for the case where you have to change every bit to one - that's just setting the bytes to 0xFF - but that's not always the case here; it becomes much harder if you only change certain bits of a byte. Then you'll need to actually compute what the byte will be; that's not a trivial thing to do*, or at least not as an atomic operation on current hardware.
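To make that concrete, here's a sketch of what setting only some bits of one byte involves (a read-modify-write with a computed mask, unlike the whole-byte case; the function name is made up):
#include <cstdint>

// Set bits [first, last) of a single byte to 1 (0 <= first <= last <= 8).
void set_bits_in_byte(uint8_t& byte, unsigned first, unsigned last) {
    const uint8_t mask =
        static_cast<uint8_t>(((1u << (last - first)) - 1u) << first);
    byte |= mask;   // only the requested bits change; the rest are preserved
}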
It's the premature optimization story, it's nice in terms of space but horrible in terms of performance.
Is having an "is a multiple of 8 bits" check worth the overhead? I doubt it.
* We're talking about multiple bits here; for the case where it's just one bit you can of course do a bit operation.