Code reordering due to optimization - c++

I've heard so many times that an optimizer may reorder your code that I'm starting to believe it.
Are there any examples or typical cases where this might happen, and how can I avoid such a thing (e.g. I want a benchmark to be impervious to this)?

There are LOTS of different kinds of "code-motion" (moving code around), and it's caused by lots of different parts of the optimisation process:
Move instructions around, because it's a waste of time to wait for a memory read to complete without putting at least one or two instructions between the read and the operation that uses the value we got from memory.
Move things out of loops, because they only need to happen once (if you call x = sin(y) once or 1000 times without changing y, x will have the same value, so there's no point in doing that inside a loop, and the compiler moves it out - see the sketch after this list).
Move code around based on "the compiler expects this code to be hit more often than the other bit, so we get a better cache-hit ratio if we do it this way" - for example, error handling being moved away from the source of the error, because it's unlikely that you get an error [compilers often understand commonly used functions and know that they typically succeed].
Inlining - code is moved from the called function into the calling function. This often leads to OTHER effects, such as a reduction in pushing/popping registers to and from the stack, and arguments can be kept where they are rather than having to be moved to the "right place for arguments".
I'm sure I've missed some cases in the above list, but these are certainly some of the most common.
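As a sketch of the loop-invariant case mentioned above (the function and variable names here are just illustrative), the first form is effectively compiled as if you had written the second:
#include <cmath>

double before(const double *data, int n, double y)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += data[i] * std::sin(y);   // sin(y) does not depend on i
    return total;
}

double after(const double *data, int n, double y)
{
    const double s = std::sin(y);         // hoisted: computed once, outside the loop
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += data[i] * s;
    return total;
}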
The compiler is perfectly within its rights to do this, as long as there is no "observable difference" (other than the time it takes to run and the number of instructions used - those "don't count" as observable differences when it comes to compilers).
There is very little you can do to prevent the compiler from reordering your code, but you can write code that constrains it to some degree. For example, we can have code like this:
{
    int sum = 0;
    for(int i = 0; i < large_number; i++)
        sum += i;
}
Now, since sum isn't being used, the compiler can remove it. Adding some code that prints the sum would ensure that it's "used" as far as the compiler is concerned.
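A minimal version of that (the number is arbitrary; long long is used so the total cannot overflow):
#include <cstdio>

int main()
{
    const int large_number = 1000000;
    long long sum = 0;
    for (int i = 0; i < large_number; i++)
        sum += i;
    std::printf("%lld\n", sum);   // the result is now observable, so the loop cannot be thrown away
    return 0;
}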
Likewise:
for(int i = 0; i < large_number; i++)
{
    do_stuff();
}
If the compiler can figure out that do_stuff doesn't actually change any global value, or anything similar, it will move the code around to form this:
do_stuff();
for(int i = 0; i < large_number; i++)
{
}
The compiler may also remove - in fact, almost certainly will remove - the now-empty loop so that it doesn't exist at all. [As mentioned in the comments: if do_stuff doesn't actually change anything outside itself, it may also be removed, but the example I had in mind is where do_stuff produces a result, and the result is the same each time.]
(The above happens if you remove the printout of results in the Dhrystone benchmark, for example, since some of the loops calculate values that are never used other than in the printout - this can lead to benchmark results that exceed the highest theoretical throughput of the processor by a factor of 10 or so, because the benchmark assumes the instructions for the loop were actually executed and counts X nominal operations per iteration.)
There is no easy way to ensure this doesn't happen, aside from ensuring that do_stuff either updates some variable outside the function, or returns a value that is "used" (e.g. summing up, or something).
Another example of removing/omitting code is where you store values to the same variable repeatedly:
int x;
for(int i = 0; i < large_number; i++)
    x = i * i;
can be replaced with:
x = (large_number-1) * (large_number-1);
Sometimes, you can use volatile to ensure that something REALLY happens, but in a benchmark that CAN be detrimental, since the compiler also can't optimise code that it SHOULD optimise (if you are not careful with how you use volatile).
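One common compromise is a volatile "sink" that only the final result is stored into, so the loop itself stays optimisable (a sketch; do_stuff here stands for whatever you are benchmarking and is assumed to return a value):
int do_stuff();                      // assumed to be defined elsewhere
volatile long long sink;             // a store to a volatile object is observable behaviour

void benchmark(int large_number)
{
    long long total = 0;
    for (int i = 0; i < large_number; i++)
        total += do_stuff();
    sink = total;                    // one forced store keeps all the work above alive
}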
If you have some SPECIFIC code that you care particularly about, it would be best to post it (and compile it with several state-of-the-art compilers to see what they actually do with it).
[Note that moving code around is definitely not a BAD thing in general - I do want my compiler (whether it's the one I'm writing myself, or one written by someone else) to optimise by moving code, because, as long as it does so correctly, it will produce faster/better code by doing so!]

Most of the time, reordering is only allowed in situations where the observable effects of the program are the same - this means you shouldn't be able to tell.
Counterexamples do exist: for example, the order of evaluation of operands is unspecified, and an optimizer is free to rearrange things. You can't predict the order of these two function calls, for example:
int a = foo() + bar();
Read up on sequence points to see what guarantees are made.
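If you do need a particular order, impose it yourself by splitting the expression into separate statements, since each full expression completes before the next one starts:
int foo();                // declared elsewhere, as in the snippet above
int bar();

int ordered()
{
    int a = foo();        // foo() is fully evaluated here
    a += bar();           // bar() runs strictly afterwards
    return a;
}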

Related

How much do C/C++ compilers optimize conditional statements?

I recently ran into a situation where I wrote the following code:
for(int i = 0; i < (size - 1); i++)
{
// do whatever
}
// Assume 'size' will be constant during the duration of the for loop
When looking at this code, it made me wonder how exactly the for loop condition is evaluated on each iteration. Specifically, I'm curious as to whether or not the compiler would 'optimize away' any additional arithmetic that has to be done for each iteration. In my case, would this code get compiled such that (size - 1) has to be evaluated on every loop iteration? Or is the compiler smart enough to realize that the 'size' variable won't change, so it could precalculate the value once rather than on every iteration?
This then got me thinking about the general case where you have a conditional statement that may specify more operations than necessary.
As an example, how would the following two pieces of code compile:
if(6)
if(1+1+1+1+1+1)
int foo = 1;
if(foo + foo + foo + foo + foo + foo)
How smart is the compiler? Will the 3 cases listed above be converted into the same machine code?
And while I'm at it, why not list another example? What does the compiler do if you are doing an operation within a conditional that won't have any effect on the end result? Example:
if(2*(val))
// Assume val is an int that can take on any value
In this example, the multiplication is completely unnecessary. While this case seems a lot stupider than my original case, the question still stands: will the compiler be able to remove this unnecessary multiplication?
Question:
How much optimization is involved with conditional statements?
Does it vary based on compiler?
Short answer: the compiler is exceptionally clever, and will generally optimise those cases that you have presented (including utterly ignoring irrelevant conditions).
One of the biggest hurdles language newcomers face in terms of truly understanding C++, is that there is not a one-to-one relationship between their code and what the computer executes. The entire purpose of the language is to create an abstraction. You are defining the program's semantics, but the computer has no responsibility to actually follow your C++ code line by line; indeed, if it did so, it would be abhorrently slow as compared to the speed we can expect from modern computers.
Generally speaking, unless you have a reason to micro-optimise (game developers come to mind), it is best to almost completely ignore this facet of programming, and trust your compiler. Write a program that takes the inputs you want, and gives the outputs you want, after performing the calculations you want… and let your compiler do the hard work of figuring out how the physical machine is going to make all that happen.
Are there exceptions? Certainly. Sometimes your requirements are so specific that you do know better than the compiler, and you end up optimising. You generally do this after profiling and determining what your bottlenecks are. And there's also no excuse to write deliberately silly code. After all, if you go out of your way to ask your program to copy a 50MB vector, then it's going to copy a 50MB vector.
But, assuming sensible code that means what it looks like, you really shouldn't spend too much time worrying about this. Modern compilers are so good at optimising that you'd be a fool to try to keep up.
The C++ language specification permits the compiler to make any optimization that results in no observable changes to the expected results.
If the compiler can determine that size is constant and will not change during execution, it can certainly make that particular optimization.
Alternatively, if the compiler can also determine that i is not used inside the loop (and its value is not used afterwards) - that it is used only as a counter - it might very well rewrite the loop to:
for(int i = 1; i < size; i++)
because that might produce smaller code. Even if this i is used in some fashion, the compiler can still make this change and then adjust all other usage of i so that the observable results are still the same.
To summarize: anything goes. The compiler may or may not make any optimization change as long as the observable results are the same.
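For illustration, that hoisting amounts to the following hand-written equivalent (a sketch; normally the compiler does this for you when size is provably unchanged):
void loop_over(int size)
{
    const int limit = size - 1;    // evaluated once, since size does not change inside the loop
    for (int i = 0; i < limit; i++)
    {
        // do whatever
    }
}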
Yes, there is a lot of optimization, and it is very complex.
It varies based on the compiler, and it also varies based on the compiler options.
Check https://meta.stackexchange.com/questions/25840/can-we-stop-recommending-the-dragon-book-please for some book recommendations if you really want to understand what a compiler may do. It is a very complex subject.
You can also compile to assembly with the -S option (gcc / g++) to see what the compiler is really doing. Use -O3 / ... / -O0 / -O to experiment with different optimization levels.

Performance impact of using 'break' inside 'for-loop'

I have done my best and read a lot of Q&As on SO.SE, but I haven't found an answer to my particular question. Most for-loop and break related questions refer to nested loops, while I am concerned with performance.
I want to know if using a break inside a for-loop has an impact on the performance of my C++ code (assuming the break gets almost never called). And if it has, I would also like to know tentatively how big the penalization is.
I am quite suspicious that it does indeed impact performance (although I do not know how much). So I wanted to ask you. My reasoning goes as follows:
Independently of the extra code for the conditional statements that trigger the break (like an if), it necessarily adds additional instructions to my loop.
Further, it probably also interferes when my compiler tries to unroll the for-loop, as it no longer knows the number of iterations that will run at compile time, effectively turning it into a while-loop.
Therefore, I suspect it does have a performance impact, which could be considerable for very fast and tight loops.
So this takes me to a follow-up question. Is a for-loop with a break performance-wise equal to a while-loop? Like in the following snippet, where we assume that checkCondition() evaluates to true 99.9% of the time. Do I lose the performance advantage of the for-loop?
// USING WHILE
int i = 100;
while( i-- && checkCondition())
{
// do stuff
}
// USING FOR
for(int i=100; i; --i)
{
if(checkCondition()) {
// do stuff
} else {
break;
}
}
I have tried it on my computer, but I get the same execution time. And being wary of the compiler and its optimization voodoo, I wanted to know the conceptual answer.
EDIT:
Note that I have measured the execution time of both versions in my complete code, without any real difference. Also, I do not trust compiling with -S (which I usually do) for this matter, as I am not interested in the particular result of my compiler. I am rather interested in the concept itself (in an academic sense), as I am not sure if I got this completely right :)
The principal answer is to avoid spending time on similar micro optimizations until you have verified that such condition evaluation is a bottleneck.
The real answer is that CPUs have powerful branch prediction circuits which empirically work really well.
What will happen is that your CPU will predict whether the branch is going to be taken or not, and execute the code as if the if condition were not even present. Of course this relies on multiple assumptions, such as the condition calculation not having side effects (that part of the loop body depends on), and the condition always evaluating to false up to a certain point, at which it becomes true and stops the loop.
Some compilers also allow you to specify the likeliness of an evaluation as a hint to the branch predictor.
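With GCC or Clang that hint looks roughly like this (__builtin_expect is a GCC/Clang extension; C++20 alternatively provides the [[likely]]/[[unlikely]] attributes):
bool checkCondition();                               // as in the snippets above

void run()
{
    for (int i = 100; i; --i)
    {
        if (__builtin_expect(!checkCondition(), 0))  // tell the compiler the break path is rare
            break;
        // do stuff
    }
}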
If you want to see the semantic difference between the two code versions, just compile them with -S and examine the generated asm code; there's no other magic way to do it.
The only sensible answer to "what is the performance impact of ...", is "measure it". There are very few generic answers.
In the particular case you show, it would be rather surprising if an optimising compiler generated significantly different code for the two examples. On the other hand, I can believe that a loop like:
unsigned sum = 0;
unsigned stop = -1;
for (int i = 0; i<32; i++)
{
stop &= checkcondition(); // returns 0 or all-bits-set;
sum += (stop & x[i]);
}
might be faster than:
unsigned sum = 0;
for (int i = 0; i<32; i++)
{
if (!checkcondition())
break;
sum += x[i];
}
for a particular compiler, for a particular platform, with the right optimization levels set, and for a particular pattern of "checkcondition" results.
... but the only way to tell would be to measure.

Intel C++ Compiler understanding what optimization is performed

I have a code segment which is as simple as:
for( int i = 0; i < n; ++i)
{
if( data[i] > c && data[i] < r )
{
--data[i];
}
}
It's a part of a large function and project. This is actually a rewrite of a different loop, which proved to be time consuming (long loops), but I was surprised by two things:
When data[i] was temporarily stored like this:
for( int i = 0; i < n; ++i)
{
const int tmp = data[i];
if( tmp > c && tmp < r )
{
--data[i];
}
}
It became much slower. I don't claim this should be faster, but I cannot understand why it should be so much slower; the compiler should be able to figure out whether tmp should be used or not.
But more importantly, when I moved the code segment into a separate function it became around four times slower. I wanted to understand what was going on, so I looked in the opt-report, and in both cases the loop is vectorized and seems to get the same optimizations.
So my question is: what can make such a difference for a function which is not called a million times, but is time consuming in itself? And what should I look for in the opt-report?
I could avoid it by just keeping it inlined, but the why is bugging me.
UPDATE:
I should underline that my main concern is to understand why it became slower when moved to a separate function. The code example given with the tmp variable was just a strange case I encountered during the process.
You're probably register starved, and the compiler is having to load and store. I'm pretty sure that the native x86 assembly instructions can take memory addresses to operate on - i.e., the compiler can keep those registers free. But by making it local, you may be changing the behaviour with respect to aliasing, and the compiler may not be able to prove that the faster version has the same semantics, especially if there is some form of multithreading in here that allows the code to change.
The code was likely slower when moved into a separate function because function calls not only can break the pipeline, but also create poor instruction-cache performance (there's extra code for parameter push/pop, etc.).
Lesson: Let the compiler do the optimizing; it's smarter than you. I don't mean that as an insult; it's smarter than me too. But really, especially the Intel compiler, those guys know what they're doing when targeting their own platform.
Edit: More importantly, you need to recognize that compilers are targeted at optimizing unoptimized code. They're not targeted at recognizing half-optimized code. Specifically, the compiler will have a set of triggers for each optimization, and if you happen to write your code in such a way that they're not hit, you can avoid optimizations being performed even if the code is semantically identical.
And you also need to consider implementation cost. Not every function ideal for inlining can be inlined - simply because inlining that logic is too complex for the compiler to handle. I know that VC++ will rarely inline functions containing loops, even if the inlining yields a benefit. You may be seeing this in the Intel compiler - the compiler writers may simply have decided that it wasn't worth the time to implement.
I encountered this when dealing with loops in VC++ - the compiler would produce different assembly for two loops in slightly different formats, even though they both achieved the same result. Of course, their Standard library used the ideal format. You may observe a speedup by using std::for_each and a function object.
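For reference, a std::for_each version of the loop from the question might look like this (a sketch, assuming data is a plain int array and c and r are the same bounds as before):
#include <algorithm>

struct Decrementer
{
    int c, r;
    void operator()(int &v) const
    {
        if (v > c && v < r)
            --v;
    }
};

void process(int *data, int n, int c, int r)
{
    std::for_each(data, data + n, Decrementer{c, r});   // same effect as the original loop
}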
You're right, the compiler should be able to identify that as unused code and remove it/not compile it. That doesn't mean it actually does identify it and remove it.
Your best bet is to look at the generated assembly and check to see exactly what is going on. Remember, just because a clever compiler could figure out how to do an optimization doesn't mean it actually will.
If you do check, and see that the code is not removed, you might want to report that to the Intel compiler team. It sounds like they might have a bug.

Can gcc/g++ tell me when it ignores my register?

When compiling C/C++ code using gcc/g++, if it ignores my register keyword, can it tell me?
For example, in this code
int main()
{
register int j;
int k;
for(k = 0; k < 1000; k++)
for(j = 0; j < 32000; j++)
;
return 0;
}
j will be used as a register, but in this code
int main()
{
register int j;
int k;
for(k = 0; k < 1000; k++)
for(j = 0; j < 32000; j++)
;
int * a = &j;
return 0;
}
j will be a normal variable.
Can it tell me whether a variable I declared register is really stored in a CPU register?
You can fairly assume that GCC ignores the register keyword except perhaps at -O0. However, it shouldn't make a difference one way or another, and if you are in such depth, you should already be reading the assembly code.
Here is an informative thread on this topic: http://gcc.gnu.org/ml/gcc/2010-05/msg00098.html . Back in the old days, register indeed helped compilers to allocate a variable into registers, but today register allocation can be accomplished optimally, automatically, without hints. The keyword does continue to serve two purposes in C:
In C, it prevents you from taking the address of a variable. Since registers don't have addresses, this restriction can help a simple C compiler. (Simple C++ compilers don't exist.)
A register object cannot be declared restrict. Because restrict pertains to addresses, their intersection is pointless. (C++ does not yet have restrict, and anyway, this rule is a bit trivial.)
For C++, the keyword has been deprecated since C++11 and proposed for removal from the standard revision scheduled for 2017.
Some compilers have used register on parameter declarations to determine the calling convention of functions, with the ABI allowing mixed stack- and register-based parameters. This seems to be nonconforming, it tends to occur with extended syntax like register("A1"), and I don't know whether any such compiler is still in use.
With respect to modern compilation and optimization techniques, the register annotation does not make any sense at all. In your second program you take the address of j, and registers do not have addresses, but one and the same local or static variable could perfectly well be stored in two different memory locations during its lifetime, or sometimes in memory and sometimes in a register, or not exist at all. Indeed, an optimizing compiler would compile your nested loops as nothing, because they do not have any effects, and would simply assign their final values to k and j. And then it would omit even those assignments, because the remaining code does not use the values.
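In other words, with optimisation enabled the first example effectively collapses to this (equivalent source, not literal compiler output):
int main()
{
    // the loops have no observable effect, so they are removed entirely
    return 0;
}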
You can't take the address of a register variable in C, plus the compiler can totally ignore you; C99 standard, section 6.7.1 (pdf):
The implementation may treat any register declaration simply as an auto declaration. However, whether or not addressable storage is actually used, the address of any part of an object declared with storage-class specifier register cannot be computed, either explicitly (by use of the unary & operator as discussed in 6.5.3.2) or implicitly (by converting an array name to a pointer as discussed in 6.3.2.1). Thus, the only operator that can be applied to an array declared with storage-class specifier register is sizeof.
Unless you're mucking around on 8-bit AVRs or PICs, the compiler will probably laugh at you thinking you know best and ignore your pleas. Even on them, I've thought I knew better a couple times and found ways to trick the compiler (with some inline asm), but my code exploded because it had to massage a bunch of other data to work around my stubbornness.
This question, some of the answers, and several other discussions of the 'register' keyword I've seen seem to assume implicitly that all locals are mapped either to a specific register or to a specific memory location on the stack. This was generally true until 15-25 years ago, and it's true if you turn off optimization, but it's not true at all when standard optimization is performed. Locals are now seen by optimizers as symbolic names that you use to describe the flow of data, rather than as values that need to be stored in specific locations.
Note: by 'locals' here I mean: scalar variables, of storage class auto (or 'register'), which are never used as the operand of '&'. Compilers can sometimes break up auto structs, unions or arrays into individual 'local' variables, too.
To illustrate this: suppose I write this at the top of a function:
int factor = 8;
.. and then the only use of the factor variable is to multiply by various things:
arr[i + factor*j] = arr[i - factor*k];
In this case - try it if you want - there will be no factor variable. The code analysis will show that factor is always 8, and so all the multiplications will turn into shifts by 3 (<<3). If you did the same thing in 1985 C, factor would get a location on the stack, and there would be multiplies, since compilers basically worked one statement at a time and didn't remember anything about the values of variables. Back then, programmers would be more likely to use #define factor 8 to get better code in this situation, while keeping factor adjustable.
If you use -O0 (optimization off) - you will indeed get a variable for factor. This will allow you, for instance, to step over the factor=8 statement, and then change factor to 11 with the debugger, and keep going. In order for this to work, the compiler can't keep anything in registers between statements, except for variables which are assigned to specific registers; and in that case the debugger is informed of this. And it can't try to 'know' anything about the values of variables, since the debugger could change them. In other words, you need the 1985 situation if you want to change local variables while debugging.
Modern compilers generally compile a function as follows:
(1) when a local variable is assigned to more than once in a function, the compiler creates different 'versions' of the variable so that each one is only assigned in one place. All of the 'reads' of the variable refer to a specific version.
(2) Each of these locals is assigned to a 'virtual' register. Intermediate calculation results are also assigned variables/registers; so
a = b*c + 2*k;
becomes something like
t1 = b*c;
t2 = 2;
t3 = k*t2;
a = t1 + t3;
(3) The compiler then takes all these operations, and looks for common subexpressions, etc. Since each of the new registers is only ever written once, it is rather easier to rearrange them while maintaining correctness. I won't even start on loop analysis.
(4) The compiler then tries to map all these virtual registers onto actual registers in order to generate code. Since each virtual register has a limited lifetime, it is possible to reuse actual registers heavily - 't1' in the above is only needed until the add which generates 'a', so it could be held in the same register as 'a'. When there are not enough registers, some of the virtual registers can be allocated to memory -- or -- a value can be held in a certain register, stored to memory for a while, and loaded back into a (possibly) different register later. On a load-store machine, where only values in registers can be used in computations, this second strategy accommodates that nicely.
From the above, this should be clear: it's easy to determine that the virtual register mapped to factor is the same as the constant '8', and so all multiplications by factor are multiplications by 8. Even if factor is modified later, that's a 'new' variable and it doesn't affect prior uses of factor.
Another implication, if you write
vara = varb;
.. it may or may not be the case that there is a corresponding copy in the code. For instance
int *resultp= ...
int acc = arr[0] + arr[1];
int acc0 = acc; // save this for later
int more = func(resultp,3)+ func(resultp,-3);
acc += more; // add some more stuff
if( ...){
resultp = getptr();
resultp[0] = acc0;
resultp[1] = acc;
}
In the above, the two 'versions' of acc (initial, and after adding 'more') could be in two different registers, and 'acc0' would then be the same as the initial 'acc'. So no register copy would be needed for 'acc0 = acc'.
Another point: the 'resultp' is assigned to twice, and since the second assignment ignores the previous value, there are essentially two distinct 'resultp' variables in the code, and this is easily determined by analysis.
An implication of all this: don't be hesitant to break out complex expressions into smaller ones using additional locals for intermediates, if it makes the code easier to follow. There is basically zero run-time penalty for this, since the optimizer sees the same thing anyway.
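For example, these two functions compile to identical code at any reasonable optimisation level (the names are illustrative):
double norm2(double x, double y)
{
    double xx = x * x;        // named intermediates for readability...
    double yy = y * y;
    return xx + yy;
}

double norm2_terse(double x, double y)
{
    return x * x + y * y;     // ...or a single expression: the optimizer sees the same thing
}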
If you're interested in learning more you could start here: http://en.wikipedia.org/wiki/Static_single_assignment_form
The point of this answer is to (a) give some idea of how modern compilers work and (b) point out that asking the compiler, if it would be so kind, to put a particular local variable into a register -- doesn't really make sense. Each 'variable' may be seen by the optimizer as several variables, some of which may be heavily used in loops, and others not. Some variables will vanish -- e.g. by being constant; or, sometimes, the temp variable used in a swap. Or calculations not actually used. The compiler is equipped to use the same register for different things in different parts of the code, according to what's actually best on the machine you are compiling for.
The notion of hinting the compiler as to which variables should be in registers assumes that each local variable maps to a register or to a memory location. This was true back when Kernighan + Ritchie designed the C language, but is not true any more.
Regarding the restriction that you can't take the address of a register variable: Clearly, there's no way to implement taking the address of a variable held in a register, but you might ask - since the compiler has discretion to ignore the 'register' - why is this rule in place? Why can't the compiler just ignore the 'register' if I happen to take the address? (as is the case in C++).
Again, you have to go back to the old compiler. The original K+R compiler would parse a local variable declaration, and then immediately decide whether to assign it to a register or not (and if so, which register). Then it would proceed to compile expressions, emitting the assembler for each statement one at a time. If it later found that you were taking the address of a 'register' variable, which had been assigned to a register, there was no way to handle that, since the assignment was, in general, irreversible by then. It was possible, however, to generate an error message and stop compiling.
Bottom line, it appears that 'register' is essentially obsolete:
C++ compilers ignore it completely
C compilers ignore it except to enforce the restriction about & - and possibly don't ignore it at -O0 where it could actually result in allocation as requested. At -O0 you aren't concerned about code speed though.
So, it's basically there now for backward compatibility, and probably on the basis that some implementations could still be using it for 'hints'. I never use it -- and I write real-time DSP code, and spend a fair bit of time looking at generated code and finding ways to make it faster. There are many ways to modify code to make it run faster, and knowing how compilers work is very helpful. It's been a long time indeed since I last found that adding 'register' to be among those ways.
Addendum
I excluded above, from my special definition of 'locals', variables to which & is applied (these are of course included in the usual sense of the term).
Consider the code below:
void
somefunc()
{
    int h,w;
    int i,j;
    extern int pitch;
    get_hw( &h,&w ); // get shape of array
    for( int i = 0; i < h; i++ ){
        for( int j = 0; j < w; j++ ){
            Arr[i*pitch + j] = generate_func(i,j);
        }
    }
}
This may look perfectly harmless. But if you are concerned about execution speed, consider this: The compiler is passing the addresses of h and w to get_hw, and then later calling generate_func. Let's assume the compiler knows nothing about what's in those functions (which is the general case). The compiler must assume that the call to generate_func could be changing h or w. That's a perfectly legal use of the pointer passed to get_hw - you could store it somewhere and then use it later, as long as the scope containing h,w is still in play, to read or write those variables.
Thus the compiler must store h and w in memory on the stack, and can't determine anything in advance about how long the loop will run. So certain optimizations will be impossible, and the loop could be less efficient as a result (in this example, there's a function call in the inner loop anyway, so it may not make much of a difference, but consider the case where there's a function which is occasionally called in the inner loop, depending on some condition).
Another issue here is that generate_func could change pitch, and so i*pitch needs to be computed each time, rather than only when i changes.
It can be recoded as:
void
somefunc()
{
    int h0,w0;
    int h,w;
    int i,j;
    extern int pitch;
    int apit = pitch;
    get_hw( &h0,&w0 ); // get shape of array
    h = h0;
    w = w0;
    for( int i = 0; i < h; i++ ){
        for( int j = 0; j < w; j++ ){
            Arr[i*apit + j] = generate_func(i,j);
        }
    }
}
Now the variables apit,h,w are all 'safe' locals in the sense I defined above, and the compiler can be sure they won't be changed by any function calls. Assuming I'm not modifying anything in generate_func, the code will have the same effect as before but could be more efficient.
Jens Gustedt has suggested the use of the 'register' keyword as a way of tagging key variables to inhibit the use of & on them, e.g. by others maintaining the code (It won't affect the generated code, since the compiler can determine the lack of & without it). For my part, I always think carefully before applying & to any local scalar in a time-critical area of the code, and in my view using 'register' to enforce this is a little cryptic, but I can see the point (unfortunately it doesn't work in C++ since the compiler will just ignore the 'register').
Incidentally, in terms of code efficiency, the best way to have a function return two values is with a struct:
struct hw { // this is what get_hw returns
    int h,w;
};

void
somefunc()
{
    int h,w;
    int i,j;
    struct hw hwval = get_hw(); // get shape of array
    h = hwval.h;
    w = hwval.w;
    ...
This may look cumbersome (and is cumbersome to write), but it will generate cleaner code than the previous examples. The 'struct hw' will actually be returned in two registers (on most modern ABIs anyway). And due to the way the 'hwval' struct is used, the optimizer will effectively break it up into two 'locals' hwval.h and hwval.w, and then determine that these are equivalent to h and w -- so hwval will essentially disappear in the code. No pointers need to be passed, no function is modifying another function's variables via pointer; it's just like having two distinct scalar return values. This is much easier to do now in C++11 - with std::tie and std::tuple, you can use this method with less verbosity (and without having to write a struct definition).
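For instance, with a hypothetical tuple-returning get_hw, the C++11 version might look like this (a sketch):
#include <tuple>

std::tuple<int, int> get_hw();        // hypothetical version returning (h, w) as a tuple

void somefunc()
{
    int h, w;
    std::tie(h, w) = get_hw();        // unpack both return values; no pointers involved
    // ...
}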
Your second example is invalid in C. So you can see that the register keyword does change something (in C). It is there for precisely this purpose: to inhibit taking the address of a variable. So don't take its name "register" literally - it is a misnomer - but stick to its definition.
That C++ seems to ignore register - well, they must have their reasons for that, but I find it kind of sad to again find one of these subtle differences where valid code for one language is invalid for the other.

Is Loop Hoisting still a valid manual optimization for C code?

Using the latest gcc compiler, do I still have to think about these types of manual loop optimizations, or will the compiler take care of them for me well enough?
If your profiler tells you there is a problem with a loop, and only then, a thing to watch out for is a memory reference in the loop which you know is invariant across the loop but the compiler does not. Here's a contrived example, bubbling an element out to the end of an array:
for ( ; i < a->length - 1; i++)
swap_elements(a, i, i+1);
You may know that the call to swap_elements does not change the value of a->length, but if the definition of swap_elements is in another source file, it is quite likely that the compiler does not. Hence it can be worthwhile hoisting the computation of a->length out of the loop:
int n = a->length;
for ( ; i < n - 1; i++)
swap_elements(a, i, i+1);
On performance-critical inner loops, my students get measurable speedups with transformations like this one.
Note that there's no need to hoist the computation of n-1; any optimizing compiler is perfectly capable of discovering loop-invariant computations among local variables. It's memory references and function calls that may be more difficult. And the code with n-1 is more manifestly correct.
As others have noted, you have no business doing any of this until you've profiled and have discovered that the loop is a performance bottleneck that actually matters.
Write the code, profile it, and only think about optimising it when you have found something that is not fast enough, and you can't think of an alternative algorithm that will reduce/avoid the bottleneck in the first place.
With modern compilers, this advice is even more important - if you write simple clean code, the compiler's optimiser can often do a better job of optimising the code than it can if you try to give it snazzy "pre-optimised" code.
Check the generated assembly and see for yourself. See if the computation for the loop-invariant code is being done inside the loop or outside the loop in the assembly code that your compiler generates. If it's failing to do the loop hoisting, do the hoisting yourself.
But as others have said, you should always profile first to find your bottlenecks. Once you've determined that this is in fact a bottleneck, only then should you check to see if the compiler's performing loop hoisting (aka loop-invariant code motion) in the hot spots. If it's not, help it out.
Compilers generally do an excellent job with this type of optimization, but they do miss some cases. Generally, my advice is: write your code to be as readable as possible (which may mean that you hoist loop invariants -- I prefer to read code written that way), and if the compiler misses optimizations, file bugs to help fix the compiler. Only put the optimization into your source if you have a hard performance requirement that can't wait on a compiler fix, or the compiler writers tell you that they're not going to be able to address the issue.
Where they are likely to be important to performance, you still have to think about them.
Loop hoisting is most beneficial when the value being hoisted takes a lot of work to calculate. If it takes a lot of work to calculate, it's probably a call out of line. If it's a call out of line, the latest version of gcc is much less likely than you are to figure out that it will return the same value every time.
Sometimes people tell you to profile first. They don't really mean it, they just think that if you're smart enough to figure out when it's worth worrying about performance, then you're smart enough to ignore their rule of thumb. Obviously, the following code might as well be "prematurely optimized", whether you have profiled or not:
#include <iostream>

bool isPrime(int p) {
    for (int i = 2; i*i <= p; ++i) {
        if ((p % i) == 0) return false;
    }
    return true;
}

int countPrimesLessThan(int max) {
    int count = 0;
    for (int i = 2; i < max; ++i) {
        if (isPrime(i)) ++count;
    }
    return count;
}

int main() {
    for (int i = 0; i < 10; ++i) {
        std::cout << "The number of primes less than 1 million is: ";
        std::cout << countPrimesLessThan(1000*1000);
        std::cout << std::endl;
    }
}
It takes a "special" approach to software development not to manually hoist that call to countPrimesLessThan out of the loop, whether you've profiled or not.
Early optimizations are bad only if other aspects - like readability, clarity of intent, or structure - are negatively affected.
If you have to declare it anyway, loop hoisting can even improve clarity, and it explicitly documents your assumption "this value doesn't change".
As a rule of thumb I wouldn't hoist the count/end iterator for a std::vector, because it's a common scenario easily optimized. I wouldn't hoist anything that I can trust my optimizer to hoist, and I wouldn't hoist anything known to be non-critical - e.g. when running through a list of a dozen windows to respond to a button click. Even if it takes 50ms, it will still appear "instantaneous" to the user. (But even that is a dangerous assumption: if a new feature requires looping 20 times over this same code, it suddenly is slow.) You should still hoist operations such as opening a file handle to append, etc.
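For reference, the std::vector case mentioned above is the kind of hoist I would normally leave to the compiler (a sketch, with process standing in for the loop body):
#include <cstddef>
#include <vector>

void process(int);                     // assumed to be defined elsewhere

void run(const std::vector<int> &v)
{
    const std::size_t n = v.size();    // hoisted once; compilers usually manage this themselves
    for (std::size_t i = 0; i < n; ++i)
        process(v[i]);
}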
In many cases - very well in loop hoisting - it helps a lot to consider relative cost: what is the cost of the hoisted calculation compared to the cost of running through the body?
As for optimizations in general, there are quite some cases where the profiler doesn't help. Code may have very different behavior depending on the call path. Library writers often don't know their call path or frequency. Isolating a piece of code to make things comparable can already alter the behavior significantly. The profiler may tell you "Loop X is slow", but it won't tell you "Loop X is slow because call Y is thrashing the cache for everyone else". A profiler couldn't tell you "this code is fast because of your snarky CPU, but it will be slow on Steve's computer".
A good rule of thumb is usually that the compiler performs the optimizations it is able to.
Does the optimization require any knowledge about your code that isn't immediately obvious to the compiler? Then it is hard for the compiler to apply the optimization automatically, and you may want to do it yourself
In most cases, loop hoisting is a fully automatic process requiring no high-level knowledge of the code -- just a lot of lifetime and dependency analysis, which is what the compiler excels at in the first place.
It is possible to write code where the compiler is unable to determine whether something can be hoisted out safely though -- and in those cases, you may want to do it yourself, as it is a very efficient optimization.
As an example, take the snippet posted by Steve Jessop:
for (int i = 0; i < 10; ++i) {
std::cout << "The number of primes less than 1 billion is: ";
std::cout << countPrimesLessThan(1000*1000*1000);
std::cout << std::endl;
}
Is it safe to hoist out the call to countPrimesLessThan? That depends on how and where the function is defined. What if it has side effects? It may make an important difference whether it is called once or ten times, as well as when it is called. If we don't know how the function is defined, we can't move it outside the loop. And the same is true if the compiler is to perform the optimization.
Is the function definition visible to the compiler? And is the function short enough that we can trust the compiler to inline it, or at least analyze the function for side effects? If so, then yes, it will hoist it outside the loop.
If the definition is not visible, or if the function is very big and complicated, then the compiler will probably assume that the function call can not be moved safely, and then it won't automatically hoist it out.
Remember the 80-20 rule (80% of execution time is spent in the critical 20% of the code).
There is no point in optimizing code which has no significant effect on the program's overall efficiency.
One should not bother with this kind of local optimization. The best approach is to profile the code to figure out the critical parts of the program that consume the most CPU cycles and try to optimize those. That kind of optimization really makes sense and will result in improved program efficiency.