I am running a while loop in 4 threads; in the loop I evaluate a function and increment a counter.
while (1) {
    int fitness = EnergyFunction::evaluate(sequence);
    mutex.lock();
    counter++;
    mutex.unlock();
}
When I run this loop in the 4 threads, as I said, I get ~ 20 000 000 evaluations per second.
while (1) {
    if (dist(mt) == 0) {
        sequence[distDim(mt)] = -1;
    } else {
        sequence[distDim(mt)] = 1;
    }
    int fitness = EnergyFunction::evaluate(sequence);
    mainMTX.lock();
    overallGeneration++;
    mainMTX.unlock();
}
If I add some random mutation to the sequence, I get ~ 13 000 000 evaluations per second.
while (1) {
    if (dist(mt) == 0) {
        sequence[distDim(mt)] = -1;
    } else {
        sequence[distDim(mt)] = 1;
    }
    int fitness = EnergyFunction::evaluate(sequence);
    mainMTX.lock();
    if (fitness < overallFitness)
        overallFitness = fitness;
    overallGeneration++;
    mainMTX.unlock();
}
But when I add a simple if statement that checks whether the new fitness is smaller than the old fitness, and if so replaces the old fitness with the new one, the performance loss is massive! Now I get ~ 20 000 evaluations per second. If I remove the random mutation part, I also get ~ 20 000 evaluations per second.
Variable overallFitness is declared as
extern int overallFitness;
I am having trouble figuring out the cause of such a big performance loss. Is comparing two ints really such a time-consuming operation? Also, I don't believe it is related to the mutex locking.
UPDATE
This performance loss was not because of branch prediction; the compiler was simply ignoring the call int fitness = EnergyFunction::evaluate(sequence);.
Now I have added volatile and the compiler no longer ignores the call.
Also, thank you for pointing out branch misprediction and atomic<int>; I didn't know about them!
Thanks to atomic I could also remove the mutex part, so the final code looks like this:
while (1) {
    sequence[distDim(mt)] = lookup_Table[dist(mt)];
    fitness = EnergyFunction::evaluate(sequence);
    if (fitness < overallFitness)
        overallFitness = fitness;
    ++overallGeneration;
}
Now I am getting ~ 25 000 evaluations per second.
You need to run a profiler to get to the bottom of this. On Linux, use perf.
My guess is that EnergyFunction::evaluate() is being entirely optimized away, because in the first examples you don't use the result, so the compiler can discard the whole thing. You can try writing the return value to a volatile variable, which should force the compiler or linker not to optimize the call away. A 1000x speed-up is definitely not attributable to a simple comparison.
There is actually an atomic instruction to increase an int by 1, so a smart compiler may be able to remove the mutex entirely, although I'd be surprised if it did. You can test this by looking at the assembly, or by removing the mutex, changing the type of overallGeneration to atomic<int>, and checking how fast it still is. This optimization is no longer possible with your last, slow example.
Also, if the compiler can see that evaluate does nothing to the global state and the result isn't used, then it can skip the entire call to evaluate. You can find out whether that's the case by looking at the assembly, or by removing the call to EnergyFunction::evaluate(sequence) and looking at the timing: if it doesn't speed up, the function wasn't being called in the first place. This optimization is no longer possible with your last, slow example. You should be able to stop the compiler from eliding EnergyFunction::evaluate(sequence) by defining the function in a different object file (another cpp file or a library) and disabling link-time optimization.
There are other effects here that also create a performance difference, but I can't see any other effects that can explain a difference of factor 1000. A factor 1000 usually means the compiler cheated in the previous test and the change now prevents it from cheating.
I am not sure that my answer explains such a dramatic performance drop, but the following definitely may have an impact on it.
In the first case you added branches to the non-critical area:
if (dist(mt) == 0) {
    sequence[distDim(mt)] = -1;
} else {
    sequence[distDim(mt)] = 1;
}
In this case the CPU (at least IA) will perform branch prediction, and in the case of a branch misprediction there is a performance penalty; this is a known fact.
Now regarding the second addition, you added a branch to the critical area:
mainMTX.lock();
if (fitness < overallFitness)
    overallFitness = fitness;
overallGeneration++;
mainMTX.unlock();
This, in turn, in addition to the misprediction penalty, increases the amount of code executed in the critical area and thus the probability that other threads will have to wait at mainMTX.lock();.
NOTE
Please make sure that all the global/shared resources are declared volatile. Otherwise the compiler may optimize them out (which might explain the very high number of evaluations at the beginning).
In the case of overallFitness it probably won't be optimized out because it is declared extern, but overallGeneration may be. If that is the case, it may explain this performance drop after the "real" memory access was added in the critical area.
NOTE2
I am still not sure that the explanation I provided can account for such a significant performance drop. So I believe there might be some implementation details in the code which you didn't post (like volatile, for example).
EDIT
As Peter (@Peter) and Mark Lakata (@MarkLakata) stated in their separate answers, and I tend to agree with them, the most likely reason for the performance drop is that in the first case fitness was never used, so the compiler simply optimized that variable out together with the function call, while in the second case fitness was used, so the compiler did not. Good catch, Peter and Mark! I just missed that point.
I realize this is not strictly an answer to the question but an alternative to the problem presented as it was.
Is overallGeneration used while the code is running? That is, is it perhaps used to determine when to stop computation? If it is not, you could forgo synchronizing a global counter, have a counter per thread, and after the computation is done sum up all the per-thread counters to a grand total. Similarly for overallFitness, you could keep track of a bestFitness per thread and pick the minimum of the four results once computation is over.
Having no thread synchronization at all would get you 100% CPU usage.
Related
I have done my best and read a lot of Q&As on SO.SE, but I haven't found an answer to my particular question. Most for-loop and break related questions refer to nested loops, whereas I am concerned with performance.
I want to know if using a break inside a for-loop has an impact on the performance of my C++ code (assuming the break gets almost never called). And if it has, I would also like to know tentatively how big the penalization is.
I quite suspect that it does indeed impact performance (although I do not know how much). So I wanted to ask you. My reasoning goes as follows:

Independently of the extra code for the conditional statement that triggers the break (like an if), it necessarily adds additional instructions to my loop.

Further, it probably also interferes when my compiler tries to unroll the for-loop, as it no longer knows at compile time the number of iterations that will run, effectively turning it into a while-loop.

Therefore, I suspect it does have a performance impact, which could be considerable for very fast and tight loops.
So this takes me to a follow-up question: is a for-loop with a break performance-wise equal to a while-loop? Like in the following snippet, where we assume that checkCondition() evaluates to true 99.9% of the time. Do I lose the performance advantage of the for-loop?
// USING WHILE
int i = 100;
while (i-- && checkCondition())
{
    // do stuff
}

// USING FOR
for (int i = 100; i; --i)
{
    if (checkCondition()) {
        // do stuff
    } else {
        break;
    }
}
I have tried it on my computer, but I get the same execution time. And being wary of the compiler and its optimization voodoo, I wanted to know the conceptual answer.
EDIT:
Note that I have measured the execution time of both versions in my complete code, without any real difference. Also, I do not trust compiling with -S (which I usually do) for this matter, as I am not interested in the particular result of my compiler. I am rather interested in the concept itself (in an academic sense), as I am not sure I got this completely right :)
The principal answer is to avoid spending time on such micro-optimizations until you have verified that the condition evaluation is actually a bottleneck.
The real answer is that CPUs have powerful branch prediction circuits which empirically work really well.
What will happen is that your CPU will guess whether the branch is going to be taken and execute the code as if the if condition were not even present. Of course this relies on multiple assumptions, such as the condition calculation having no side effects (so that no part of the loop body depends on it) and the condition always evaluating to false up to a certain point, at which it becomes true and stops the loop.
Some compilers also let you specify the likeliness of an evaluation as a hint to the branch predictor.
If you want to see the semantic difference between the two code versions, just compile them with -S and examine the generated asm code; there is no other magic way to do it.
The only sensible answer to "what is the performance impact of ...", is "measure it". There are very few generic answers.
In the particular case you show, it would be rather surprising if an optimising compiler generated significantly different code for the two examples. On the other hand, I can believe that a loop like:
unsigned sum = 0;
unsigned stop = -1;
for (int i = 0; i < 32; i++)
{
    stop &= checkcondition(); // returns 0 or all-bits-set
    sum += (stop & x[i]);
}
might be faster than:
unsigned sum = 0;
for (int i = 0; i < 32; i++)
{
    if (!checkcondition())
        break;
    sum += x[i];
}
for a particular compiler, for a particular platform, with the right optimization levels set, and for a particular pattern of "checkcondition" results.
... but the only way to tell would be to measure.
I've heard so many times that an optimizer may reorder your code that I'm starting to believe it.
Are there any examples or typical cases where this might happen, and how can I avoid such a thing (e.g. I want a benchmark to be impervious to this)?
There are LOTS of different kinds of "code-motion" (moving code around), and it's caused by lots of different parts of the optimisation process:
Moving instructions around, because it's a waste of time to wait for a memory read to complete without putting at least one or two instructions between the read and the operation using the content we got from memory.

Moving things out of loops, because they only need to happen once. (If you call x = sin(y) once or 1000 times without changing y, x will have the same value, so there is no point in doing it inside a loop; the compiler moves it out.)

Moving code around based on "the compiler expects this code to be hit more often than the other bit, so we get a better cache-hit ratio if we do it this way"; for example, error handling being moved away from the source of the error, because it's unlikely that you get an error (compilers often understand commonly used functions and that they typically result in success).

Inlining: code is moved from the actual function into the calling function. This often leads to other effects, such as a reduction in pushing/popping registers on the stack, and arguments can be kept where they are rather than having to be moved to the "right place for arguments".
I'm sure I've missed some cases in the above list, but this is certainly some of the most common.
The compiler is perfectly within its rights to do this, as long as there is no "observable difference" (other than the time it takes to run and the number of instructions used; those "don't count" as observable differences when it comes to compilers).
There is very little you can do to prevent the compiler from reordering your code, but you can write code that ensures the order to some degree. For example, we can have code like this:
{
    int sum = 0;
    for (i = 0; i < large_number; i++)
        sum += i;
}
Now, since sum isn't being used, the compiler can remove it. Adding some code that prints the sum would ensure that it is "used" as far as the compiler is concerned.
Likewise:
for (i = 0; i < large_number; i++)
{
    do_stuff();
}
if the compiler can figure out that do_stuff doesn't actually change any global value, or similar, it will move code around to form this:
do_stuff();
for (i = 0; i < large_number; i++)
{
}
The compiler may also remove (in fact, almost certainly will remove) the now-empty loop so that it doesn't exist at all. [As mentioned in the comments: if do_stuff doesn't actually change anything outside itself, it may be removed as well, but the example I had in mind is one where do_stuff produces a result, and the result is the same each time.]
(The above happens if you remove the printout of results in the Dhrystone benchmark for example, since some of the loops calculate values that are never used other than in the printout - this can lead to benchmark results that exceed the highest theoretical throughput of the processor by a factor of 10 or so - because the benchmark assumes the instructions necessary for the loop were actually there, and says it took X nominal operations to execute each iteration)
There is no easy way to ensure this doesn't happen, aside from ensuring that do_stuff either updates some variable outside the function, or returns a value that is "used" (e.g. summing up, or something).
Another example of removing/omitting code is where you store to the same variable multiple times:
int x;
for (i = 0; i < large_number; i++)
    x = i * i;
can be replaced with:
x = (large_number-1) * (large_number-1);
Sometimes you can use volatile to ensure that something REALLY happens, but in a benchmark that CAN be detrimental, since the compiler then also can't optimise code that it SHOULD optimise (if you are not careful with how you use volatile).
If you have some SPECIFIC code that you care particularly about, it would be best to post it (and compile it with several state of the art compilers, and see what they actually do with it).
[Note that moving code around is definitely not a BAD thing in general - I do want my compiler (whether it is the one I'm writing myself, or one that I'm using that was written by someone else) to make optimisation by moving code, because, as long as it does so correctly, it will produce faster/better code by doing so!]
Most of the time, reordering is only allowed in situations where the observable effects of the program are the same - this means you shouldn't be able to tell.
Counterexamples do exist; for example, the order of evaluation of operands is unspecified, and an optimizer is free to rearrange things. You can't predict the order of these two function calls, for example:
int a = foo() + bar();
Read up on sequence points to see what guarantees are made.
while (some_condition) {
    if (FIRST)
    {
        do_this;
    }
    else
    {
        do_that;
    }
}
In my program the possibility of if(FIRST) succeeding is about 1 in 10000. Is there any alternative in C/C++ that avoids checking the condition on every iteration inside the while loop, in the hope of seeing better performance in this case?
OK! Let me put in some more detail.
I am writing code for a signal acquisition and tracking scheme where the state of my system will remain in TRACKING mode more often than in ACQUISITION mode.
while (signal_present)
{
    if (ACQUISITION_SUCCEEDED)
    {
        do_tracking();    // this function can change the state from TRACKING to ACQUISITION
    }
    else
    {
        do_acquisition(); // this function can change the state from ACQUISITION to TRACKING
    }
}
So what happens here is that the system usually remains in tracking mode, but it can enter acquisition mode when tracking fails; this is not a common occurrence. (Assume the incoming data to be infinite in number.)
The performance cost of a single branch is not going to be a big deal. The only thing you really can do is put the most likely code first, save on some instruction cache. Maybe. This is really deep into micro-optimization.
There is no particularly good reason to try to optimize this. Almost all modern architectures incorporate branch predictors. These speculate that a branch (an if or else) will be taken essentially the way it has been in the past. In your case, the speculation will always succeed, eliminating all overhead. There are non-portable ways to hint that a condition is taken one way or another, but any branch predictor will work just as well.
One thing you might want to do to improve instruction-cache locality is to move do_that out of the while loop (unless it is a function call).
The GCC has a __builtin_expect “function” that you can use to indicate to the compiler which branch will likely be taken. You could use it like this:
if(__builtin_expect(FIRST, 1)) …
Is this useful? I have no idea. I have never used it, never seen it used (except allegedly in the Linux kernel). The GCC documentation actually discourages its usage in favour of using profiling information to achieve a more reliable metric.
On recent x86 processors, final execution speed will barely depend on the source code formulation.
You can have a look at this page http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/ to see the amount of optimization that occurs inside the processor.
If this test is really consuming significant time compared to the implementation of do_acquisition, then you might get a boost by having a function table:
typedef void (*trackfunc)(void);
trackfunc tracking_action[] = {do_acquisition, do_tracking};
while (signal_present)
{
    tracking_action[ACQUISITION_STATE]();
}
The effects of these kinds of manual optimizations are very dependent on the platform, the compiler, and the optimization settings.
You will most likely get a much greater performance gain by spending your time measuring and tuning the do_acquisition and do_tracking algorithms.
If you don't know when "FIRST" will be true, then no.
The issue is whether FIRST is time consuming or not; maybe you could evaluate FIRST before the loop (or part of it) and just test the boolean.
I'd change moonshadow's code a little bit to
while (some_condition)
{
    do_that;
    if (FIRST)
    {
        do_this; // overwrite what you did earlier
    }
}
Based on your new information, I'd say something like the following:
while (some_condition)
{
    while (ACQUISITION_SUCCEEDED)
    {
        do_tracking();
    }
    if (some_condition)
        while (!ACQUISITION_SUCCEEDED)
        {
            do_acquisition();
        }
}
The point is that the ACQUISITION_SUCCEEDED state must incorporate the some_condition information to a certain extent (i.e. it will break out of the inner loops if some_condition is false, hence there is a chance to break out of the outer loop).
This is a classic in optimization. You should avoid putting conditionals within loops if you can. This code:
while (...)
{
    if (a)
    {
        foo();
    }
    else
    {
        bar();
    }
}
is often better to rewrite as:
if (a)
{
    while (...)
    {
        foo();
    }
}
else
{
    while (...)
    {
        bar();
    }
}
It's not always possible, though, and whenever you try to optimize something you should measure the performance before and after.
There is not much more useful optimizing you can do with your example.
The call/branch to do_this and do_that may negate any savings you earned by optimizing the if-then-else statement.
One of the rules of performance optimizing is to reduce branches. Most processors prefer to execute sequential code. They can take a chunk of sequential code and haul it into their caches. Branching interrupts this pleasantry and may cause a complete reload of the instruction cache (which loses valuable execution time).
Before you micro-optimize at this level, review your design to see if you can:
Eliminate unnecessary branching.
Split up code so it fits into the cache.
Organize the data to reduce fetches from memory or hard drive.
I'm sure that the above steps will gain you more performance than optimizing your posted loop.
I've been working on a piece of code recently where performance is very important, and essentially I have the following situation:
int len = some_very_big_number;
int counter = some_rather_small_number;
for (int i = len; i >= 0; --i) {
    while (counter > 0 && costly other stuff here) {
        /* do stuff */
        --counter;
    }
    /* do more stuff */
}
So here I have a loop that runs very often; for a certain number of runs the while block is executed as well, until counter is reduced to zero, after which the while loop is no longer entered because its first expression is false.
The question now is whether there is a difference in performance between using counter > 0 and counter != 0.
I suspect there is; does anyone know the specifics?
To measure is to know.
Do you think this will solve your problem? :D
if(x >= 0)
00CA1011 cmp dword ptr [esp],0
00CA1015 jl main+2Ch (0CA102Ch) <----
...
if(x != 0)
00CA1026 cmp dword ptr [esp],0
00CA102A je main+3Bh (0CA103Bh) <----
In programming, the following statement is the sign designating the road to Hell:
I've been working on a piece of code recently where performance is very important
Write your code in the cleanest, most easy to understand way. Period.
Once that is done, you can measure its runtime. If it takes too long, measure the bottlenecks, and speed up the biggest ones. Keep doing that until it is fast enough.
The list of projects that failed or suffered catastrophic loss due to a misguided emphasis on blind optimization is large and tragic. Don't join them.
I think you're spending time optimizing the wrong thing. "costly other stuff here", "do stuff" and "do more stuff" are more important to look at. That is where you'll make the big performance improvements I bet.
There will be a huge difference if the counter starts with a negative number. Otherwise, on every platform I'm familiar with, there won't be a difference.
Is there a difference between counter > 0 and counter != 0? It depends on the platform.
A very common type of CPU are those from Intel we have in our PC's. Both comparisons will map to a single instruction on that CPU and I assume they will execute at the same speed. However, to be certain you will have to perform your own benchmark.
As Jim said, when in doubt see for yourself :
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>

using namespace boost::posix_time;
using namespace std;

int main()
{
    ptime Before = microsec_clock::universal_time(); // UTC now
    // do stuff here
    ptime After = microsec_clock::universal_time();  // UTC now

    time_duration delta_t = After - Before;   // how much time has passed?
    cout << delta_t.total_seconds() << endl;      // whole seconds
    cout << delta_t.fractional_seconds() << endl; // fractional part, in microseconds
    return 0;
}
Here's a pretty nifty way of measuring time. Hope that helps.
OK, you can measure this, sure. However, these sorts of comparisons are so fast that you are probably going to see more variation based on processor swapping and scheduling than on this single line of code.
This smells of unnecessary, and premature, optimization. Write your program, optimize what you see. If you need more, profile, and then go from there.
I would add that the overwhelming performance aspects of this code on modern CPUs will be dominated not by the comparison instruction but by whether the comparison is well predicted, since any mispredict will waste many more cycles than any integral operation.
As such, loop unrolling will probably be the biggest winner; but measure, measure, measure.
Thinking that the type of comparison is going to make a difference, without knowing it, is the definition of guessing.
Don't guess.
In general, they should be equivalent (both are usually implemented in single-cycle instructions/micro-ops). Your compiler may do some strange special-case optimization that is difficult to reason about from the source level, which may make either one slightly faster. Also, equality testing is more energy-efficient than inequality testing (>), though the system-level effect is so small as to not merit discussion.
There may be no difference. You could try examining the assembly output for each.
That being said, the only way to tell if any difference is significant is to try it both ways and measure. I'd bet that the change makes no difference whatsoever with optimizations on.
Assuming you are developing for the x86 architecture, when you look at the assembly output it will come down to jns vs jne. jns will check the sign flag and jne will check the zero flag. Both operations, should as far as I know, be equally costly.
Clearly the solution is to use the correct data type.
Make counter an unsigned int. Then it can't be less than zero. Your compiler will obviously know this and be forced to choose the optimal solution.
Or you could just measure it.
You could also think about how it would be implemented...(here we go on a tangent)...
less than zero: the sign bit would be set, so need to check 1 bit
equal to zero : the whole value would be zero, so need to check all the bits
Of course, computers are funny things, and it may take longer to check a single bit than the whole value (however many bytes it is on your platform).
You could just measure it...
And you could find out that one is more optimal than the other (under the conditions you measured). But your program will still run like a dog, because you spent all your time optimising the wrong part of your code.
The best solution is to use what many large software companies do: blame the hardware for not running fast enough and encourage your customer to upgrade their equipment (which is clearly inferior, since your product doesn't run fast enough).
</rant>
I stumbled across this question just now, 3 years after it was asked, so I am not sure how useful the answer will still be... Still, I am surprised not to see it clearly stated that answering your question requires knowing two, and only two, things:
which processor you target
which compiler you work with
To the first point, each processor has different instructions for tests. On one given processor, two similar comparisons may turn out to take a different number of cycles. For example, you may have a 1-cycle instruction for gt (>), eq (==), or le (<=), but no 1-cycle instruction for other comparisons like ge (>=). Following a test, you may decide to execute conditional instructions or, more often, as in your code example, take a jump. There again, jumps take a variable number of cycles on most high-end processors, depending on whether the conditional jump is taken or not taken, predicted or not predicted. When you write code in assembly and your code is time critical, you can actually take quite a bit of time to figure out how to best arrange your code to minimize the overall cycle count, and you may end up with a solution that has to be tuned based on the number of times a given comparison returns true or false.
Which leads me to the second point: compilers, like human coders, try to arrange the code taking into account the instructions available and their latencies. Their job is harder because some assumptions an assembly coder could make, like "counter is small", are hard (though not impossible) for them to derive. For trivial cases like a loop counter, most modern compilers can at least recognize that the counter will always be positive and that != will be the same as >, and thus generate the best code accordingly. But, as many have mentioned in the posts, you will only know for sure if you either run measurements or inspect your assembly code and convince yourself this is the best you could do in assembly. And when you upgrade to a new compiler, you may then get a different answer.
Using the latest gcc compiler, do I still have to think about these types of manual loop optimizations, or will the compiler take care of them for me well enough?
If your profiler tells you there is a problem with a loop, and only then, a thing to watch out for is a memory reference in the loop which you know is invariant across the loop but the compiler does not. Here's a contrived example, bubbling an element out to the end of an array:
for ( ; i < a->length - 1; i++)
    swap_elements(a, i, i+1);
You may know that the call to swap_elements does not change the value of a->length, but if the definition of swap_elements is in another source file, it is quite likely that the compiler does not. Hence it can be worthwhile hoisting the computation of a->length out of the loop:
int n = a->length;
for ( ; i < n - 1; i++)
    swap_elements(a, i, i+1);
On performance-critical inner loops, my students get measurable speedups with transformations like this one.
Note that there's no need to hoist the computation of n-1; any optimizing compiler is perfectly capable of discovering loop-invariant computations among local variables. It's memory references and function calls that may be more difficult. And the code with n-1 is more manifestly correct.
As others have noted, you have no business doing any of this until you've profiled and have discovered that the loop is a performance bottleneck that actually matters.
Write the code, profile it, and only think about optimising it when you have found something that is not fast enough, and you can't think of an alternative algorithm that will reduce/avoid the bottleneck in the first place.
With modern compilers, this advice is even more important - if you write simple clean code, the compiler's optimiser can often do a better job of optimising the code than it can if you try to give it snazzy "pre-optimised" code.
Check the generated assembly and see for yourself. See if the computation for the loop-invariant code is being done inside the loop or outside the loop in the assembly code that your compiler generates. If it's failing to do the loop hoisting, do the hoisting yourself.
But as others have said, you should always profile first to find your bottlenecks. Once you've determined that this is in fact a bottleneck, only then should you check to see if the compiler's performing loop hoisting (aka loop-invariant code motion) in the hot spots. If it's not, help it out.
Compilers generally do an excellent job with this type of optimization, but they do miss some cases. Generally, my advice is: write your code to be as readable as possible (which may mean that you hoist loop invariants -- I prefer to read code written that way), and if the compiler misses optimizations, file bugs to help fix the compiler. Only put the optimization into your source if you have a hard performance requirement that can't wait on a compiler fix, or the compiler writers tell you that they're not going to be able to address the issue.
Where they are likely to be important to performance, you still have to think about them.
Loop hoisting is most beneficial when the value being hoisted takes a lot of work to calculate. If it takes a lot of work to calculate, it's probably a call out of line. If it's a call out of line, the latest version of gcc is much less likely than you are to figure out that it will return the same value every time.
Sometimes people tell you to profile first. They don't really mean it, they just think that if you're smart enough to figure out when it's worth worrying about performance, then you're smart enough to ignore their rule of thumb. Obviously, the following code might as well be "prematurely optimized", whether you have profiled or not:
#include <iostream>

bool isPrime(int p) {
    for (int i = 2; i*i <= p; ++i) {
        if ((p % i) == 0) return false;
    }
    return true;
}

int countPrimesLessThan(int max) {
    int count = 0;
    for (int i = 2; i < max; ++i) {
        if (isPrime(i)) ++count;
    }
    return count;
}

int main() {
    for (int i = 0; i < 10; ++i) {
        std::cout << "The number of primes less than 1 million is: ";
        std::cout << countPrimesLessThan(1000*1000);
        std::cout << std::endl;
    }
}
It takes a "special" approach to software development not to manually hoist that call to countPrimesLessThan out of the loop, whether you've profiled or not.
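The manually hoisted version is essentially a one-line change: compute the count once, before the loop. (The loop is wrapped in a helper function here instead of `main`, purely so the sketch stays self-contained.)

```cpp
#include <iostream>

bool isPrime(int p) {
    for (int i = 2; i*i <= p; ++i) {
        if ((p % i) == 0) return false;
    }
    return true;
}

int countPrimesLessThan(int max) {
    int count = 0;
    for (int i = 2; i < max; ++i) {
        if (isPrime(i)) ++count;
    }
    return count;
}

void printReports() {
    // Hoisted: the expensive call now runs once instead of ten times.
    const int primes = countPrimesLessThan(1000*1000);
    for (int i = 0; i < 10; ++i) {
        std::cout << "The number of primes less than 1 million is: ";
        std::cout << primes;
        std::cout << std::endl;
    }
}
```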
Early optimizations are bad only if other aspects - like readability, clarity of intent, or structure - are negatively affected.
If you have to declare the variable anyway, loop hoisting can even improve clarity, and it explicitly documents your assumption "this value doesn't change".
As a rule of thumb I wouldn't hoist the count/end iterator for a std::vector, because it's a common scenario easily optimized. I wouldn't hoist anything that I can trust my optimizer to hoist, and I wouldn't hoist anything known to be non-critical - e.g. when running through a list of a dozen windows to respond to a button click. Even if it takes 50ms, it will still appear "instantaneous" to the user. (But even that is a dangerous assumption: if a new feature requires looping 20 times over this same code, it is suddenly slow.) You should still hoist expensive operations such as opening a file handle for appending, etc.
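The std::vector case mentioned above might look like this (`sumAll` is a hypothetical name). Hoisting the end iterator by hand is legal because the loop never modifies the vector - but it is also exactly the kind of hoist a modern optimizer handles on its own:

```cpp
#include <vector>

// The end iterator is loop-invariant here, so hoisting it is safe --
// but this is precisely the hoist compilers routinely do for you,
// so writing it out buys readability at best.
int sumAll(const std::vector<int>& v) {
    int sum = 0;
    const auto end = v.end();            // hoisted by hand
    for (auto it = v.begin(); it != end; ++it)
        sum += *it;
    return sum;
}
```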
In many cases - loop hoisting is a good example - it helps a lot to consider relative cost: what is the cost of the hoisted calculation compared to the cost of running through the loop body?
As for optimizations in general, there are quite a few cases where the profiler doesn't help. Code may have very different behavior depending on the call path. Library writers often don't know their call path or frequency. Isolating a piece of code to make things comparable can already alter the behavior significantly. The profiler may tell you "Loop X is slow", but it won't tell you "Loop X is slow because call Y is thrashing the cache for everyone else". A profiler couldn't tell you "this code is fast because of your snarky CPU, but it will be slow on Steve's computer".
A good rule of thumb is that the compiler performs the optimizations it is able to perform.
Does the optimization require knowledge about your code that isn't immediately obvious to the compiler? Then it is hard for the compiler to apply the optimization automatically, and you may want to do it yourself.
In most cases, loop hoisting is a fully automatic process requiring no high-level knowledge of the code -- just a lot of lifetime and dependency analysis, which is what the compiler excels at in the first place.
It is possible to write code where the compiler is unable to determine whether something can be hoisted out safely though -- and in those cases, you may want to do it yourself, as it is a very efficient optimization.
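A classic case where the compiler cannot prove safety is pointer aliasing. In this sketch (the names are hypothetical), a store through `out` could - as far as the compiler knows - modify `*len`, so the load cannot be hoisted automatically:

```cpp
// If out and len may alias (point into the same storage), the compiler
// cannot hoist *len out of the loop: a store through out might change it,
// so *len must be re-read on each iteration.
void fill(int* out, const int* len) {
    for (int i = 0; i < *len; ++i)
        out[i] = i;
}

// If the caller knows they never alias, hoisting by hand is safe --
// a manual hoist under a no-aliasing assumption the compiler can't make.
void fillHoisted(int* out, const int* len) {
    const int n = *len;
    for (int i = 0; i < n; ++i)
        out[i] = i;
}
```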
As an example, take the snippet posted by Steve Jessop:
for (int i = 0; i < 10; ++i) {
    std::cout << "The number of primes less than 1 billion is: ";
    std::cout << countPrimesLessThan(1000*1000*1000);
    std::cout << std::endl;
}
Is it safe to hoist out the call to countPrimesLessThan? That depends on how and where the function is defined. What if it has side effects? It may make an important difference whether it is called once or ten times, as well as when it is called. If we don't know how the function is defined, we can't move it outside the loop. And the same is true if the compiler is to perform the optimization.
Is the function definition visible to the compiler? And is the function short enough that we can trust the compiler to inline it, or at least analyze the function for side effects? If so, then yes, it will hoist it outside the loop.
If the definition is not visible, or if the function is very big and complicated, then the compiler will probably assume that the function call can not be moved safely, and then it won't automatically hoist it out.
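A sketch of that situation (names are hypothetical): when only a declaration is visible, the compiler must assume the call could have side effects, so only the programmer can hoist it. The definition is placed in the same file here just so the snippet compiles - imagine it living in another translation unit:

```cpp
// Declaration only: from the caller's point of view the compiler knows
// nothing about this function, so it cannot hoist calls to it.
int countPrimesLessThan(int max);

int tenReports(int max) {
    // The programmer hoists the call, asserting it is pure and invariant.
    const int primes = countPrimesLessThan(max);
    int total = 0;
    for (int i = 0; i < 10; ++i)
        total += primes;
    return total;
}

// In a real scenario this definition would be in a separate file,
// invisible to the optimizer compiling tenReports.
int countPrimesLessThan(int max) {
    int count = 0;
    for (int n = 2; n < max; ++n) {
        bool prime = true;
        for (int d = 2; d * d <= n; ++d)
            if (n % d == 0) { prime = false; break; }
        if (prime) ++count;
    }
    return count;
}
```

(Link-time optimization can recover some of these cases, but the general point stands: the less the compiler can see, the more conservative it must be.)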
Remember the 80-20 rule: 80% of execution time is spent in 20% of a program's code, the critical part.
There is no point in optimizing code that has no significant effect on the program's overall efficiency.
One should not bother with this kind of local optimization up front. The best approach is to profile the code to find the critical parts of the program that consume the most CPU cycles, and to optimize those. That kind of optimization really makes sense and results in improved program efficiency.