Visual Studio C++ compiler optimizations breaking code? - c++

I have a peculiar issue here, which happens with both VS2005 and VS2010. I have a for loop in which an inline function is called; in essence it is something like this (C++, for illustrative purposes only):
inline double f(int a)
{
    if (a > 100)
    {
        // This is an error condition that shouldn't happen..
    }
    // Do something with a and return a double
}
And then the loop in another function:
for (int i = 0; i < 11; ++i)
{
    double b = f(i * 10);
}
Now what happens is that in a debug build everything works fine. In a release build with all optimizations turned on, according to the disassembly, this is compiled so that i is used directly without the * 10, and the comparison a > 100 turns into i > 9, while I would expect it to become i > 10. Do you have any leads as to what might make the compiler think that i > 9 is correct? Interestingly, even a minor change in the surrounding code (a debug printout, for example) makes the compiler use i * 10 and compare it with the literal value 100.
I know this is somewhat vague, but I'd be grateful for any old idea.
EDIT:
Here's a hopefully reproducible case. I don't consider it too big to be pasted here, so here goes:
__forceinline int get(int i)
{
    if (i > 600)
        __asm int 3;
    return i * 2;
}

int main()
{
    for (int i = 0; i < 38; ++i)
    {
        int j = (i < 4) ? 0 : get(i * 16);
    }
    return 0;
}
I tested this with VS2010 on my machine, and it seems to behave as badly as the original code I'm having problems with. I compiled and ran this with the IDE's default empty C++ project template, in release configuration. As you see, the break should never be hit (37 * 16 = 592). Note that removing the i < 4 makes this work, just like in the original code.

For anyone interested, it turned out to be a bug in the VS compiler. Confirmed by Microsoft and fixed in a service pack following the report.

First, it'd help if you could post enough code to allow us to reproduce the issue. Otherwise you're just asking for psychic debugging.
Second, it does occasionally happen that a compiler fails to generate valid code at the highest optimization levels, but more likely, you just have a bug somewhere in your code. If there is undefined behavior somewhere in your code, that means the assumptions made by the optimizer may not hold, and then the compiler can end up generating bad code.
But without seeing your actual code, I can't really get any more specific.

The only well-known optimization bugs I am aware of (and only at the highest optimization level) are occasional changes to the order in which operations are evaluated, caused by the optimizer rearranging the computation while looking for the fastest way to perform it. You could look in that direction (and add parentheses even where they are not strictly necessary; extra parentheses never hurt), but frankly, those kinds of bugs are quite rare.
As stated, it is difficult to say anything precise without more code.

Firstly, inline assembly prevents certain optimizations; you should use the __debugbreak() intrinsic for int 3 breakpoints instead. The compiler sees the inlined function as having no effect other than the breakpoint, so it folds the comparison: it divides the 600 by 16 (note: this is affected by integer truncation), and the debug break ends up triggering for 38 > i >= 37. So it seems to reproduce on this end as well.
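For illustration, here is a minimal sketch of the same check using the __debugbreak() intrinsic instead of inline assembly (MSVC-specific; declared in <intrin.h>):
#include <intrin.h>

__forceinline int get(int i)
{
    if (i > 600)
        __debugbreak();   // equivalent of `int 3`, but visible to the optimizer
    return i * 2;
}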

Related

Will an operation done several times in sequence be simplified by compiler?

I've had this question for a long time but never knew where to look. If a certain operation is written many times, will the compiler simplify it, or will it run the exact same operation every time and get the exact same answer?
For example, in the following C-like pseudo-code, (i%3)*10 is repeated many times:
for(int i=0; i<100; i++) {
    array[(i%3)*10] = someFunction((i%3)*10);
    int otherVar = (i%3)*10 + array[(i%3)*10];
    int lastVar = (i%3)*10 - otherVar;
    anotherFunction(lastVar);
}
I understand a variable would be better for visual purposes, but is it also faster? Is (i%3)*10 calculated 5 times per loop?
There are certain cases where I don't know if it's faster to use a variable or just leave the original operation in place.
Edit: using gcc (MinGW.org GCC-8.2.0-3) 8.2.0 on win 10
Which optimizations are done depends on the compiler, the compiler optimization flag(s) you specify, and the architecture.
Here are a few possible optimizations for your example:
Loop unrolling: this makes the binary larger and thus is a trade-off; for example, you may not want this on a tiny microprocessor with very little memory.
Common subexpression elimination (CSE): you can be pretty sure that your (i % 3) * 10 will only be executed once per loop iteration.
About your concern about visual clarity vs. optimization: When dealing with a 'local situation' like yours, you should focus on code clarity.
Optimization gains are often to be made at a higher level; for example in the algorithm you use.
There's a lot to be said about optimization; the above are just a few opening remarks. It's great that you're interested in how things work, because this is important for a good (C/C++) programmer.
As a matter of course, you should remove the obfuscation present in your code:
for (int i = 0; i < 100; ++i) {
int i30 = i % 3 * 10;
int r = someFunction(i30);
array[i30] = r;
anotherFunction(-r);
}
Suddenly, it looks quite a lot simpler.
Leave it to the compiler (with appropriate options) to optimize your code unless you find you actually have to take a hand after measuring.
In this case, unrolling three times looks like a good idea for the compiler to pursue, though inlining might reveal even better options.
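As a rough sketch (based on the simplified loop above, not actual compiler output), unrolling by three could produce something like this; (i % 3) * 10 simply cycles through 0, 10, 20:
for (int i = 0; i + 2 < 100; i += 3) {
    int r0 = someFunction(0);  array[0]  = r0;  anotherFunction(-r0);
    int r1 = someFunction(10); array[10] = r1;  anotherFunction(-r1);
    int r2 = someFunction(20); array[20] = r2;  anotherFunction(-r2);
}
// 100 is not a multiple of 3, so one iteration (i == 99, i % 3 == 0) remains:
int r = someFunction(0); array[0] = r; anotherFunction(-r);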
Yes, operations done several times in sequence will be optimized by a compiler.
To go into more detail, all major compilers (GCC, Clang, and MSVC) store the value of (i%3)*10 into a temporary (scratch, junk) register, and then use that whenever an equivalent expression is used again.
This optimization is called GCSE (global common subexpression elimination) in GCC, and just CSE elsewhere.
This takes a decent chunk out of the time that it takes to compute the loop.
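Conceptually (a source-level sketch, not actual compiler output), CSE turns the loop from the question into something like this:
for (int i = 0; i < 100; i++) {
    int tmp = (i % 3) * 10;            // computed once, reused everywhere below
    array[tmp] = someFunction(tmp);
    int otherVar = tmp + array[tmp];
    int lastVar = tmp - otherVar;
    anotherFunction(lastVar);
}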

How can I make my compiler stricter (wrong index)?

I have encountered a horrible situation.
I usually use VS Code to edit my code, and I also compile and execute in it (F5).
But I found that VS Code is too smart, or ignores some warning messages for me, and
outputs the right answer; the code also works fine on Ideone. But in the Windows cmd or Dev-C++, my code can't output anything and just returns a big number.
I've found a situation that triggers the behaviour described above.
The code is like this:
for (i = 0; i < g.size(); i++)
{
    int source;
    int dest;
    int minWeight = 999;
    for (j = 0; i < g[j].size(); j++)
    {
        // no edge, come to next condition
        if (!g[i][j])
            continue;
        if (g[i][j] < minWeight)
        {
            source = i;
            dest = j;
            minWeight = g[i][j];
        }
    }
    if
        updateGroup(index, index[source], index[dest]);
    else
        updateGroup(index, index[source], index[dest]);
}
You may have noticed that the second for loop has the wrong condition; it should be changed from
j = 0; i < g[j].size(); j++ to j = 0; j < g[i].size(); j++
So I would like to know:
Is there any way to make my VS Code setup stricter?
Why does it still output the right answer in VS Code and on Ideone?
How can I avoid, or more easily find, this kind of bug when there is no error message?
I really hope someone can help me, and I appreciate all of your suggestions!
There is no way for the compiler or computer to read your mind and guess what you meant to write instead of what you really did mean.
Even when this mistake results in a bug, it cannot know that you did not intend to write this, or that you meant to write some other specific thing instead.
Even when this bug results in your program having undefined behaviour, it is not possible to detect many cases of undefined behaviour, and it is not worthwhile for a compiler author to attempt to write code to do this, because it's too hard and not useful enough. Even if they did, the compiler could still not guess what you meant instead.
Remember, loops like this don't have to check or increment the same variable that you declared in the preamble; that's just a common pattern (now superseded by the safer ranged-for statement). There's nothing inherently wrong with having a loop that increments i but checks j.
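For example, a minimal sketch of the inner loop from the question using a range-based for (names follow the question; the rest of the logic is unchanged), which removes the hand-written condition where the i/j mix-up happened:
for (std::size_t i = 0; i < g.size(); i++)
{
    int source = -1, dest = -1, minWeight = 999;
    std::size_t j = 0;
    for (int weight : g[i])   // walks g[i]; there is no condition to get wrong
    {
        if (weight && weight < minWeight)
        {
            source = static_cast<int>(i);
            dest = static_cast<int>(j);
            minWeight = weight;
        }
        ++j;
    }
    // ... use source/dest as before ...
}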
Ultimately, the solution to this problem is to write tests for your code, which is why many organisations have dedicated Quality Assurance teams to search for bugs, and why you should already be testing your code before committing it to your project.
Remember to concentrate and pay close attention and read your code, and eventually such typos will become less common in your work. Of course once in a while you will write a bug, and your tests will catch it. Sometimes your tests won't catch it, which is when your customers will eventually notice it and raise a complaint. Then, you release a new version that fixes the bug.
This is all totally normal software development practice. That's what makes it fun! 😊
1) Crank up your compiler warnings as high as they will go.
2) Use multiple different compilers (they all warn about different things).
3) Know the details of the language well (a multi year effort) and be really, really careful about the code you write.
4) Write (and regularly run) lots of tests.
5) Use tools like sanitizers, fuzzers, linters, static code analyzers etc. to help catch bugs.
6) Build and run your code on multiple platforms to keep it portable and find bugs exposed by different environments/implementations.
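On top of the tools above, defensive coding helps: if g is a std::vector of std::vector<int> (an assumption, the question does not show the type), bounds-checked access with at() turns the silent out-of-bounds read into a loud failure. A minimal sketch:
#include <cstddef>
#include <vector>

int weightAt(const std::vector<std::vector<int>>& g, std::size_t i, std::size_t j)
{
    return g.at(i).at(j);   // throws std::out_of_range instead of reading past the end
}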

g++ strict overflow, optimization, and warnings

When compiling the following with the strict overflow flag, it tells me, on the second test, that r may not be what I think it is:
int32_t r(my_rand());
if(r < 0) {
    r = -r;
    if(r < 0) { // <-- error on this line
        r = 0;
    }
}
The error is:
/build/buildd/libqtcassandra-0.5.5/tests/cassandra_value.cpp:
In function 'int main(int, char**)':
/build/buildd/libqtcassandra-0.5.5/tests/cassandra_value.cpp:2341:13:
error: assuming signed overflow does not occur when simplifying
conditional to constant [-Werror=strict-overflow]
if(r < 0) {
^
What I do not understand is: why wouldn't the error be generated on the line before that? Because really the overflow happens when I do this, right?
r = -r;
EDIT: I removed my first answer, because it was invalid. Here is a completely new version. Thanks to @Neil Kirk for pointing out my errors.
Answer for the question is here: https://stackoverflow.com/a/18521660/2468549
GCC always assumes that signed overflow never occurs, and, on that assumption, it (always) optimizes out the inner if (r < 0) block.
If you turn on -Wstrict-overflow, the compiler then finds that after r = -r the condition r < 0 may still be true (if r == -2^31 initially), which triggers the error. The error is caused by the optimization that relies on the assumption that overflow never occurs, not by the possibility of the overflow itself; that is how -Wstrict-overflow works.
A couple of additions in regard to having such overflows:
You can turn off these warnings with a #pragma GCC ...
If you are sure that your code will always work (wrapping of integers is fine in your function) then you can use a pragma around the offensive block or function:
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wstrict-overflow"
...offensive code here...
#pragma GCC diagnostic pop
Of course, this means you are completely ignoring the errors and let the compiler do its optimizations. Better have a test to make 100% sure that edge cases work as expected!
Use unsigned integers
The documentation says:
For C (and C++) this means that overflow when doing
arithmetic with signed numbers is undefined, which
means that the compiler may assume that it will not
happen.
In other words, if the math uses unsigned integers, this optimization doesn't apply. So if you need to do something such as r = -r, you can first copy r in an unsigned integer of the same size and then do the sign change:
std::int32_t r = ...;
std::uint32_t q = r;
q = -q;
if(static_cast<std::int32_t>(q) < 0) ...
That should work as expected: the if() statement will not be optimized out.
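Putting that together, a minimal self-contained sketch (the function name is made up; it follows the apparent intent of the original code, i.e. clamp INT32_MIN to 0):
#include <cstdint>

std::int32_t abs_or_zero(std::int32_t r)
{
    std::uint32_t q = r;
    if (r < 0) {
        q = -q;                                   // well defined: unsigned arithmetic wraps
        if (static_cast<std::int32_t>(q) < 0) {   // only possible when r was INT32_MIN
            q = 0;
        }
    }
    return static_cast<std::int32_t>(q);
}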
Turn off the actual optimization
Note: this was seemingly working in older compilers, but it does not currently seem to be possible on a per-function basis; you have to pass the option on the command line rather than as an attribute. I leave this here for now as it may work for you.
I like this one better. I ran into a test that would fail in Release mode today. It is nothing that I use on a regular basis, only a very specific edge case, but it worked just fine in Debug; therefore, not having the compiler optimize that function is a better option. You can actually do so as follows:
T __attribute__((optimize("-fno-strict-overflow"))) func(...)
{
...
}
That attribute cancels the -fstrict-overflow¹ behaviour by actually emitting the code as written. My test now passes in both Debug and Release.
Note: this is g++ specific. See your compiler documentation for equivalents, if available.
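If your g++ version does honour the attribute (see the caveat above), applying it to the code from the question would look roughly like this (a sketch; the function name is made up):
#include <cstdint>

std::int32_t __attribute__((optimize("-fno-strict-overflow"))) clamp_abs(std::int32_t r)
{
    if (r < 0) {
        r = -r;
        if (r < 0) {   // reachable when r started out as INT32_MIN
            r = 0;
        }
    }
    return r;
}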
Either way, as mentioned by Frax, in -O3 mode the compiler wants to optimize the test away by removing it together with the whole block of code. The whole block can be removed because, after r = -r;, r is expected to be positive if it was negative before, so testing for a negative number again is optimized out, and that is what the compiler is warning us about. With the -fno-strict-overflow attribute, however, you ask the compiler not to apply that optimization in that one function. As a result you get the expected behaviour in all cases, including overflows.
¹ I find the option name confusing. With -fstrict-overflow you ask the optimizer to optimize as if signed overflow could never happen, even though the resulting code misbehaves when an overflow does occur.

Force compiler to not optimize side-effect-less statements

I was reading some old game programming books and, as some of you might know, back in the day it was usually faster to do bit hacks than to do things the standard way (converting a float to int, masking the sign bit, and converting back for absolute value, instead of just calling fabs(), for example).
Nowadays it is almost always better to just use the standard library math functions, since these tiny things are hardly the cause of most bottlenecks anyway.
But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:
void float_to_int(float f)
{
    int i = static_cast<int>(f); // has no side-effects
}
Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.
The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.
I'm using Visual Studio 2008 / G++ (3.4.4).
Edit
To clarify, I would like to have all optimizations maxed out, to get good profile results. The problem is that with this the statements with no side-effect will be optimized out, hence the situation.
Edit Again
To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.
We all know that the old tricks aren't very useful anymore, I'm merely curious how not useful they are. Just plain curiosity. Sure, life could go on without me knowing just how these old hacks perform against modern day CPU's, but it never hurts to know.
So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.
Premature quoting of Knuth is the root of all annoyance.
Assignment to a volatile variable should never be optimized away, so this might give you the result you want:
static volatile int i = 0;

void float_to_int(float f)
{
    i = static_cast<int>(f); // now has a side effect: the volatile write
}
So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements
You are by definition skewing the results.
Here's how to fix the problem of trying to profile "dummy" code that you wrote just to test: For profiling, save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get any other optimizations it can put in to make the code fast.
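A rough sketch of that idea (array size, values, and names are made up for illustration):
#include <cstdio>

static int results[1000000];

int main()
{
    for (int i = 0; i < 1000000; ++i)
        results[i] = static_cast<int>(i * 0.5f);   // the work being measured
    std::printf("%d\n", results[123]);             // keeps the stores observable
    return 0;
}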
In this case I suggest you make the function return the integer value:
int float_to_int(float f)
{
    return static_cast<int>(f);
}
Your calling code can then exercise it with a printf to guarantee it won't optimize it out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.
extern int float_to_int(float f);

int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
    sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);
Now compare this to an empty function like:
int take_float_return_int(float /* f */)
{
    return 1;
}
Which should also be external.
The difference in times should give you an idea of the expense of what you're trying to measure.
What has always worked on all the compilers I have used so far:
extern volatile int writeMe = 0;

void float_to_int(float f)
{
    writeMe = static_cast<int>(f);
}
Note that this skews results; both methods should write to writeMe.
volatile tells the compiler "the value may be accessed without your notice", thus the compiler cannot omit the calculation and drop the result. To block propagation of input constants, you might need to run them through an extern volatile, too:
extern volatile float readMe = 0;
extern volatile int writeMe = 0;

void float_to_int(float f)
{
    writeMe = static_cast<int>(f);
}

int main()
{
    readMe = 17;
    float_to_int(readMe);
}
Still, all optimizations in between the read and the write can be applied "with full force". The read and write to the global variable are often good "fenceposts" when inspecting the generated assembly.
Without the extern, the compiler may notice that a reference to the variable is never taken, and thus determine it can't be volatile. Technically, with Link Time Code Generation, even that might not be enough, but I haven't found a compiler that aggressive. (For a compiler that did remove the access, the reference would need to be passed to a function in a DLL loaded at runtime.)
Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, if the code behaves as if no optimisation takes place. However, you can often trick them into not doing so if you indicate that value might be used later, so I would change your code to:
int float_to_int(float f)
{
    return static_cast<int>(f); // has no side-effects
}
As others have suggested, you will need to examine the assembler output to check that this approach actually works.
You just need to skip to the part where you learn something and read the published Intel CPU optimisation manual.
These quite clearly state that casting between float and int is a really bad idea because it requires a store from the int register to memory followed by a load into a float register. These operations cause a bubble in the pipeline and waste many precious cycles.
A function call incurs quite a bit of overhead, so I would remove this anyway.
Adding a dummy += i; is no problem, as long as you keep the same bit of code in the alternate profile too (i.e. the code you are comparing it against).
Last but not least: generate the asm code. Even if you cannot code in asm, the generated code is typically understandable, since it will have labels and commented C code behind it. So you know (sort of) what happens, and which bits are kept.
p.s. found this too:
inline float pslNegFabs32f(float x){
    __asm{
        fld x    //Push 'x' into st(0) of FPU stack
        fabs
        fchs     //change sign
        fstp x   //Pop from st(0) of FPU stack
    }
    return x;
}
supposedly also very fast. You might want to profile this too. (although it is hardly portable code)
Return the value?
int float_to_int(float f)
{
    return static_cast<int>(f); // has no side-effects
}
and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.
You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.
If you are using Microsoft's compiler - cl.exe, you can use the following statement to turn optimization on/off on a per-function level [link to doc].
#pragma optimize("" ,{ on |off })
Turn optimizations off for functions defined after the current line:
#pragma optimize("" ,off)
Turn optimizations back on:
#pragma optimize("" ,on)
For example, in the code behind the Compiler Explorer link below, you can notice three things.
The compiler optimization flag /O2 is set, so the code will be optimized.
Optimizations are turned off for the first function, square(), and turned back on before square2() is defined.
The amount of assembly generated for the first function is higher; in the second function, no assembly at all is generated for the int i = num; statement.
Thus, while the first function is not optimized, the second one is.
See https://godbolt.org/z/qJTBHg for link to this code on compiler explorer.
A similar directive exists for gcc too - https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html
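A minimal sketch reconstructing that example (assuming the square()/square2() bodies from the description above, not necessarily the exact code behind the link):
#pragma optimize("", off)   // functions defined below are compiled without optimization
int square(int num)
{
    int i = num;            // kept in the generated assembly
    return i * i;
}
#pragma optimize("", on)    // optimization (/O2) restored for what follows
int square2(int num)
{
    int i = num;            // folded away by the optimizer
    return i * i;
}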
A micro-benchmark around this statement will not be representative of using this approach in a genuine scenario; the surrounding instructions and their effect on the pipeline and cache are generally as important as any given statement in itself.
GCC 4 does a lot of micro-optimizations that GCC 3.4 never did. GCC 4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin(), fabs(), etc., as well as optimizing such calls to their FPU, SSE, or 3DNow! equivalents.
I know the Intel compiler is also extremely good at these kinds of optimizations.
My suggestion is to not worry about micro-optimizations like this - on relatively new hardware (anything built in the last 5 or 6 years), they're almost completely moot.
Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast to int and bit mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus," are completely moot, and as pointed out in another answer, could potentially be slower than instructions on the FPU and in SSE.
All of this is due to the fact that newer CPUs are pipelined: instructions are decoded and dispatched to fast computation units. Instructions no longer cost a fixed number of clock cycles, and performance is more sensitive to cache misses and inter-instruction dependencies.
Check the AMD and Intel processor programming manuals for all the gritty details.

Is there any performance difference between for() and while()?

Or is it all about semantics?
Short answer: no, they are exactly the same.
Guess it could in theory depend on the compiler; a really broken one might do something slightly different but I'd be surprised.
Just for fun here are two variants that compile down to exactly the same assembly code for me using x86 gcc version 4.3.3 as shipped with Ubuntu. You can check the assembly produced on the final binary with objdump on linux.
#include <cstdio>

int main()
{
#if 1
    int i = 10;
    do { printf("%d\n", i); } while(--i);
#else
    int i = 10;
    for (; i; --i) printf("%d\n", i);
#endif
}
EDIT: Here is an "oranges with oranges" while loop example that also compiles down to the same thing:
while(i) { printf("%d\n", i); --i; }
If your for and while loops do the same things, the machine code generated by the compiler should be (nearly) the same.
For instance in some testing I did a few years ago,
for (int i = 0; i < 10; i++)
{
    ...
}
and
int i = 0;
do
{
    ...
    i++;
}
while (i < 10);
would generate exactly the same code, or (and Neil pointed out in the comments) with one extra jmp, which won't make a big enough difference in performance to worry about.
There is no semantic difference, and there need not be any difference in the compiled code. But it does depend on the compiler, so I tried with g++ 4.3.2, CC 5.5, and xlc 6.
g++ and CC were identical; xlc WAS NOT.
The difference in xlc was in the initial loop entry.
extern int doit( int );

void loop1( ) {
    for ( int ii = 0; ii < 10; ii++ ) {
        doit( ii );
    }
}

void loop2() {
    int ii = 0;
    while ( ii < 10 ) {
        doit( ii );
        ii++;
    }
}
XLC OUTPUT
.loop2: # 0x00000000 (H.10.NO_SYMBOL)
mfspr r0,LR
stu SP,-80(SP)
st r0,88(SP)
cal r3,0(r0)
st r3,64(SP)
l r3,64(SP) ### DIFFERENCE ###
cmpi 0,r3,10
bc BO_IF_NOT,CR0_LT,__L40
...
.loop1: # 0x0000006c (H.10.NO_SYMBOL+0x6c)
mfspr r0,LR
stu SP,-80(SP)
st r0,88(SP)
cal r3,0(r0)
cmpi 0,r3,10 ### DIFFERENCE ###
st r3,64(SP)
bc BO_IF_NOT,CR0_LT,__La8
...
The scope of the variable in the test of the while loop is wider than the scope of variables declared in the header of the for loop.
Therefore, if there are performance implications as a side-effect of keeping a variable alive longer, then there will be performance implications in choosing between a while and a for loop ( and not wrapping the while up in {} to reduce the scope of its variables ).
An example might be a concurrent collection which counts the number of iterators referring to it, and if more than one iterator exists, it applies locking to prevent concurrent modification, but as an optimisation elides the locking if only one iterator refers to it. If you then had two for loops in a function using differently named iterators on the same container, the fast path would be taken, but with two while loops the slow path would be taken. Similarly there may be performance implications if the objects are large (more cache traffic), or use system resources. But I can't think of a real example that I've ever seen where it would make a difference.
Compilers that optimize using loop unrolling will probably only do so in the for-loop case.
Both are equivalent. It's a matter of semantics.
The only difference may lie in the do... while construct, where you postpone the evaluation of the condition until after the body, and thus may save 1 evaluation.
i = 1; do { ... i--; } while( i > 0 );
as opposed to
for( i = 1; i > 0; --i )
{
    ...
}
I write compilers. We compile all "structured" control flow (if, while, for, switch, do...while) into conditional and unconditional branches. Then we analyze the control-flow graph. Since a C compiler has to deal with general goto anyway, it is easiest to reduce everything to branch and conditional-branch instructions, then be sure to handle that case well. (A C compiler has to do a good job not just on handwritten code but also on automatically generated code, which may have many, many goto statements.)
No. If they're doing equivalent things, they'll compile to the same code - as you say, it's about semantics. Choose the one that best represents what you're trying to express.
Ideally it should be the same, but eventually it depends on your compiler/interpreter. To be sure, you must measure or examine the generated assembly code.
Proof that there may be a difference: These lines produce different assembly code using cc65.
for (; i < 1000; ++i);
while (i < 1000) ++i;
On the Atmel ATmega, while() is faster than for(). Why this is so is explained in AVR035: Efficient C Coding for AVR.
P.S. The original platform was not mentioned in the question.
continue behaves differently in for and while: in a for loop it still executes the increment expression, so the counter advances; in a while loop it jumps straight back to the condition, so an increment written at the end of the body is skipped.
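A small sketch of that difference (illustrative only; note that the while version below loops forever once j becomes odd):
for (int i = 0; i < 10; ++i) {
    if (i % 2) continue;   // jumps to ++i, so the loop still advances
    // ...
}

int j = 0;
while (j < 10) {
    if (j % 2) continue;   // jumps back to the condition, skipping ++j below
    // ...
    ++j;
}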
To add another answer: In my experience, optimizing software is like a big, bushy beard being shaved off a man.
First you lop it off in big chunks with scissors (prune whole limbs off the call tree).
Then you make it short with an electric clipper (tweak algorithms).
Finally you shave it with a razor to get rid of the last little bit (low-level optimization).
The last is where the difference between for() and while() might, but probably won't, make a difference.
P.S. The programmers I know (who are all very good, and I suspect are a representative sample) basically go at it from the other direction.
They are the same as far as performance goes. I tend to use while when waiting for a state change (such as waiting for a buffer to be filled) and for when processing a number of discrete objects (such as going through each item in a collection).
There is a difference in some cases.
If you are at the point where that difference matters, you either need to pick a better algorithm or begin coding in assembly language. Trust me, coding in assembly is preferable to fixing your compiler version.
Is while() faster/slower than for()? Let's review a few things about optimization:
Compiler-writers work very hard to shave cycles by having fewer calls to jump, compare, increment, and the other kinds of instructions that they generate.
Call instructions, on the other hand, consume many magnitudes more cycles, but the compiler is nearly powerless to do anything to remove those.
As programmers, we write lots of function calls, some because we mean to, some because we're lazy, and some because the compiler slips them in without being obvious.
Most of the time, it doesn't matter, because the hardware is so fast, and our jobs are so small, that the computer is like a beagle dog who wolfs down her food and begs for more.
Sometimes, however, the job is big enough that performance is an issue.
What do we do then? Where's the bigger payoff?
Getting the compiler to shave a few cycles off loops & such?
Finding function calls that don't -really- need to be done so much?
The compiler can't do the latter. Only we the programmers can.
We need to learn or be taught how to do this. It doesn't come naturally.
We are congenitally inclined to make wrong guesses and then bet on them.
Getting better algorithms is a start, but only a start. Our teachers need to teach this, if indeed they know how.
Profilers are a start. I do this.
The apocryphal quote of Willie Sutton when asked Why do you rob banks?:
Because that's where the money is.
If you want to save cycles, find out where they are.
Probably only coding style.
for if you know the number of iterations.
while if you do not know the number of iterations.