What is faster: compare then change, or change immediately? - c++

Say I'm running very fast loops and I have to be sure that at the end of each loop the variable a is SOMEVALUE. What will be faster?
if (a != SOMEVALUE) a = SOMEVALUE;
or just instantly do
a = SOMEVALUE;
Is it float/int/bool/language specific?
Update: a is a primitive type, not a class. And the probability that the comparison is true is 50%. I know that the algorithm is what makes a loop fast, so my question is also about coding style.
Update2: thanks everyone for quick answers!

In almost all cases just setting the value will be faster.
It might not be faster when you have to deal with cache line sharing with other CPUs, or if a is in some special type of memory, but it's safe to assume that a branch misprediction is probably a more common problem than cache sharing.
Also - smaller code is better, not just for the cache but also for making the code comprehensible.
If in doubt - profile.

The general answer for such questions is to profile. However, in this case a simple analysis is available:
Each test is a branch. Each branch incurs a slight performance penalty. However, we have branch prediction and this penalty is somewhat amortized in time, depending how many iterations your loop has and how many times the prediction was correct.
Translated into your case: if you have many changes to a during the loop, it is very likely that the code using if will perform worse. On the other hand, if the value is updated very rarely, there will be a vanishingly small difference between the two cases.
Still, change immediately is better and should be used, as long as you don't care about the previous value, as your snippets show.
Other reasons for an immediate change: it leads to smaller code, thus better cache locality, thus better code performance. It is a very rare situation in which updating a will invalidate a cache line and incur a performance hit. Still, if I remember correctly, this will bite you only on multiprocessor systems, and very rarely.
Keep in mind that there are cases when the two are not equivalent. With floating point, NaN compares unequal to everything, including itself, so the test may not behave the way you expect.
Also, this comment treats only the case of C. In C++ you can have classes where the assignment operator / copy constructor takes longer than testing for equality. In that case, you might want to test first.
Taking into account your update, it's better to simply use assignment, as long as you're sure of not dealing with surprising comparison semantics (floating-point NaN). Coding-style wise it is also better: easier to read.

You should profile it.
My guess would be that there is little difference, depending on how often the test is true (this is due to branch-prediction).
Of course, just setting it has the smallest absolute code size, which frees up instruction cache for more interesting code.
But, again, you should profile it.

I would be surprised if the answer wasn't a = somevalue, but there is no generic answer to this question. Firstly, it depends on the speed of copying versus the speed of the equality comparison. If the equality comparison is very fast then your first option may be better. Secondly, as always, it depends on your compiler/platform. The only way to answer such questions is to try both methods and time them.

As others have said, profiling is going to be the easiest way to tell, as it depends a lot on what kind of input you're throwing at it. However, if you think about the computational complexity of the two versions, the more input you throw at them, the smaller any possible difference between them becomes.

As you are asking this for a C++ program, I assume that you are compiling the code into native machine instructions.
Assigning the value directly without any comparison should be faster in most cases. To compare the values, both a and SOMEVALUE have to be transferred to registers and a compare instruction (cmp) has to be executed.
But in the latter case, where you assign directly, you just move one value from one memory location to another.
The only way the assignment can be slower is when memory writes are significantly costlier than memory reads. I don't see that happening.

Profile the code. Change accordingly.
For basic types, the no-branch option should be faster. MSVS, for example, doesn't optimize the branch out.
That being said, here's an example of where the comparison version is faster:
#include <iostream>
using namespace std;

struct X
{
    bool comparisonDone;
    X() : comparisonDone(false) {}
    bool operator != (const X& other) { comparisonDone = true; return true; }
    X& operator = (const X& other)
    {
        if ( !comparisonDone )
        {
            for ( int i = 0 ; i < 1000000 ; i++ )
                cout << i;
        }
        return *this;
    }
};

int main()
{
    X a;
    X SOMEVALUE;
    if (a != SOMEVALUE) a = SOMEVALUE; // != sets the flag, so operator= skips the slow path
    a = SOMEVALUE;                     // the flag is still set here; on a fresh object this would be slow
    return 0;
}

Change immediately is usually faster, as it involves no branch in the code.
As commented below and answered by others, it really depends on many variables, but IMHO the real question is: do you care what the previous value was? If you do, you should check; otherwise, you shouldn't.

That if can actually be 'optimized away' by some compilers, basically turning the if into code noise (for the programmer who's reading it).
When I compile the following function with GCC for x86 (with -O1, which is a pretty reasonable optimization level):
int foo (int a)
{
int b;
if (b != a)
b = a;
b += 5;
return b;
}
GCC just 'optimizes' the if and the assignment away, and simply uses the argument to do the addition (b is uninitialized, so the compiler is entitled to assume any value for it):
foo:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
popl %ebp
addl $5, %eax
ret
.ident "GCC: (GNU) 4.4.3"
Having or not having the if generates exactly the same code.

Related

Is comparing to zero faster than comparing to any other number?

Is
if(!test)
faster than
if(test==-1)
I can produce the assembly, but there is so much of it that I can never locate the particulars I'm after. I was hoping someone just knows the answer. I would guess they are the same unless most CPU architectures have some sort of "compare to zero" shortcut.
thanks for any help.
Typically, yes. In typical processors, testing against zero or testing the sign (negative/positive) is a simple condition-code check. This means that instructions can be re-ordered to omit a test instruction. In pseudo assembly, consider this:
Loop:
LOADCC r1, test // load test into register 1, and set condition codes
BCZS Loop // If zero was set, go to Loop
Now consider testing against 1:
Loop:
LOAD r1, test // load test into register 1
SUBT r1, 1 // Subtract Test instruction, with destination suppressed
BCNE Loop // If not equal to 1, go to Loop
Now for the usual pre-optimization disclaimer: Is your program too slow? Don't optimize, profile it.
It depends.
Of course it's going to depend, not all architectures are equal, not all µarchs are equal, even compilers aren't equal but I'll assume they compile this in a reasonable way.
Let's say the platform is 32bit x86, the assembly might look something like
test eax, eax
jnz skip
Vs:
cmp eax, -1
jnz skip
So what's the difference? Not much. The first snippet takes a byte less. The second snippet might be implemented with an inc to make it shorter, but that would make it destructive so it doesn't always apply, and anyway, it's probably slower (but again it depends).
Take any modern Intel CPU. They do "macro fusion", which means they take a comparison and a branch (subject to some limitations), and fuse them. The comparison becomes essentially free in most cases. The same goes for test. Not inc though, but the inc trick only really applied in the first place because we just happened to compare to -1.
Apart from any "weird effects" (due to changed alignment and whatnot), there should be absolutely no difference on that platform. Not even a small difference.
Even if you got lucky and got the test for free as a result of a previous arithmetic instruction, it still wouldn't be any better.
It'll be different on other platforms, of course.
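If you want to reproduce this on your own compiler, a pair of one-line functions is enough; compile with something like g++ -S -O2 and diff the listings. The instructions shown in the comments are typical x86 output, not guaranteed:

```cpp
// Typical x86 codegen: the zero test can use `test eax, eax`,
// while the -1 comparison needs `cmp eax, -1` (one byte longer).
bool is_zero(int x)      { return x == 0;  }
bool is_minus_one(int x) { return x == -1; }
```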
On x86 there won't be any noticeable difference unless you are doing some math at the same time: e.g. in while(--x) the result of --x will automatically set the condition codes, whereas while(x) ... will necessitate some sort of test on the value in x before we know if it's zero or not.
Many other processors do have automatic updates of the condition codes on LOAD or MOVE instructions, which means that checking for "positive", "negative" and "zero" is "free" with every movement of data. Of course, you pay for that by not being able to hoist the compare instruction away from the branch instruction: if you have a comparison, the very next instruction MUST be the conditional branch, whereas an extra instruction between the two could otherwise help hide any delay in the "result" of such an instruction.
In general, these sort of micro-optimisations are best left to compilers, rather than the user - the compiler will quite often convert for(i = 0; i < 1000; i++) into for(i = 1000-1; i >= 0; i--) if it thinks that makes sense [and the order of the loop isn't important in the compiler's view]. Trying to be clever with these sort of things tend to make the code unreadable, and performance can suffer badly on other systems (because when you start tweaking "natural" code to "unnatural", the compiler tends to think that you really meant what you wrote, and not optimise it the same way as the "natural" version).

What is efficient to check: equal to or not equal to?

I was wondering, if we have if-else condition, then what is computationally more efficient to check: using the equal to operator or the not equal to operator? Is there any difference at all?
E.g., which one of the following is computationally more efficient? Both cases below do the same thing, but which one is better (if there's any difference)?
Case1:
if (a == x)
{
// execute Set1 of statements
}
else
{
// execute Set2 of statements
}
Case 2:
if (a != x)
{
// execute Set2 of statements
}
else
{
// execute Set1 of statements
}
Here the assumption is that most of the time (say 90% of cases) a will be equal to x. Both a and x are of unsigned integer type.
Generally it shouldn't matter for performance which operator you use. However it is recommended for branching that the most likely outcome of the if-statement comes first.
Usually what you should consider is: what is the simplest and clearest way to write this code? IMHO, the first (positive) form is the simplest, not requiring a !.
In terms of performance there is no difference, as the code is likely to compile to the same thing. (Certainly in the JIT for Java it should.)
For Java, the JIT can optimise the code so the most common branch is preferred by the branch prediction.
In this simple case, it makes no difference. (assuming a and x are basic types) If they're class-types with overloaded operator == or operator != they might be different, but I wouldn't worry about it.
For subsequent loops:
if ( c1 ) { }
else if ( c2 ) { }
else ...
the most likely condition should be put first, to prevent useless evaluations of the others. (again, not applicable here since you only have one else).
GCC provides a way to inform the compiler about the likely outcome of an expression:
if (__builtin_expect(expression, 1))
…
This built-in evaluates to the value of expression, but it informs the compiler that the likely result is 1 (true for Booleans). To use this, you should write expression as clearly as possible (for humans), then set the second parameter to whichever value is most likely to be the result.
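A sketch of how that hint might look in practice. __builtin_expect is a GCC/Clang extension (C++20 also offers the standard [[likely]]/[[unlikely]] attributes), and the ~90% figure is the assumption from the question:

```cpp
// Wrap the extension so the condition itself stays readable.
#define LIKELY(x) __builtin_expect(!!(x), 1)

int dispatch(int a, int x) {
    if (LIKELY(a == x)) {   // the ~90% case goes first
        return 1;           // Set1 of statements
    } else {
        return 2;           // Set2 of statements
    }
}
```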
There is no difference.
The x86 CPU architecture has two opcodes for conditional jumps
JNE (jump if not equal)
JE (jump if equal)
Usually they both take the same number of CPU cycles.
And even when they wouldn't, you could expect the compiler to do such trivial optimizations for you. Write what's most readable and what makes your intention clearest instead of worrying about microseconds.
If you ever manage to write a piece of Java code that can be proven to be significantly more efficient one way than the other, you should publish your result and raise an issue against whatever implementation you observed the difference on.
More to the point, just asking this kind of question should be a sign of something amiss: it is an indication that you are focusing your attention and efforts on a wrong aspect of your code. Real-life application performance always suffers from inadequate architecture; never from concerns such as this.
Early optimization is the root of all evil
Even for branch prediction, I think you should not care too much about this, until it is really necessary.
Just as Peter said, use the simplest way.
Let the compiler/optimizer do its work.
It's a general rule of thumb (for most code nowadays) that the source code should express your intention in the most readable way. You are writing it for another human, not for the computer: your one-year-later self, or the teammate who will need to understand your code with the least effort.
It shouldn't make any difference performance-wise, but you should consider what is easiest to read. Then when you are looking back on your code, or if someone else is looking at it, you want it to be easy to understand.
There is a small readability advantage if the first condition is the one that is true in most cases.
Write the conditions the way you can read them best; you will not gain speed by negating a condition.
Most processors check equality/inequality in a single ALU operation; all bits are compared at once. Therefore it should make no difference, but if you truly want to optimise your code, it is always better to benchmark things yourself and check the results.
If you are wondering whether it's worth it to optimise like that, imagine you would have this check multiple times for every pixel on your screen, or scenarios like that. IMHO, it is always worth it to optimise, even if it's only to teach yourself good habits ;)
The non-negated approach, which you used first, seems best.
The only way to know for sure is to code up both versions and measure their performance. If the difference is only a percent or so, use the version that more clearly conveys the intent.
It's very unlikely that you're going to see a significant difference between the two.
The performance difference between them is negligible, so just think about the readability of the code. For readability I prefer the one which puts more lines of code in the if branch.
if (a == x) {
// x lines of code
} else {
// y lines of code where y < x
}

Is x >= 0 more efficient than x > -1?

Doing a comparison in C++ with an int is x >= 0 more efficient than x > -1?
short answer: no.
longer answer, to provide some educational insight: it depends entirely on your compiler, although I bet that every sane compiler creates identical code for the two expressions.
example code:
int func_ge0(int a) {
return a >= 0;
}
int func_gtm1(int a) {
return a > -1;
}
and then compile and compare the resulting assembler code:
% gcc -S -O2 -fomit-frame-pointer foo.cc
yields this:
_Z8func_ge0i:
.LFB0:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
movl 4(%esp), %eax
notl %eax
shrl $31, %eax
ret
.cfi_endproc
vs.
_Z9func_gtm1i:
.LFB1:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
movl 4(%esp), %eax
notl %eax
shrl $31, %eax
ret
.cfi_endproc
(compiler: g++-4.4)
conclusion: don't try to outsmart the compiler, concentrate on algorithms and data structures, benchmark and profile real bottlenecks, if in doubt: check the output of the compiler.
You can look at the resulting assembly code, which may differ from architecture to architecture, but I would bet that the resulting code for either would require exactly the same cycles.
And, as mentioned in the comments - better write what's most comprehensible, optimize when you have real measured bottlenecks, which you can identify with a profiler.
BTW: as rightly mentioned, x > -1 may cause problems if x is unsigned. The -1 is implicitly converted to unsigned (although you should get a warning on that), which yields an incorrect result.
The last time I answered such a question I just wrote "measure", and filled out with periods until SO accepted it.
That answer was downvoted 3 times in a few minutes, and deleted (along with at least one other answer of the question) by an SO moderator.
Still, there is no alternative to measuring.
So it is the only possible answer.
And in order to go on and on about this in sufficient detail that the answer is not just downvoted and deleted, you need to keep in mind that what you're measuring is just that: that a single set of measurements does not necessarily tell you anything in general, but just a specific result. Of course it might sound patronizing to mention such obvious things. So, OK, let that be it: just measure.
Or, should I perhaps mention that most processors have a special instruction for comparing against zero, and yet that this does not allow one to conclude anything about the performance of your code snippets?
Well, I think I stop there. Remember: measure. And don't optimize prematurely!
EDIT: an amendment with the points mentioned by #MooingDuck in the commentary.
The question:
Doing a comparison in C++ with an int is x >= 0 more efficient than x > -1?
What’s wrong with the question
Donald Knuth, author of the classic three volume work The Art of Computer Programming, once wrote[1],
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil”
How efficient x >= 0 is compared to x > -1 is most often irrelevant. I.e. it’s most likely a wrong thing to focus on.
How clearly it expresses what you want to say, is much more important. Your time and the time of others maintaining this code is generally much more important than the execution time of the program. Focus on how well the code communicates to other programmers, i.e., focus on clarity.
Why the focus of the question is wrong
Clarity affects the chance of correctness. Any code can be made arbitrarily fast if it does not need to be correct. Correctness is therefore most important, which means that clarity is very important – much more important than shaving a nanosecond off the execution time…
And the two expressions are not equivalent wrt. clarity, and wrt. to their chance of being correct.
If x is a signed integer, then x >= 0 means exactly the same as x > -1. But if x is an unsigned integer, e.g. of type unsigned, then x > -1 means x > static_cast<unsigned>(-1) (via implicit promotion), which in turn means x > std::numeric_limits<unsigned>::max(). Which is presumably not what the programmer meant to express!
Another reason why the focus is wrong (it’s on micro-efficiency, while it should be on clarity) is that the main impact on efficiency comes in general not from timings of individual operations (except in some cases from dynamic allocation and from the even slower disk and network operations), but from algorithmic efficiency. For example, writing …
string s = "";
for( int i = 0; i < n; ++i ) { s = s + "-"; }
is pretty inefficient, because it uses time proportional to the square of n, O(n²), quadratic time.
But writing instead …
string s = "";
for( int i = 0; i < n; ++i ) { s += "-"; }
reduces the time to proportional to n, O(n), linear time.
With the focus on individual operation timings one could be thinking now about writing '-' instead of "-", and such silly details. Instead, with the focus on clarity, one would be focusing on making that code more clear than with a loop. E.g. by using the appropriate string constructor:
string s( n, '-' );
Wow!
Finally, a third reason to not sweat the small stuff is that in general it’s just a very small part of the code that contributes disproportionally to the execution time. And identifying that part (or parts) is not easy to do by just analyzing the code. Measurements are needed, and this kind of "where is it spending its time" measurement is called profiling.
How to figure out the answer to the question
Twenty or thirty years ago one could get a reasonable idea of efficiency of individual operations, by simply looking at the generated machine code.
For example, you can look at the machine code by running the program in a debugger, or you can use the appropriate option to ask the compiler to generate an assembly language listing. Note for g++: the option -masm=intel is handy for telling the compiler not to generate ungrokkable AT&T syntax assembly, but instead Intel syntax assembly (e.g., Microsoft's assembler uses extended Intel syntax).
Today the computer's processor is smarter. It can execute instructions out of order, and even before their effect is needed for the "current" point of execution. The compiler may be able to predict that (by incorporating effective knowledge gleaned from measurements), but a human has little chance.
The only recourse for the ordinary programmer is therefore to measure.
Measure, measure, measure!
And in general this involves doing the thing to be measured, a zillion times, and dividing by a zillion.
Otherwise the startup time and take-down time will dominate, and the result will be garbage.
Of course, if the generated machine code is the same, then measuring will not tell you anything useful about the relative difference. It can then only indicate something about how large the measurement error is. Because you know then that there should be zero difference.
Why measuring is the right approach
Let’s say that theoretical considerations in an SO answer indicated that x >= 0 will be slower than x > -1.
The compiler can beat any such theoretical consideration by generating awful code for the supposedly faster form, perhaps due to a contextual "optimization" opportunity that it then (unfortunately!) recognizes.
The computer's processor can likewise make a mess out of the prediction.
So in any case you then have to measure.
Which means that the theoretical consideration has told you nothing useful: you’ll be doing the same anyway, namely, measuring.
Why this elaborated answer, while apparently helpful, is IMHO really not
Personally I would prefer the single word “measure” as an answer.
Because that’s what it boils down to.
Anything else the reader not only can figure out on his own, but will have to figure out the details of anyway – so that it’s just verbiage to try to describe it here, really.
References:
[1] Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p.268.
Your compiler is free to decide how to implement those (which assembly instructions to use). Because of that, there is no difference. One compiler could implement x > -1 as x >= 0 and another could implement x >= 0 as x > -1. If there is any difference (unlikely), your compiler will pick the better one.
They should be equivalent. Both will be translated by the compiler into a single assembly instruction (neglecting that both will need to load x into a register). Any modern-day processor has a 'greater-than' instruction and a 'greater-than-or-equal' instruction. And since you are comparing against a constant value, it will take the same amount of time.
Don't fret over the minute details, find the big performance problems (like algorithm design) and attack those, look at Amdahls Law.
I doubt there is any measurable difference. The compiler should emit assembly code with a jump instruction such as JGE (jump if greater or equal) or JG (jump if greater). These instructions likely span the same number of cycles.
Ultimately, it doesn't matter. Just use what is more clear to a human reader of your code.

rate ++a, a++, a=a+1 and a+=1 in terms of execution efficiency in C. Assume gcc to be the compiler [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Is there a performance difference between i++ and ++i in C++?
In terms of usage of the following, please rate in terms of execution time in C.
In some interviews I was asked which of these I should use and why.
a++
++a
a=a+1
a+=1
Here is what g++ -S produces:
void irrelevant_low_level_worries()
{
int a = 0;
// movl $0, -4(%ebp)
a++;
// incl -4(%ebp)
++a;
// incl -4(%ebp)
a = a + 1;
// incl -4(%ebp)
a += 1;
// incl -4(%ebp)
}
So even without any optimizer switches, all four statements compile to the exact same machine code.
You can't rate the execution time in C, because it's not the C code that is executed. You have to profile the executable code compiled with a specific compiler running on a specific computer to get a rating.
Also, rating a single operation doesn't give you something that you can really use. Todays processors execute several instructions in parallel, so the efficiency of an operation relies very much on how well it can be paired with the instructions in the surrounding code.
So, if you really need to use the one that has the best performance, you have to profile the code. Otherwise (which is about 98% of the time) you should use the one that is most readable and best conveys what the code is doing.
The circumstances where these kinds of things actually matter are rare and few and far between. Most of the time, it doesn't matter at all. In fact I'm willing to bet that this is the case for you.
What is true for one language/compiler/architecture may not be true for others. And really, the fact is irrelevant in the bigger picture anyway. Knowing these things does not make you a better programmer.
You should study algorithms, data structures, asymptotic analysis, clean and readable coding style, programming paradigms, etc. Those skills are a lot more important in producing performant and manageable code than knowing these kinds of low-level details.
Do not optimize prematurely, but also, do not micro-optimize. Look for the big picture optimizations.
This depends on the type of a as well as on the context of execution. If a is of a primitive type and if all four statements have the same identical effect then these should all be equivalent and identical in terms of efficiency. That is, the compiler should be smart enough to translate them into the same optimized machine code. Granted, that is not a requirement, but if it's not the case with your compiler then that is a good sign to start looking for a better compiler.
For most compilers it should compile to the same ASM code.
Same.
For more details see http://www.linux-kongress.org/2009/slides/compiler_survey_felix_von_leitner.pdf
I can't see why there should be any difference in execution time, but feel free to prove me wrong.
a++
and
++a
are not the same, however, but that is a question of semantics rather than efficiency.
When it comes to performance of individual lines, context is always important, and guessing is not a good idea. Test and measure is better
In an interview, I would go with two answers:
At first glance, the generated code should be very similar, especially if a is an integer.
If execution time was definitely a known problem - you have to measure it using some kind of profiler.
Well, you could argue that a++ is short and to the point. It can only increment a by one, but the notation is very well understood. a=a+1 is a little more verbose (not a big deal, unless you have variablesWithGratuitouslyLongNames), but some might argue it's more "flexible" because you can replace the 1 or either of the a's to change the expression. a+=1 is maybe not as flexible as the other two but is a little more clear, in the sense that you can change the increment amount. ++a is different from a++ and some would argue against it because it's not always clear to people who don't use it often.
In terms of efficiency, I think most modern compilers will produce the same code for all of these but I could be mistaken. Really, you'd have to run your code with all variations and measure which performs best.
(assuming that a is an integer)
It depends on the context, and on whether we are in C or C++. In C, the code you posted (except for a-- :-) will cause a modern C compiler to produce exactly the same code. But very likely the expected answer is that a++ is the fastest and a=a+1 the slowest, since ancient compilers relied on the user to perform such optimizations.
In C++ it depends on the type of a. When a is a numeric type, it acts the same way as in C, which means a++, a+=1 and a=a+1 generate the same code. When a is an object, it depends on whether any operator (++, + and =) is overloaded, since then the overloaded operator of the a object is called.
Also, when you work in a field with very specialized compilers (such as microcontrollers or embedded systems), those compilers can behave very differently on each of these variations.

Why is this code slower even if the function is inlined?

I have a method like this :
bool MyFunction(int& i)
{
switch(m_step)
{
case 1:
if (AComplexCondition)
{
i = m_i;
return true;
}
case 2:
// some code
case 3:
// some code
}
}
Since there are lots of case statements (more than 3) and the function is becoming large, I tried to extract the code in case 1 and put it in an inline function like this:
inline bool funct(int& i)
{
if (AComplexCondition)
{
i = m_i;
return true;
}
return false;
}
bool MyFunction(int& i)
{
switch(m_step)
{
case 1:
if (funct(i))
{
return true;
}
case 2:
// some code
case 3:
// some code
}
}
It seems this code is significantly slower than the original. I checked with -Winline and the function is inlined. Why is this code slower? I thought it would be equivalent. The only difference I see is there is one more conditional check in the second version, but I thought the compiler should be able to optimize it away. Right?
edit:
Some people suggested that I should use gdb to step over every assembly instruction in both versions to see the differences. I did this.
The first version look like this :
mov
callq (Call to AComplexCondition())
test
je (doesn't jump)
mov (i = m_i)
movl (m_step = 1)
The second version, which is a bit slower, seems simpler.
movl (m_step = 1)
callq (Call to AComplexCondition())
test
je (doesn't jump)
mov (i = m_i)
xchg %ax,%ax (This is a nop I think)
These two versions seem to do the same thing, so I still don't know why the second one is slower.
Just step through it. Plant a breakpoint, go into the disassembly view, and start stepping.
All mysteries will vanish.
This is very hard to track down. One problem could be code bloat causing the majority of the loop to be pushed out of the (small) CPU cache... but that doesn't entirely make sense either, now that I think of it.
What I suggest doing:
Isolate the code and condition as much as possible while still being able to observe the slowdown.
Then, go profile it. Does the profiling make sense? Now (assuming you're up for the adventure), disassemble the code and look at what g++ is doing differently. Report those results back here.
GMan is correct: inline doesn't guarantee that your function will be inlined. It is a hint to the compiler that it might be a good idea. If the compiler doesn't think it is wise to inline the function, you now have the overhead of a function call, which at the very least means two JMP instructions being executed. The instructions for the function are stored in a non-sequential location, not in the next memory location after the call site, so execution jumps to that new location, completes the function, and jumps back to just after your function call.
Without seeing the ComplexCondition part, it's hard to say. If that condition is sufficiently complex, the compiler won't be able to pipeline it properly and it will interfere with the branch prediction in the chip. Just a possibility.
Does the assembler tell you anything about what's happening? It might be easier to look at the disassembly than to have us guess, although I go along with iaimtomisbehave's jmp idea generally.
This is a good question. Let us know what you find. I do have a few thoughts, mostly stemming from the compiler no longer being able to break up the inlined code, but no guaranteed answer.
1. Statement order. It makes sense that the compiler would put this statement, with its complex code, last. That means the other cases would be evaluated first, and it would never get checked unless necessary. If you simplify the statement it might not do this, meaning your crazy conditional gets fully evaluated every time.
2. Creating extra cases. It should be possible to pull some of the conditionals out of the if statement and make an extra case statement in some circumstances. That could eliminate some checking.
3. Pipelining defeated. Even if it inlines, the compiler won't be able to break up the code inside the inlined body. This is the basic issue with all three of these, but with pipelining it causes obvious problems, since for pipelining you want to start executing before you get to the check itself.