Why is this code slower even if the function is inlined? - c++

I have a method like this:
bool MyFunction(int& i)
{
    switch (m_step)
    {
    case 1:
        if (AComplexCondition)
        {
            i = m_i;
            return true;
        }
    case 2:
        // some code
    case 3:
        // some code
    }
}
Since there are lots of case statements (more than 3) and the function is becoming large, I tried to extract the code in case 1 and put it in an inline function like this:
inline bool funct(int& i)
{
    if (AComplexCondition)
    {
        i = m_i;
        return true;
    }
    return false;
}
bool MyFunction(int& i)
{
    switch (m_step)
    {
    case 1:
        if (funct(i))
        {
            return true;
        }
    case 2:
        // some code
    case 3:
        // some code
    }
}
It seems this code is significantly slower than the original. I checked with -Winline and the function is inlined. Why is this code slower? I thought it would be equivalent. The only difference I see is there is one more conditional check in the second version, but I thought the compiler should be able to optimize it away. Right?
Edit:
Some people suggested that I should use gdb to step over every assembly instruction in both versions to see the differences. I did this.
The first version looks like this:
mov
callq (Call to AComplexCondition())
test
je (doesn't jump)
mov (i = m_i)
movl (m_step = 1)
The second version, which is a bit slower, seems simpler:
movl (m_step = 1)
callq (Call to AComplexCondition())
test
je (doesn't jump)
mov (i = m_i)
xchg %ax,%ax (This is a nop I think)
These two versions seem to do the same thing, so I still don't know why the second one is slower.

Just step through it. Plant a breakpoint, go into the disassembly view, and start stepping.
All mysteries will vanish.

This is very hard to track down. One problem could be code bloat pushing the majority of the loop out of the (small) CPU cache... but that doesn't entirely make sense either, now that I think of it.
What I suggest doing:
Isolate the code and condition as much as possible while still being able to observe the slowdown.
Then, go profile it. Does the profiling make sense? Now (assuming you're up for the adventure), disassemble the code and look at what g++ is doing differently. Report those results back here.

GMan is correct: inline doesn't guarantee that your function will be inlined; it is a hint to the compiler that it might be a good idea. If the compiler doesn't think it is wise to inline the function, you now have the overhead of a function call, which at the very least means two JMP instructions being executed. The instructions for the function are stored in a non-sequential location, not in the next memory location after the call site, so execution has to jump to that new location, complete the function, and jump back to just after the call.

Without seeing the AComplexCondition part, it's hard to say. If that condition is sufficiently complex, the compiler won't be able to pipeline it properly, and it will interfere with the chip's branch prediction. Just a possibility.

Does the assembly tell you anything about what's happening? It might be easier to look at the disassembly than to have us guess, although I go along with iaimtomisbehave's JMP idea generally.

This is a good question. Let us know what you find. I do have a few thoughts, mostly stemming from the compiler no longer being able to break up the code you have inlined, but no guaranteed answer.
Statement order. It makes sense that the compiler would put this case, with its complex condition, last. That means the other cases would be evaluated first and it would never get checked unless necessary. If you simplify the condition, it might not do this, meaning your crazy conditional gets fully evaluated every time.
Creating extra cases. It should be possible in some circumstances to pull some of the conditionals out of the if statement and make an extra case statement out of them. That could eliminate some checking.
Pipelining defeated. Even if the function is inlined, the compiler won't be able to break up the code inside the inlined body at all. This is the basic issue with all three of these points, but with pipelining it causes obvious problems, since you want to start executing before you reach the check itself.

Related

Which is the more efficient dispatch method for a virtual machine?

Which would be a more efficient dispatch method for making my fetch-decode-execute times that little bit quicker?
For simplicity, I've kept this to a minimum: operations take 1-byte operands and there are only two instructions, for example.
The method I'm using at the moment (simplified) is:
typedef unsigned char byte;

enum INST {
    PUSH = 0, /*index=0*/
    POP  = 1, /*index=1*/
};

vector<byte> _program = { INST::PUSH, 32, INST::POP };

//DISPATCHING METHOD #1
switch (curr_instruction) {
    case INST::PUSH: {
        /*declared inline*/ _push_to_stack(_program[instr_ptr + 1]);
        break;
    }
    case INST::POP: {
        /*declared inline*/ _pop_stack();
        break;
    }
}
OR using a function pointer table to execute each instruction in the 'program' (the vector of bytes, _program), like so:
typedef void (*funcptr)();

void hndl_push(){
    /*declared inline*/ _push_to_stack(_program[instr_ptr + 1]);
}
void hndl_pop(){
    /*declared inline*/ _pop_stack();
}

funcptr handlers[2] = { &hndl_push /*index=0*/, &hndl_pop /*index=1*/ };

vector<byte> _program = { INST::PUSH, 32, INST::POP };
size_t instr_ptr = 0;

//DISPATCHING METHOD #2
while (instr_ptr != _program.size()){
    handlers[_program[instr_ptr]]();
    instr_ptr++;
}
I am using the VC++ (Visual Studio) compiler, the 2015 version.
Which of these is converted into more efficient assembler with the least overhead, or are they the same?
Thank you in advance!
The only way to know which would be faster is to measure.
The optimizer may be able to do quite a bit with either technique. Dense switch statements are often reduced to a jump table, and the function calls may be inlined, so that could be the fastest approach.
But if, for whatever reason, the optimizer cannot inline, or if the switch statement becomes a cascade of if-else statements, then the function pointer calls may be faster.
Wikipedia has a decent article on threaded code, which describes various techniques to dispatch opcodes in a virtual machine or interpreter.
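For a flavor of what those techniques look like, here is a minimal sketch of direct-threaded dispatch using the GCC/Clang labels-as-values extension (not standard C++; the opcodes and the tiny accumulator machine are invented for the example):

#include <cstdio>
#include <vector>

// Hypothetical opcodes for the sketch.
enum Op { OP_INC, OP_DEC, OP_HALT };

int run(const std::vector<unsigned char>& program) {
    // One label per opcode; &&label is the GCC "labels as values" extension.
    const void* table[] = { &&op_inc, &&op_dec, &&op_halt };
    int acc = 0;
    size_t ip = 0;
    goto *table[program[ip]];          // dispatch the first opcode
op_inc:
    ++acc; goto *table[program[++ip]]; // each handler jumps straight to the next
op_dec:
    --acc; goto *table[program[++ip]];
op_halt:
    return acc;
}

int main() {
    std::vector<unsigned char> prog = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    std::printf("%d\n", run(prog));    // prints 1
}

Unlike a switch in a loop, there is no single shared dispatch branch; each handler ends with its own indirect jump, which can help the indirect branch predictor.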
How could the second solution possibly be quicker than the first, when at best the compiler could convert the second into the first anyway?
As a side note, you need to advance the program pointer by a different amount depending on the opcode.
Indirect branch prediction is hard, but at least there's only the one unconditional branch. The branch predictor needs to correctly predict the branch target address to be able to keep the pipeline fed.
However, an unpredictable conditional branch is bad, too. With enough cases, a single indirect branch will do better than multiple conditional branches, so that's what compilers do under the hood. With just two cases, you will almost certainly get better results from letting the compiler choose how to implement the switch.
Conditional branch predictors in some CPUs might be better at recognizing simple (but non-trivial) patterns than indirect branch predictors, but Intel SnB-family CPUs at least can recognize some target-address patterns for indirect branches. (Agner Fog's microarch pdf has a bit of info on branch predictors.)
See Lookup table vs switch in C embedded software, including my answer on that question where I posted some x86 compiler output.
Note the fact that clang will use a jump table to branch to a call instruction, rather than putting the function pointers themselves into an array. So when a jump table is the right choice, implementing it yourself with an array of function pointers will make better code than current clang.
Your code is exactly the same case: dispatching to a lot of functions that all have the same prototype and (lack of) args, so they can go into an array of function pointers.
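To make that concrete, here is a minimal sketch of dispatch through an array of function pointers (the handlers and the one-byte program are invented for the example; your real handlers would do the stack work):

#include <cstdio>

// Hypothetical handlers sharing one prototype so they fit in one table.
static int g_acc = 0;
void op_inc() { ++g_acc; }
void op_dec() { --g_acc; }

typedef void (*Handler)();

// The opcode byte indexes straight into the table: one load plus one
// indirect call per instruction, and no conditional branches.
static const Handler handlers[] = { op_inc, op_dec };

int main() {
    const unsigned char program[] = { 0, 0, 1 }; // inc, inc, dec
    for (unsigned char op : program)
        handlers[op]();
    std::printf("%d\n", g_acc); // prints 1
}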

C++ optimization: if/else performance

Consider these two versions of an if statement:
if( error == 0 )
{
    // DO success stuff
}
else
{
    // DO error handling stuff
}
and this one:
if( error != 0 )
{
    // DO error handling stuff
}
else
{
    // DO success stuff
}
Which one outperforms the other, knowing that most of the time execution takes the success code path?
Rather than worrying about this, which might be a performance issue only in the rarest of cases, you should ask yourself which is more readable. For error checks, you could use a guard clause, which avoids too many indentations/brackets:
if( error != 0 )
{
    // DO error handling stuff
    return;
}
// DO success stuff
If you know that one path is more likely than the other and you are sure that this is really performance critical, you could let the compiler know (example for GCC):
if( __builtin_expect (error == 0, 1) )
{
    // DO success stuff
}
else
{
    // DO error handling stuff
}
Of course, this makes the code harder to read - only use it if really necessary.
It depends.
When the code runs just once, it is statistically faster to put the more likely case first - if and only if the CPU's branch-prediction implementation is not sharing counters between many branches (e.g. every 16th statement sharing the same counter). However, this is not how most code runs; it runs multiple, a dozen, a trillion times (e.g. in a while loop).
Multiple runs
Neither will perform better than the other. The reason is branch prediction. Every time your program runs an if statement, the CPU counts up or down the number of times the statement was true. This way it can predict the outcome with high accuracy the next time the code runs. If you test your code a billion times, you will see it doesn't matter whether the if or the else part gets executed: the CPU will optimize for whatever it thinks is the most likely case.
This is a simplified explanation, as CPU branch prediction is smart enough to also notice when some code always flip-flops: true, false, true, false, or even true, true, false, true, true, false...
You can learn a lot from the Wikipedia article on branch prediction.
GCC's default behavior is to optimize for the true case of the if statement. Based on that, it will choose whether je or jne should be used.
If you know which code path is more likely and want fine control over it, use the following macros.
#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)
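Used like this, for example (handle_error is a placeholder for the example; __builtin_expect is GCC/Clang-specific):

#include <cstdio>

#define likely(x) __builtin_expect((x),1)
#define unlikely(x) __builtin_expect((x),0)

// Placeholder error handler, just for the example.
static void handle_error(int e) { std::fprintf(stderr, "error %d\n", e); }

int process(int error) {
    if (unlikely(error != 0)) { // tell the compiler the error path is cold
        handle_error(error);
        return -1;
    }
    // success path: laid out as the straight-line fall-through
    return 0;
}

int main() { return process(0); }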
They will perform identically. Observed at the assembly level, you are just trading a jz instruction for a jnz. Neither executes more instructions, or more complex ones, than the other.
It is quite unlikely that you will notice much of a difference between the two pieces of code. Comparing with 0 is the same operation whether the code later jumps on "true" or on "false". So, use the form that best expresses your meaning in the code, rather than trying to "outsmart the compiler" (unless you are REALLY good at it, you will probably just confuse things).
Since you have if ... else ... in both cases, most compilers will generate a single return point; so even if you have a return in the middle of the function, it will still emit a branch from that return to the bottom of the function, and a branch to jump over the else part when the condition is false.
The only really effective way to solve this is to use hints that tell the compiler which branch is likely taken, which on at least some processors can be beneficial (and the compiler can turn the branches around so that the less likely path is the one that jumps). But it's also rather unportable, since the C and C++ languages don't have any standard feature for giving such feedback to the compiler; some compilers do implement such things as extensions.
Of course, the effect of this is VERY dependent on the actual processor (modern x86 has hints to the processor that feed into the branch prediction unit when there is no history for a particular branch - older x86, as used in some embedded systems, won't have that; other processors may or may not have the same feature - I believe ARM has a couple of bits to say "this is likely taken/not taken" as well). Ideally, you want profile-driven optimisation here, so the compiler can instrument and organise the code based on the most likely variants.
Always use profiling and benchmarks to measure the results of any optimisation. It is often difficult to guess what is better just by looking at the code (even more so if you don't see the machine code the compiler generates).
Any compiler should optimize the difference away. Proof below. If error is set at runtime, then...
Using g++4.8 with -O3
This
int main(int argc, char **argv) {
    bool error = argv[1];
    if( error ){
        return 0;
    }else{
        return 1;
    }
}
makes...
main:
xorl %eax, %eax
cmpq $0, 8(%rsi)
setne %al
ret
and this...
int main(int argc, char **argv) {
    bool error = argv[1];
    if( !error ){
        return 1;
    }else{
        return 0;
    }
}
...makes...
main:
xorl %eax, %eax
cmpq $0, 8(%rsi)
setne %al
ret
Same stuff to the CPU. Use the machine code when in doubt. http://gcc.godbolt.org/

What is faster: compare then change, or change immediately?

Say I'm running very fast loops and I have to be sure that at the end of each loop the variable a is SOMEVALUE. Which will be faster?
if (a != SOMEVALUE) a = SOMEVALUE;
or just instantly do
a = SOMEVALUE;
Is it float/int/bool/language specific?
Update: a is a primitive type, not a class, and the probability of the comparison being true is 50%. I know that the algorithm is what makes a loop fast, so my question is also about coding style.
Update2: thanks everyone for quick answers!
In almost all cases just setting the value will be faster.
It might not be faster when you have to deal with cache-line sharing with other CPUs, or if a is in some special type of memory, but it's safe to assume that a branch misprediction is a more common problem than cache sharing.
Also - smaller code is better, not just for the cache but also for making the code comprehensible.
If in doubt - profile.
The general answer is to profile for this kind of question. However, in this case a simple analysis is available:
Each test is a branch, and each branch incurs a slight performance penalty. However, we have branch prediction, and this penalty is somewhat amortized over time, depending on how many iterations your loop has and how often the prediction is correct.
Translated to your case: if you have many changes to a during the loop, it is very likely that the code using if will be worse in performance. On the other hand, if the value is updated very rarely, there will be a vanishingly small difference between the two cases.
Still, changing immediately is better and should be used, as long as you don't care about the previous value, as your snippets show.
Other reasons for an immediate change: it leads to smaller code, thus better cache locality and better code performance. It is a very rare situation in which updating a will invalidate a cache line and incur a performance hit; if I remember correctly, this will bite you only in multiprocessor cases, and very rarely at that.
Keep in mind that there are cases when the two are not equivalent. With floating point, NaN compares unequal to everything, including itself, so the test may not behave the way you expect.
Also, this treats only the case of C. In C++ you can have classes where the assignment operator / copy constructor takes longer than testing for equality. In that case, you might want to test first.
Taking into account your update, it's better to simply use assignment, as long as you're sure of not dealing with floating-point corner cases. Coding-style-wise it is also better: easier to read.
You should profile it.
My guess would be that there is little difference, depending on how often the test is true (this is due to branch-prediction).
Of course, just setting it has the smallest absolute code size, which frees up instruction cache for more interesting code.
But, again, you should profile it.
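If you do want to measure it, a minimal sketch along these lines is enough to compare the two on your own hardware (the ~50% match rate mirrors the question's update; the volatile sink just keeps the loops observable; all names here are invented for the example):

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

volatile int sink; // keeps the loops' results observable to the optimizer

int main() {
    const int N = 10000000;
    std::vector<int> data(N);
    for (int& v : data) v = std::rand() % 2; // ~50% chance a already matches

    auto t0 = std::chrono::steady_clock::now();
    int a = 0;
    for (int v : data)
        if (a != v) a = v;  // compare, then change
    sink = a;

    auto t1 = std::chrono::steady_clock::now();
    a = 0;
    for (int v : data)
        a = v;              // change immediately
    sink = a;

    auto t2 = std::chrono::steady_clock::now();
    std::printf("branchy:  %lld us\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    std::printf("straight: %lld us\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
}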
I would be surprised if the answer weren't a = somevalue, but there is no generic answer to this question. Firstly, it depends on the speed of a copy versus the speed of an equality comparison; if the equality comparison is very fast, your first option may be better. Secondly, as always, it depends on your compiler/platform. The only way to answer such questions is to try both methods and time them.
As others have said, profiling is going to be the easiest way to tell, as it depends a lot on what kind of input you're throwing at it. However, if you think about the computational complexity of the two algorithms, the more input you throw at them, the smaller any possible difference between them becomes.
As you are asking this for a C++ program, I assume that you are compiling the code into native machine instructions.
Assigning the value directly without any comparison should be much faster in any case. To compare the values, both a and SOMEVALUE have to be loaded into registers and a compare instruction (cmp) has to be executed.
But in the latter case, where you assign directly, you just move one value from one memory location to another.
The only way the assignment could be slower is if memory writes were significantly costlier than memory reads, and I don't see that happening.
Profile the code. Change accordingly.
For basic types, the no-branch option should be faster. MSVC, for example, doesn't optimize the branch out.
That being said, here's an example of where the comparison version is faster:
#include <iostream>
using namespace std;

struct X
{
    bool comparisonDone;
    X() : comparisonDone(false) {}
    bool operator != (const X& other) { comparisonDone = true; return true; }
    X& operator = (const X& other)
    {
        if ( !comparisonDone )
        {
            for ( int i = 0 ; i < 1000000 ; i++ )
                cout << i;
        }
        return *this;
    }
};

int main()
{
    X a;
    X SOMEVALUE;
    if (a != SOMEVALUE) a = SOMEVALUE;
    a = SOMEVALUE;
}
Change immediately is usually faster, as it involves no branch in the code.
As commented below and answered by others, it really depends on many variables, but IMHO the real question is: do you care what the previous value was? If you do, you should check; otherwise, you shouldn't.
That if can actually be 'optimized away' by some compilers, basically turning the if into code noise (for the programmer who's reading it).
When I compile the following function with GCC for x86 (with -O1, which is a pretty reasonable optimization level):
int foo (int a)
{
    int b;
    if (b != a)
        b = a;
    b += 5;
    return b;
}
GCC just 'optimizes' the if and the assignment away, and simply uses the argument to do the addition:
foo:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
popl %ebp
addl $5, %eax
ret
.ident "GCC: (GNU) 4.4.3"
Having or not having the if generates exactly the same code.

Constant embedded for loop condition optimization in C++ with gcc

Will a compiler optimize this:
bool someCondition = someVeryTimeConsumingTask(/* ... */);
for (int i=0; i<HUGE_INNER_LOOP; ++i)
{
    if (someCondition)
        doCondition(i);
    else
        bacon(i);
}
into:
bool someCondition = someVeryTimeConsumingTask(/* ... */);
if (someCondition)
    for (int i=0; i<HUGE_INNER_LOOP; ++i)
        doCondition(i);
else
    for (int i=0; i<HUGE_INNER_LOOP; ++i)
        bacon(i);
someCondition is trivially constant within the for loop.
This may seem obvious, and I could just do it myself, but if you have more than one condition you are dealing with permutations of for loops, so the code would get quite a bit longer. I am deciding whether to do it (I am already optimizing) or whether it would be a waste of my time.
It's possible that the compiler might rewrite the code as you did, but I've never seen such an optimization.
However, there is something called branch prediction in modern CPUs. In essence, it means that when the processor is asked to execute a conditional jump, it starts executing what is judged to be the likeliest branch before the condition is even evaluated. This is done to keep the pipeline full of instructions.
If the processor guesses wrong (and takes the bad branch), it causes a flush of the pipeline: this is called a misprediction.
A very common trait of this feature is that if the same test produces the same result several times in a row, the branch prediction algorithm will assume it keeps producing that result... which is of course tailor-made for loops :)
It makes me smile that you are worrying about the if inside the for body, while the for itself relies on branch prediction too: its condition must be evaluated at each iteration to decide whether or not to continue ;)
So don't worry about it; it costs less than a cache miss.
Now, if you really are worried about this, there is always the functor approach.
typedef void (*functor_t)(int);

functor_t func = 0;
if (someCondition) func = &doCondition;
else               func = &bacon;

for (int i=0; i<HUGE_INNER_LOOP; ++i) (*func)(i);
which sure looks much better, doesn't it? The obvious drawback is the need for compatible signatures, but you can write wrappers around the functions for that. As long as you don't need to break or return from inside the loop, you'll be fine with this; otherwise you would need an if in the loop body :D
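Another option, if your compiler supports C++17, is to lift the condition into a template parameter, so each instantiation contains a loop with no branch in its body. A sketch, with doCondition and bacon standing in for the real work:

void doCondition(int) { /* the work from the question */ }
void bacon(int)       { /* the other branch */ }

// Each instantiation contains a branch-free loop body; if constexpr
// (C++17) discards the untaken side at compile time.
template <bool Cond>
void run_loop(int n) {
    for (int i = 0; i < n; ++i) {
        if constexpr (Cond)
            doCondition(i);
        else
            bacon(i);
    }
}

void dispatch(bool someCondition, int n) {
    // One runtime branch total, hoisted out of the loop.
    if (someCondition) run_loop<true>(n);
    else               run_loop<false>(n);
}

This scales to multiple conditions without writing every permutation by hand: the compiler stamps out the permutations for you.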
It does not seem to do so with either -O2 or -O3. This is something you can (and should, if you are concerned with optimisation) test for yourself: compile with the optimisation level you are interested in and examine the emitted assembly language.
Have you profiled your app to find out where the slowdowns are? If not, why are you even thinking about optimization? Until you know which methods need to be optimized, you're wasting your time worrying about micro-optimizations like this.
Is this the location of the slowdown? If so, then what you're doing may be useful. Yes, the compiler may optimize this, but there's no guarantee that it does. If this isn't the location of the slowdown, then look elsewhere; the cost of one additional branch every time through the loop is probably trivial relative to all of the other work you're doing.

Force compiler to not optimize side-effect-less statements

I was reading some old game programming books and as some of you might know, back in that day it was usually faster to do bit hacks than do things the standard way. (Converting float to int, mask sign bit, convert back for absolute value, instead of just calling fabs(), for example)
Nowadays it's almost always better to just use the standard library math functions, since these tiny things are hardly the cause of most bottlenecks anyway.
But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:
void float_to_int(float f)
{
    int i = static_cast<int>(f); // has no side-effects
}
Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.
The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.
I'm using Visual Studio 2008 / G++ (3.4.4).
Edit
To clarify, I would like to have all optimizations maxed out, to get good profile results. The problem is that with this the statements with no side-effect will be optimized out, hence the situation.
Edit Again
To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.
We all know that the old tricks aren't very useful anymore, I'm merely curious how not useful they are. Just plain curiosity. Sure, life could go on without me knowing just how these old hacks perform against modern day CPU's, but it never hurts to know.
So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.
Premature quoting of Knuth is the root of all annoyance.
Assignment to a volatile variable should never be optimized away, so this might give you the result you want:
static volatile int i = 0;

void float_to_int(float f)
{
    i = static_cast<int>(f); // the volatile write is now a side effect
}
So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements
You are by definition skewing the results.
Here's how to fix the problem of trying to profile "dummy" code that you wrote just to test: For profiling, save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get any other optimizations it can put in to make the code fast.
In this case I suggest you make the function return the integer value:
int float_to_int(float f)
{
    return static_cast<int>(f);
}
Your calling code can then exercise it with a printf to guarantee it won't optimize it out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.
extern int float_to_int(float f);

int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
    sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);
Now compare this to an empty function like:
int take_float_return_int(float /* f */)
{
    return 1;
}
Which should also be external.
The difference in times should give you an idea of the expense of what you're trying to measure.
What always worked on all compilers I used so far:
extern volatile int writeMe = 0;

void float_to_int(float f)
{
    writeMe = static_cast<int>(f);
}
Note that this skews the results; both methods should write to writeMe.
volatile tells the compiler "the value may be accessed without your notice", so the compiler cannot omit the calculation and drop the result. To block propagation of input constants, you might need to run them through an extern volatile too:
extern volatile float readMe = 0;
extern volatile int writeMe = 0;

void float_to_int(float f)
{
    writeMe = static_cast<int>(f);
}

int main()
{
    readMe = 17;
    float_to_int(readMe);
}
Still, all optimizations in between the read and the write can be applied "with full force". The read and write to the global variable are often good "fenceposts" when inspecting the generated assembly.
Without the extern, the compiler may notice that a reference to the variable is never taken and decide the volatile access can be discarded. Technically, with link-time code generation that might not be enough, but I haven't found a compiler that aggressive. (For a compiler that did remove the access, the reference would need to be passed to a function in a DLL loaded at runtime.)
Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, as long as the code behaves as if no optimisation took place. However, you can often trick them into not doing so by indicating that the value might be used later, so I would change your code to:
int float_to_int(float f)
{
    return static_cast<int>(f); // the result is now observable by the caller
}
As others have suggested, you will need to examine the assembler output to check that this approach actually works.
You just need to skip to the part where you learn something and read the published Intel CPU optimisation manual.
These quite clearly state that casting between float and int is a really bad idea, because it requires a store from one register file to memory followed by a load into the other. These operations cause a bubble in the pipeline and waste many precious cycles.
A function call incurs quite a bit of overhead, so I would remove this anyway.
Adding a dummy += i; is no problem, as long as you keep that same bit of code in the alternate profile too (i.e. the code you are comparing against).
Last but not least: generate the asm code. Even if you cannot code in asm, the generated code is typically understandable, since it will have labels and the commented C code behind it. So you know (sort of) what happens, and which bits are kept.
R
p.s. found this too:
inline float pslNegFabs32f(float x){
    __asm{
        fld x   //Push 'x' into st(0) of FPU stack
        fabs
        fchs    //change sign
        fstp x  //Pop from st(0) of FPU stack
    }
    return x;
}
Supposedly also very fast; you might want to profile this too (although it is hardly portable code).
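For comparison, a portable modern rendering of the same sign-bit trick (assuming a C++20 compiler for std::bit_cast) would be:

#include <bit>
#include <cstdint>

// Clear the IEEE-754 sign bit: the old mask trick, without inline asm.
float abs_bits(float x) {
    std::uint32_t u = std::bit_cast<std::uint32_t>(x);
    return std::bit_cast<float>(u & 0x7FFFFFFFu);
}

On current compilers this typically compiles down to a single AND instruction, much like std::fabs itself, which makes it a good baseline to benchmark the asm version against.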
Return the value?
int float_to_int(float f)
{
    return static_cast<int>(f); // the result is now used by the caller
}
and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.
You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.
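A related trick, popularized later by Google Benchmark's DoNotOptimize, is an empty inline-asm "escape" that makes the optimizer treat a value as used without emitting any instructions. This is GCC/Clang-specific; a sketch:

// Tells the optimizer that 'value' is used and memory may be touched,
// while emitting no instructions itself (GCC/Clang inline asm).
template <typename T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "g"(value) : "memory");
}

void float_to_int(float f) {
    int i = static_cast<int>(f);
    do_not_optimize(i); // the cast can no longer be dropped
}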
If you are using Microsoft's compiler - cl.exe, you can use the following statement to turn optimization on/off on a per-function level [link to doc].
#pragma optimize("" ,{ on |off })
Turn optimizations off for functions defined after the current line:
#pragma optimize("" ,off)
Turn optimizations back on:
#pragma optimize("" ,on)
For example, compiling the code at the Compiler Explorer link below, you can notice three things:
The compiler optimization flag is set (/O2), so code will get optimized.
Optimizations are turned off for the first function, square(), and turned back on before square2() is defined.
The amount of assembly generated for the first function is higher; in the second function, no assembly at all is generated for the int i = num; statement.
Thus, while the first function is not optimized, the second one is.
See https://godbolt.org/z/qJTBHg for link to this code on compiler explorer.
A similar directive exists for gcc too - https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html
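For GCC that looks like the following sketch (push_options/pop_options bracket the unoptimized region; the functions mirror the cl.exe example above):

#pragma GCC push_options
#pragma GCC optimize ("O0")   // everything down to pop_options is unoptimized

int square(int num) {
    int i = num;  // this store survives, as in the cl.exe example
    return i * i;
}

#pragma GCC pop_options

// Compiled with the normal flags (e.g. -O2) again from here on.
int square2(int num) {
    int i = num;  // optimized away
    return i * i;
}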
A micro-benchmark around this statement will not be representative of using this approach in a genuine scenario; the surrounding instructions and their effect on the pipeline and cache are generally as important as any given statement in itself.
GCC 4 does a lot of micro-optimizations now that GCC 3.4 never did. GCC 4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin(), fabs(), etc., as well as optimizing such calls to their FPU, SSE, or 3DNow! equivalents.
I know the Intel compiler is also extremely good at these kinds of optimizations.
My suggestion is to not worry about micro-optimizations like this - on relatively new hardware (anything built in the last 5 or 6 years), they're almost completely moot.
Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast to int and bit mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus," are completely moot, and as pointed out in another answer, could potentially be slower than instructions on the FPU and in SSE.
All of this is because newer CPUs are pipelined: instructions are decoded and dispatched to fast computation units. Instructions no longer cost a fixed number of clock cycles, and performance is more sensitive to cache misses and inter-instruction dependencies.
Check the AMD and Intel processor programming manuals for all the gritty details.