Do any C or C++ compilers optimize within define macros?

Let's say I have the following in C or C++:
#include <math.h>
#define ROWS 15
#define COLS 16
#define COEFF 0.15
#define NODES (ROWS*COLS)
#define A_CONSTANT (COEFF*(sqrt(NODES)))
Then, I go and use NODES and A_CONSTANT somewhere deep within many nested loops (i.e. used many times). Clearly, both have numeric values that can be ascertained at compile-time, but do compilers actually do it? At run-time, will the CPU have to evaluate 15*16 every time it sees NODES, or will the compiler statically put 240 there? Similarly, will the CPU have to evaluate a square root every time it sees A_CONSTANT?
My guess is that the ROWS*COLS multiplication is optimized out but nothing else is. Integer multiplication is built into the language but sqrt is a library function. If this is indeed the case, is there any way to get a magic number equivalent to A_CONSTANT such that the square root is evaluated only once at run-time?

Macro definitions are expanded by simple textual substitution into the source code before it's handed to the compiler proper, which may do optimization. A compiler will generate exactly the same code for the expressions NODES, ROWS*COLS and 15*16 (and I can't think of a single one that will do the multiplication every time round the loop with optimization enabled).
As for A_CONSTANT, the fact that it is a macro again doesn't matter; what matters is whether the compiler is smart enough to figure out that sqrt of a constant is a constant (assuming that's sqrt from <math.h>). I know GCC is smart enough and I expect other production-quality compilers to be smart enough as well.

Anything in a #define is inserted into the source as a pre-compile step, which means that by the time the code is compiled the macros have basically disappeared and the code is compiled as usual. Whether or not it is optimized depends on your code, your compiler, and your compiler settings.

It depends on your compiler.
#include <math.h>
#define FOO sqrt(5)
double
foo()
{
    return FOO;
}
My compiler (gcc 4.1.2) generates the following assembly for this code:
.LC0:
        .long 2610427048
        .long 1073865591
        .text
        .p2align 4,,15
.globl foo
        .type foo, @function
foo:
.LFB2:
        movsd .LC0(%rip), %xmm0
        ret
.LFE2:
So it does know that sqrt(5) is a compile-time constant.
If your compiler is not so smart, I do not know of any portable way to compute a square root at compile time. (Of course, you can compute the result once and store it in a global or whatever, but that is not the same thing as a compile-time constant.)
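If you do want that "compute once" fallback, a minimal C++ sketch looks like this (the names mirror the question's macros; sqrt runs a single time during static initialization at startup, not at compile time):
#include <math.h>

#define ROWS 15
#define COLS 16
#define COEFF 0.15
#define NODES (ROWS*COLS)

// Evaluated exactly once, at program startup; every later use reads a plain double.
static const double A_CONSTANT = COEFF * sqrt((double)NODES);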

There are really two questions here:
Does the compiler optimize expressions found inside macros?
Does the compiler optimize sqrt()?
(1) is easy: Yes, it does. The preprocessor is separate from the C compiler, and does its thing before the C compiler even starts. So if you have
#define ROWS 15
#define COLS 16
#define NODES (ROWS*COLS)
void foo( )
{
    int data[ROWS][COLS];
    printf( "I have %d pieces of data\n", NODES );
    for ( int *i = &data[0][0]; i < &data[0][0] + NODES; ++i )
    {
        printf("%d ", *i);
    }
}
The compiler will actually see:
void foo( )
{
    int data[15][16];
    printf( "I have %d pieces of data\n", (15*16) );
    for ( int *i = &data[0][0]; i < &data[0][0] + (15*16); ++i )
    {
        printf("%d ", *i);
    }
}
And that is subject to all the usual compile-time constant optimization.
sqrt() is trickier because it varies from compiler to compiler. In most modern compilers, sqrt() is actually a compiler intrinsic rather than a library function — it looks like a function call, but it is actually a special case inside the compiler that has additional heuristics based on mathematical laws, hardware ops, etc. In smart compilers where sqrt() is such a special case, sqrt() of a constant value will be translated internally to a constant number. In stupid compilers, it will result in a function call each time. The only way to know which you're getting is to compile the code and look at the emitted assembly.
From what I've seen, MSVC, modern GCC, Intel, IBM, and SN all handle sqrt as an intrinsic. Old GCC and some crappy vendor-supplied compilers for embedded chips do not.

#defines are handled before compilation, with simple text replacement. The resulting text file is then passed to the actual compilation step.
If you are using gcc, try compiling a source file with the -E switch, which will do the preprocessing and then stop. Look at the generated file to see the actual input to the compilation step.
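For example, assuming a source file named test.c:
gcc -E test.c -o test.i   # preprocess only; test.i shows the code with all macros expanded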

The macro will be substituted, and then the code compiled like the rest of the code. If you've turned on optimization (and the compiler you're using does decent optimization) you can probably expect things like this to be computed at compile time.
To put that in perspective, there are relatively few C++ compilers old enough that you'd expect them to lack optimization like that. Compilers old enough to lack that simple an optimization will generally be C-only, and even then, don't count on it: things like MS C 5.0/5.1/6.0, Datalight/Zortech C, Borland, etc., definitely did this as well. From what I recall, the C compilers that ran on CP/M mostly didn't, though.

Related

Does C++ support named constants which are guaranteed to not take up memory?

The question is somewhat academic, because even a literal is eventually stored in memory, at least in the machine code of the instruction it is used in. Still, is there a way to ensure that an identifier will be done away with at compile time, and not turn into what is essentially a handicapped variable, with a memory location and all?
Unfortunately, no. C++ doesn't specify the object format, and therefore, it also doesn't specify what exactly goes into the object file and what doesn't. Implementations are free to pack as much extra stuff into the binary as they want, or even omit things that they determine to not be necessary under the as-if rule.
In fact, we can make a very simple thought experiment to come to a definitive answer. C++ doesn't require there to be a compiler at all. A conformant C++ interpreter is a perfectly valid implementation of the C++ standard. This interpreter could parse your C++ code into an Abstract Syntax Tree and serialize it to disk. To execute it, it loads the AST and evaluates it, one line of C++ code after the other. Your constexpr variable, #define, enum constants, etc all get loaded into memory by necessity. (This isn't even as hypothetical as you might think: It's exactly what happens during constant evaluation at compile time.)
In other words: The C++ standard has no concept of object format or even compilation. Since it doesn't know what compilation even is, it can't specify any details of that process, so there are no rules on what's kept and what's thrown away during compilation.
The C++ Abstract Machine strikes again.
In practice, there are architectures (like ARM) that don't have instructions to load arbitrary immediates into registers, which means that even a plain old integer literal like 1283572434 will go into a dedicated constant variable section in memory, which you can take the address of. The same can and will happen with constexpr variables, enums, and even #define.
Compilers for x86-64 do this as well for constants that are too large to load via regular mov reg, imm instructions. Very large 256-bit or even 512-bit constants are generally loaded into vector registers by loading them from a constant section somewhere in memory.
Most compilers are of course smart enough to optimize away constants that are only used at compile time. It's not guaranteed by the standard, though, and not even by the compilers themselves.
Here's an example where GCC places a #define-d constant into a variable and loads it from memory when needed (Godbolt):
#include <immintrin.h>
#define THAT_VERY_LARGE_VALUE __m256i{1111, 2222, 3333, 4444}
__m256i getThatValue() {
    return THAT_VERY_LARGE_VALUE;
}
The standard way is enum. It has 3 forms:
enum {THE_VALUE = 42};
Usage: std::cout << THE_VALUE;
enum MyContainerForConstants {THE_VALUE = 42};
Usage: as above, and also std::cout << MyContainerForConstants::THE_VALUE;
enum: unsigned short {THE_VALUE = 42};
You can specify a type if you want.
A macro: #define
#define THE_VALUE 42
Usage: std::cout << THE_VALUE;
A consteval function (C++20). Use this if your constant requires non-trivial code to calculate.
consteval int the_other_value()
{
    int r = 0;
    for (int i = 0; i < 10; ++i)
        r += i;
    return r;
}
Usage: std::cout << the_other_value();
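Putting the forms together, here is a minimal self-contained sketch (with optimizations on, mainstream compilers typically fold each constant straight into the instruction stream):
#include <iostream>

enum : unsigned short { THE_VALUE = 42 };  // enum constant

#define THE_MACRO_VALUE 7                  // macro constant

consteval int the_other_value()            // computed entirely at compile time
{
    int r = 0;
    for (int i = 0; i < 10; ++i)
        r += i;
    return r;
}

int main()
{
    std::cout << THE_VALUE << ' ' << THE_MACRO_VALUE << ' '
              << the_other_value() << '\n'; // prints: 42 7 45
}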
If the value happens to be 0, it may not appear in the code: for example, a function returning 0 typically compiles to xor eax, eax, so the literal 0 doesn't appear in the machine code. But for all other values, the constant will appear in the machine code (at least in x86/x64 machine code).
While it's possible to obfuscate the machine code and hide constant numbers, no compiler supports this useless feature.

Is there a compiler hint for GCC to force branch prediction to always go a certain way?

For the Intel architectures, is there a way to instruct the GCC compiler to generate code that always forces branch prediction a particular way in my code? Does the Intel hardware even support this? What about other compilers or hardware?
I would use this in C++ code where I know the case I wish to run fast and do not care about the slow down when the other branch needs to be taken even when it has recently taken that branch.
for (;;) {
    if (normal) { // How to tell compiler to always branch predict true value?
        doSomethingNormal();
    } else {
        exceptionalCase();
    }
}
As a follow-on question for Evdzhan Mustafa: can the hint apply only to the first time the processor encounters the instruction, with all subsequent branch prediction functioning normally?
GCC supports the function __builtin_expect(long exp, long c) to provide this kind of feature; see the GCC documentation for details.
Here exp is the condition used and c is the expected value. For example, in your case you would want
if (__builtin_expect(normal, 1))
Because of the awkward syntax, this is usually wrapped by defining two custom macros like
#define likely(x) __builtin_expect (!!(x), 1)
#define unlikely(x) __builtin_expect (!!(x), 0)
just to ease the task.
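For instance, the asker's loop could then be written as:
for (;;) {
    if (likely(normal)) {
        doSomethingNormal();
    } else {
        exceptionalCase();
    }
}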
Mind that:
this is non-standard
a compiler/CPU branch predictor is likely more skilled than you at deciding such things, so this could be a premature micro-optimization
No, there is not. (At least on modern x86 processors.)
__builtin_expect mentioned in other answers influences the way gcc arranges the assembly code. It does not directly influence the CPU's branch predictor. Of course, there will be indirect effects on branch prediction caused by reordering the code. But on modern x86 processors there is no instruction that tells the CPU "assume this branch is/isn't taken".
See this question for more detail: Intel x86 0x2E/0x3E Prefix Branch Prediction actually used?
To be clear, __builtin_expect and/or the use of -fprofile-arcs can improve the performance of your code, both by giving hints to the branch predictor through code layout (see Performance optimisations of x86-64 assembly - Alignment and branch prediction), and also improving cache behaviour by keeping "unlikely" code away from "likely" code.
gcc has long __builtin_expect (long exp, long c) (emphasis mine):
You may use __builtin_expect to provide the compiler with branch
prediction information. In general, you should prefer to use actual
profile feedback for this (-fprofile-arcs), as programmers are
notoriously bad at predicting how their programs actually perform.
However, there are applications in which this data is hard to collect.
The return value is the value of exp, which should be an integral
expression. The semantics of the built-in are that it is expected that
exp == c. For example:
if (__builtin_expect (x, 0))
foo ();
indicates that we do not expect to call foo, since we expect x to be
zero. Since you are limited to integral expressions for exp, you
should use constructions such as
if (__builtin_expect (ptr != NULL, 1))
foo (*ptr);
when testing pointer or floating-point values.
As the documentation notes, you should prefer to use actual profile feedback, and this article shows a practical example of how, in their case at least, it ends up being an improvement over using __builtin_expect. Also see How to use profile guided optimizations in g++?.
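For reference, the profile-feedback workflow with GCC looks roughly like this (file and workload names are placeholders):
g++ -O2 -fprofile-generate program.cpp -o program
./program typical-workload    # the instrumented run writes .gcda profile data
g++ -O2 -fprofile-use program.cpp -o program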
We can also find a Linux Kernel Newbies article on the kernel macros likely() and unlikely(), which use this feature:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
Note the !! used in the macros; the explanation for it can be found in Why use !!(condition) instead of (condition)?.
Just because this technique is used in the Linux kernel does not mean it always makes sense to use it. We can see from this question I recently answered, difference between the function performance when passing parameter as compile time constant or variable, that many hand-rolled optimization techniques don't work in the general case. We need to profile code carefully to understand whether a technique is effective. Many old techniques may not even be relevant with modern compiler optimizations.
Note: although builtins are not portable, clang also supports __builtin_expect. Also, on some architectures it may not make a difference.
The correct way to define likely/unlikely macros in C++11 is the following:
#define LIKELY(condition) __builtin_expect(static_cast<bool>(condition), 1)
#define UNLIKELY(condition) __builtin_expect(static_cast<bool>(condition), 0)
This method is compatible with all C++ versions, unlike [[likely]], but relies on the non-standard extension __builtin_expect.
When the macro instead passes the condition straight through:
#define LIKELY(condition) __builtin_expect((condition), 1)
it may change the meaning of if statements and break the code. Consider the following code:
#include <iostream>

struct A
{
    explicit operator bool() const { return true; }
    operator int() const { return 0; }
};

#define LIKELY(condition) __builtin_expect((condition), 1)

int main() {
    A a;
    if(a)
        std::cout << "if(a) is true\n";
    if(LIKELY(a))
        std::cout << "if(LIKELY(a)) is true\n";
    else
        std::cout << "if(LIKELY(a)) is false\n";
}
And its output:
if(a) is true
if(LIKELY(a)) is false
As you can see, this definition of LIKELY breaks the semantics of if: a plain if(a) contextually converts a to bool, selecting the explicit operator bool(), while __builtin_expect((a), 1) reaches the built-in's long parameter through operator int() instead.
The point here is not that operator int() and operator bool() should agree (though keeping them consistent is good practice).
Rather, it's that passing the raw condition through to __builtin_expect loses the C++11 contextual conversion that if itself performs; static_cast<bool>(condition) restores it.
As the other answers have all adequately suggested, you can use __builtin_expect to give the compiler a hint about how to arrange the assembly code. As the official docs point out, in most cases, the assembler built into your brain will not be as good as the one crafted by the GCC team. It's always best to use actual profile data to optimize your code, rather than guessing.
Along similar lines, but not yet mentioned, is a GCC-specific way to force the compiler to generate code on a "cold" path. This involves the noinline and cold attributes, which do exactly what they sound like they do. These attributes can only be applied to functions, but with C++11 you can declare inline lambda functions and apply the attributes to them.
Although this still falls into the general category of a micro-optimization, and thus the standard advice applies (test, don't guess), I feel it is more generally useful than __builtin_expect. Hardly any generations of the x86 processor use branch-prediction hints, so the only thing you can affect anyway is the order of the assembly code. Since you know what is error-handling or "edge case" code, you can use this annotation to ensure that the compiler won't ever predict a branch to it and will keep it away from the "hot" code when optimizing for size.
Sample usage:
void FooTheBar(void* pFoo)
{
    if (pFoo == nullptr)
    {
        // Oh no! A null pointer is an error, but maybe this is a public-facing
        // function, so we have to be prepared for anything. Yet, we don't want
        // the error-handling code to fill up the instruction cache, so we will
        // force it out-of-line and onto a "cold" path.
        [&]() __attribute__((noinline,cold)) {
            HandleError(...);
        }();
    }

    // Do normal stuff
    ⋮
}
Even better, GCC will automatically ignore this in favor of profile feedback when it is available (e.g., when compiling with -fprofile-use).
See the official documentation here: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
As of C++20, the likely and unlikely attributes are standardized, and they are already supported in g++ 9. So, as discussed here, you can write
if (a > b) {
    /* code you expect to run often */
    [[likely]] /* last statement here */
}
E.g., in the following code the else block gets inlined thanks to the [[unlikely]] in the if block:
int oftendone( int a, int b );
int rarelydone( int a, int b );
int finaltrafo( int );

int divides( int number, int prime ) {
    int almostreturnvalue;
    if ( ( number % prime ) == 0 ) {
        auto k = rarelydone( number, prime );
        auto l = rarelydone( number, k );
        [[unlikely]] almostreturnvalue = rarelydone( k, l );
    } else {
        auto a = oftendone( number, prime );
        almostreturnvalue = oftendone( a, a );
    }
    return finaltrafo( almostreturnvalue );
}
godbolt link comparing the presence/absence of the attribute
__builtin_expect can be used to tell the compiler which way you expect a branch to go. This can influence how the code is generated. Typical processors run code faster sequentially. So if you write
if (__builtin_expect (x == 0, 0)) ++count;
if (__builtin_expect (y == 0, 0)) ++count;
if (__builtin_expect (z == 0, 0)) ++count;
the compiler will generate code like
        if (x == 0) goto if1;
back1:  if (y == 0) goto if2;
back2:  if (z == 0) goto if3;
back3:  ;
        ...
if1:    ++count; goto back1;
if2:    ++count; goto back2;
if3:    ++count; goto back3;
If your hint is correct, this will execute the code without any branches actually performed. It will run faster than the normal sequence, where each if statement would branch around the conditional code and would execute three branches.
Some x86 processors have prefixes for branch instructions to hint that the branch is expected to be taken or not taken (I'm not sure about the details), and it's unclear whether modern processors actually honor them. It's not very useful anyway, because branch prediction will handle this just fine. So I don't think you can actually influence the branch prediction.
With regards to the OP: no, there is no way in GCC to tell the processor to always assume the branch is or isn't taken. What you have is __builtin_expect, which does what others say it does. Furthermore, I don't think you want to tell the processor whether the branch is always taken or not. Today's processors, such as the Intel architectures, can recognize fairly complex patterns and adapt effectively.
However, there are times you want to assume control of whether a branch is predicted taken or not by default: when you know the code will be called "cold" with respect to branching statistics.
One concrete example: exception-management code. By definition, the management code will run exceptionally, but perhaps when it does, maximum performance is desired (there may be a critical error to take care of as soon as possible), hence you may want to control the default prediction.
Another example: you may classify your input and jump into the code that handles the result of your classification. If there are many classifications, the processor may collect statistics but lose them, because the same classification does not happen soon enough and the prediction resources are devoted to recently called code. I wish there were a primitive to tell the processor "please do not devote prediction resources to this code", the way you can sometimes say "do not cache this".

Does const use more or less memory than #define typically?

I understand how each works, but I was curious if one or the other actually is more efficient memory-wise. #define seems to be used all the time in the embedded C world, but I am wondering if it is actually justified over a const most of the time.
If one is more efficient than the other, does anyone have some way to test and show this?
Let's put #define aside, because it doesn't really exist in your program. The preprocessor takes your macros and expands them before the compiler can even spot that they were ever there.
The following source:
#define X 42
printf("%d", X);
is actually the following program:
printf("%d", 42);
So what you're asking is whether that takes more or less memory than:
const int x = 42;
printf("%d", x);
And this is a question we can't fully answer in general.
On the one hand, the value 42 needs to live in your program somewhere, otherwise the computer executing it won't know what to do.
On the other hand, it can either live hardcoded in your program, having been optimised out, or it can be installed into memory at runtime and then pulled out again.
Either way, it takes sizeof(int) bytes (commonly 32 bits), and it doesn't really matter how you introduced it into your program.
Any further analysis depends on precisely what you are doing with the value.
It depends on whether you are taking the address of the constant or not. If you do not take the address of the constant, then the compiler has no problem folding it into other computations and emitting it inline (as an immediate or literal), just like the #defined version. However, if you write:
const int c = 42;
const int *pc = &c;
then a copy of c must live in the global .rodata section in order to have its address taken, adding sizeof(int) bytes of Flash space atop any copies the compiler decided to inline; however, the compiler might be able to fetch that constant from memory more cheaply than it can incorporate it as an inline literal, depending on its value and what CPU you're compiling for.
Try compiling some code each way and looking at the resulting assembler listings...
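A minimal sketch of that experiment (names are illustrative); compile it with gcc -O2 -S and inspect the listing:
/* version 1: macro */
#define ANSWER 42
int get_answer_macro(void) { return ANSWER; }

/* version 2: const object */
static const int answer = 42;
int get_answer_const(void) { return answer; }

At -O2, both functions typically compile to the same move of an immediate; add a function that returns &answer to watch the constant get forced into .rodata, as described above.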

Compilation and Code Optimization

I will state my problem in a very simplified form, which is:
If I type in C
void main(){
    int a=3+2;
    double b=7/2;
}
When will a and b be assigned their values of 5 and 3.5? Is it when I compile my code, or when I run it?
In other words: what happens when I press compile, and how is it different from what happens when I press run, in terms of assigning the values and doing the computations? And how is that different from writing my code as:
void main(){
    int a=5;
    double b=3.5;
}
I am asking this because I have heard about compiler optimization but it is not really my area.
Any comments or reviews will be highly appreciated.
Thank you.
Since you are asking about "code optimization" - a good optimizing compiler will optimize this code down to void main(){}. a and b will be completely eliminated.
Also, 7/2 == 3, not 3.5
Compiling translates the high-level language into a lower-level one, such as assembly. A good compiler may optimize, and this can be customized (for example with the -O2 option).
Regarding your code, double b=7/2; will yield 3.0 instead of 3.5, because you are doing an integer division. If you want 3.5, write double b=7.0/2.0;. This is a quite common mistake.
What will happen when I press compile?
Nobody knows. The compiler may optimize it to a constant, or it may not. It probably will, but it isn't required to.
You generally shouldn't worry or really even think about compiler optimization, unless you're in a position that absolutely needs it, which very few developers are. The compiler can usually do a better job than you can.
It's compiler-dependent; a good one will do constant folding and/or dead code elimination.
I don't know anything about optimization either, but I decided to give this a shot. Using gcc -c -S test.c, I got the assembly for the function. Here's what the line int a = 3 + 2 comes out as:
movl $5, -4(%rbp)
So for me, it's converting the value (3+2) to 5 at compile time, but it depends on the compiler and platform and whatever flags you pass it.
(Also, I made the function return a just so that it wouldn't optimize the code out entirely.)

Force compiler to not optimize side-effect-less statements

I was reading some old game programming books and, as some of you might know, back in the day it was usually faster to do bit hacks than to do things the standard way (converting a float to int, masking the sign bit, and converting back for absolute value, instead of just calling fabs(), for example).
Nowadays it's almost always better to just use the standard library math functions, since these tiny things are hardly the cause of most bottlenecks anyway.
But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:
void float_to_int(float f)
{
    int i = static_cast<int>(f); // has no side-effects
}
Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.
The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.
I'm using Visual Studio 2008 / G++ (3.4.4).
Edit
To clarify, I would like to have all optimizations maxed out, to get good profile results. The problem is that with this the statements with no side-effect will be optimized out, hence the situation.
Edit Again
To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.
We all know that the old tricks aren't very useful anymore; I'm merely curious how not-useful they are. Just plain curiosity. Sure, life could go on without me knowing how these old hacks perform against modern-day CPUs, but it never hurts to know.
So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.
Premature quoting of Knuth is the root of all annoyance.
Assignment to a volatile variable should never be optimized away, so this might give you the result you want:
static volatile int i = 0;

void float_to_int(float f)
{
    i = static_cast<int>(f); // the volatile store is a side effect the compiler must keep
}
So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements
You are by definition skewing the results.
Here's how to fix the problem of trying to profile "dummy" code that you wrote just to test: For profiling, save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get any other optimizations it can put in to make the code fast.
In this case I suggest you make the function return the integer value:
int float_to_int(float f)
{
    return static_cast<int>(f);
}
Your calling code can then exercise it with a printf to guarantee it won't optimize it out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.
extern int float_to_int(float f);

int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
    sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);
Now compare this to an empty function like:
int take_float_return_int(float /* f */)
{
    return 1;
}
Which should also be external.
The difference in times should give you an idea of the expense of what you're trying to measure.
What has always worked on all compilers I've used so far:
extern volatile int writeMe = 0;

void float_to_int(float f)
{
    writeMe = static_cast<int>(f);
}
Note that this skews the results; both methods should write to writeMe.
volatile tells the compiler "the value may be accessed without your notice", so the compiler cannot omit the calculation and drop the result. To block propagation of input constants, you might need to run them through an extern volatile, too:
extern volatile float readMe = 0;
extern volatile int writeMe = 0;

void float_to_int(float f)
{
    writeMe = static_cast<int>(f);
}

int main()
{
    readMe = 17;
    float_to_int(readMe);
}
Still, all optimizations in between the read and the write can be applied "with full force". The read and write to the global variable are often good "fenceposts" when inspecting the generated assembly.
Without the extern, the compiler may notice that a reference to the variable is never taken elsewhere and optimize accordingly despite the volatile. Technically, with link-time code generation even that might not be enough, but I haven't found a compiler that aggressive. (For a compiler that does remove the access, the reference would need to be passed to a function in a DLL loaded at runtime.)
Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, as long as the code behaves as if no optimisation takes place. However, you can often trick them into not doing so if you indicate that the value might be used later, so I would change your code to:
int float_to_int(float f)
{
    return static_cast<int>(f); // has no side-effects
}
As others have suggested, you will need to examine the assembler output to check that this approach actually works.
You just need to skip to the part where you learn something and read the published Intel CPU optimisation manual.
These quite clearly state that casting between float and int is a really bad idea because it requires a store from the int register to memory followed by a load into a float register. These operations cause a bubble in the pipeline and waste many precious cycles.
A function call incurs quite a bit of overhead, so I would remove this anyway.
Adding a dummy += i; is no problem, as long as you keep that same bit of code in the alternate profile too (i.e. the code you are comparing it against).
Last but not least: generate the asm code. Even if you can't code in asm, the generated code is typically understandable, since it will have labels and the C code commented alongside it. So you know (sort of) what happens, and which bits are kept.
R
p.s. found this too:
inline float pslNegFabs32f(float x){
    __asm{
        fld x    // Push 'x' into st(0) of FPU stack
        fabs
        fchs     // change sign
        fstp x   // Pop from st(0) of FPU stack
    }
    return x;
}
supposedly also very fast. You might want to profile this too. (although it is hardly portable code)
Return the value?
int float_to_int(float f)
{
    return static_cast<int>(f); // has no side-effects
}
and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.
You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.
If you are using Microsoft's compiler (cl.exe), you can use the following pragma to turn optimization on/off at a per-function level:
#pragma optimize("", { on | off })
Turn optimizations off for functions defined after the current line:
#pragma optimize("" ,off)
Turn optimizations back on:
#pragma optimize("" ,on)
For example, compile two functions with the /O2 flag set, so the code would normally get optimized, but turn optimizations off for the first function, square(), and back on before square2() is defined. As sketched below, the amount of assembly generated for the first function is higher; in the second, no assembly at all is emitted for the int i = num; statement.
Thus, while the first function is not optimized, the second one is.
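The screenshot from the answer isn't reproduced here; the following sketch is consistent with its description (the function bodies are assumptions, not the author's exact code):
#pragma optimize("", off)
int square(int num) {
    int i = num;   // with optimization off, this store is actually emitted
    return i * i;
}
#pragma optimize("", on)

int square2(int num) {
    int i = num;   // at /O2 the compiler elides this local entirely
    return i * i;
}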
See https://godbolt.org/z/qJTBHg for link to this code on compiler explorer.
A similar directive exists for gcc too - https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html
A micro-benchmark around this statement will not be representative of using this approach in a genuine scenario; the surrounding instructions and their effect on the pipeline and cache are generally as important as the statement itself.
GCC 4 does a lot of micro-optimizations that GCC 3.4 never did. GCC 4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin(), fabs(), etc., as well as optimizing such calls to their FPU, SSE or 3DNow! equivalents.
I know the Intel compiler is also extremely good at these kinds of optimizations.
My suggestion is not to worry about micro-optimizations like this; on relatively new hardware (anything built in the last five or six years), they're almost completely moot.
Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast-and-bit-mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus" are completely moot, and as pointed out in another answer, could potentially be slower than the FPU and SSE instructions.
All of this is because newer CPUs are pipelined: instructions are decoded and dispatched to fast computation units, their cost is no longer a fixed number of clock cycles, and performance is more sensitive to cache misses and inter-instruction dependencies.
Check the AMD and Intel processor programming manuals for all the gritty details.