Why does this constexpr code cause GCC to eat all my RAM? - c++

The following program will call fun 2 ^ (MAXD + 1) times. The maximum recursion depth should never go above MAXD, though (if my thinking is correct). Thus it may take some time to compile, but it should not eat my RAM.
#include<iostream>
const int MAXD = 20;
constexpr int fun(int x, int depth=0){
return depth == MAXD ? x : fun(fun(x + 1, depth + 1) + 1, depth + 1);
}
int main(){
constexpr int i = fun(1);
std::cout << i << std::endl;
}
The problem is that eating my RAM is exactly what it does. When I turn MAXD up to 30, my laptop starts to swap after GCC 4.7.2 quickly allocates 3 gb or so. I have not yet tried it with clang 3.1, as I don't have access to it right now.
My only guess is that this has something to do with GCC trying to be too clever and memoize the function calls, like it does with templates. If this is so, does it not seem strange that they don't have a limit on how much memoization they do, like the size of a MRU cache table or something? I have not found a switch to disable it.
Why would I do this?
I am toying with the idea of making an advanced compile time library, like genetic programming or something. Since the compilers do not have compile time tail call optimization, I am worried that anything that loops will need recursion and (even if I turn up the maximum recursion depth parameter, which seems slightly ugly to require) will quickly allocate all my RAM and fill it with pointless stack frames. Thus I came up with the above solution for getting arbitrarily many function calls without a deep stack. Such a function could be used for folding/looping or trampolining.
EDIT:
Now I have tried it in clang 3.1, and it will not leak memory at all, no matter how long I make it work (i.e how high I make MAXD). CPU usage is almost 100% and memory usage is almost 0%, just like expected. Perhaps this is just a bug in GCC then.

This may not be the definitive document regarding constexpr, but it's the primary doc linked to from the gcc constexpr wiki.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2235.pdf
... and it says...
We (still) prohibit recursion in all its form in constant expressions.
That is not strictly necessary because an implementation limit on
recursion depth in constant expression evaluation would save us from
the possibility of the compiler recursing forever. However, until we
see a convincing use case for recursion, we don’t propose to allow it.
So, I expect you're bumping up against language boundary and the way that gcc has chosen to implement constexpr (maybe attempting to generate the entire function inline, then evaluating/executing it)

Your answer is in your comment "by running the function runtime and observing that while I can make it run for a long time", which is caused by your inner most recursive call to fun(x + 1, depth + 1).
When you changed it to a runtime function rather than a compile time function by removing constexpr and observed that it ran a long time that's an indicator that it is recursing very deeply.
When the function is executed by the compiler it has to recurse deeply, but doesn't use the stack for recursion since it isn't actually generating and executing machine code.

Related

How much do C/C++ compilers optimize conditional statements?

I recently ran into a situation where I wrote the following code:
for(int i = 0; i < (size - 1); i++)
{
// do whatever
}
// Assume 'size' will be constant during the duration of the for loop
When looking at this code, it made me wonder how exactly the for loop condition is evaluated for each loop. Specifically, I'm curious as to whether or not the compiler would 'optimize away' any additional arithmetic that has to be done for each loop. In my case, would this code get compiled such that (size - 1) would have to be evaluated for every loop iteration? Or is the compiler smart enough to realize that the 'size' variable won't change, thus it could precalculate it for each loop iteration.
This then got me thinking about the general case where you have a conditional statement that may specify more operations than necessary.
As an example, how would the following two pieces of code compile:
if(6)
if(1+1+1+1+1+1)
int foo = 1;
if(foo + foo + foo + foo + foo + foo)
How smart is the compiler? Will the 3 cases listed above be converted into the same machine code?
And while I'm at, why not list another example. What does the compiler do if you are doing an operation within a conditional that won't have any effect on the end result? Example:
if(2*(val))
// Assume val is an int that can take on any value
In this example, the multiplication is completely unnecessary. While this case seems a lot stupider than my original case, the question still stands: will the compiler be able to remove this unnecessary multiplication?
Question:
How much optimization is involved with conditional statements?
Does it vary based on compiler?
Short answer: the compiler is exceptionally clever, and will generally optimise those cases that you have presented (including utterly ignoring irrelevant conditions).
One of the biggest hurdles language newcomers face in terms of truly understanding C++, is that there is not a one-to-one relationship between their code and what the computer executes. The entire purpose of the language is to create an abstraction. You are defining the program's semantics, but the computer has no responsibility to actually follow your C++ code line by line; indeed, if it did so, it would be abhorrently slow as compared to the speed we can expect from modern computers.
Generally speaking, unless you have a reason to micro-optimise (game developers come to mind), it is best to almost completely ignore this facet of programming, and trust your compiler. Write a program that takes the inputs you want, and gives the outputs you want, after performing the calculations you want… and let your compiler do the hard work of figuring out how the physical machine is going to make all that happen.
Are there exceptions? Certainly. Sometimes your requirements are so specific that you do know better than the compiler, and you end up optimising. You generally do this after profiling and determining what your bottlenecks are. And there's also no excuse to write deliberately silly code. After all, if you go out of your way to ask your program to copy a 50MB vector, then it's going to copy a 50MB vector.
But, assuming sensible code that means what it looks like, you really shouldn't spend too much time worrying about this. Because modern compilers are so good at optimising, that you'd be a fool to try to keep up.
The C++ language specification permits the compiler to make any optimization that results in no observable changes to the expected results.
If the compiler can determine that size is constant and will not change during execution, it can certainly make that particular optimization.
Alternatively, if the compiler can also determine that i is not used in the loop (and its value is not used afterwards), that it is used only as a counter, it might very well rewrite the loop to:
for(int i = 1; i < size; i++)
because that might produce smaller code. Even if this i is used in some fashion, the compiler can still make this change and then adjust all other usage of i so that the observable results are still the same.
To summarize: anything goes. The compiler may or may not make any optimization change as long as the observable results are the same.
Yes, there is a lot of optimization, and it is very complex.
It varies based on the compiler, and it also varies based on the compiler options
Check
https://meta.stackexchange.com/questions/25840/can-we-stop-recommending-the-dragon-book-please
for some book recomendations if you really want to understand what a compiler may do. It is a very complex subject.
You can also compile to assembly with the -S option (gcc / g++) to see what the compiler is really doing. Use -O3 / ... / -O0 / -O to experiment with different optimization levels.

C++ array to Halide Image (and back)

I'm getting started with Halide, and whilst I've grasped the basic tenets of its design, I'm struggling with the particulars (read: magic) required to efficiently schedule computations.
I've posted below a MWE of using Halide to copy an array from one location to another. I had assumed this would compile down to only a handful of instructions and take less than a microsecond to run. Instead, it produces 4000 lines of assembly and takes 40ms to run! Clearly, therefore, I have a significant hole in my understanding.
What is the canonical way of wrapping an existing array in a Halide::Image?
How should the function copy be scheduled to perform the copy efficiently?
Minimal working example
#include <Halide.h>
using namespace Halide;
void _copy(uint8_t* in_ptr, uint8_t* out_ptr, const int M, const int N) {
Image<uint8_t> in(Buffer(UInt(8), N, M, 0, 0, in_ptr));
Image<uint8_t> out(Buffer(UInt(8), N, M, 0, 0, out_ptr));
Var x,y;
Func copy;
copy(x,y) = in(x,y);
copy.realize(out);
}
int main(void) {
uint8_t in[10000], out[10000];
_copy(in, out, 100, 100);
}
Compilation Flags
clang++ -O3 -march=native -std=c++11 -Iinclude -Lbin -lHalide copy.cpp
Let me start with your second question: _copy takes a long time, because it needs to compile Halide code to x86 machine code. IIRC, Func caches the machine code, but since copy is local to _copy that cache cannot be reused. Anyways, scheduling copy is pretty simple because it's a pointwise operation: First, it would probably make sense to vectorize it. Second, it might make sense to parallelize it (depending on how much data there is). For example:
copy.vectorize(x, 32).parallel(y);
will vectorize along x with a vector size of 32 and parallelize along y. (I am making this up from memory, there might be some confusion about the correct names.) Of course, doing all this might also increase compile times...
There is no recipe for good scheduling. I do it by looking at the output of compile_to_lowered_stmt and profiling the code. I also use the AOT compilation provided by Halide::Generator, this makes sure that I only measure the runtime of the code and not the compile time.
Your other question was, how to wrap an existing array in a Halide::Image. I don't do that, mostly because I use AOT compilation. However, internally Halide uses a type called buffer_t for everything image related. There is also C++ wrapper called Halide::Buffer that makes using buffer_t a little easier, I think it can also be used in Func::realize instead of Halide::Image. The point is: If you understand buffer_t you can wrap almost everything into something digestible by Halide.
To emphasize the first thing Florian mentioned, which I think is the key point of misunderstanding here: you appear to be timing the compilation of the copy operation ("pipeline," in common Halide terms), not just its execution. Your code size estimate is presumably also for the whole binary resulting from copy.cpp, not just the code in the Halide-generated copy function (which won't actually even appear in the binary you're compiling with clang, since it is only constructed by JITing at runtime in this program).
You can observe the actual cost of your pipeline here by first calling copy.compile_jit() before realize (realize implicitly calls compile_jit the first time it is run, so it's not necessary, but it's valuable to factor apart the runtime from the compile overhead). You would then put your timer exclusively around realize.
If you actually want to pre-compile this (or any other) pipeline for static linking into your ultimate program, which is what it seems you might be expecting, what you really want to do is use Func::compile_to_file in one program to compile and emit the code (as copy.h and copy.o), and then link and call these in another program. Check out tutorial lesson 10 to see this in more detail:
https://github.com/halide/Halide/blob/master/tutorial/lesson_10_aot_compilation_generate.cpp https://github.com/halide/Halide/blob/master/tutorial/lesson_10_aot_compilation_run.cpp

Maximum recursive function calls in C/C++ before stack is full and gives a segmentation fault?

I was doing a question where I used a recursive function to create a segment tree. For larger values it started giving segmentation fault. So I thought before it might be because of array index value out of bound but later I thought it might be because of program stack going too big.
I wrote this code to count what is the maximum number of recursive calls allowed before the system give seg-fault.
#include<iostream>
using namespace std;
void recur(long long int);
int main()
{
recur(0);
return 0;
}
void recur(long long int v)
{
v++;
cout<<v<<endl;
recur(v);
}
After running the above code I got value of v to be 261926 and 261893 and 261816 before getting segmentation fault and all values were close to these.
Now I know that this would depend on machine to machine, and the size of the stack of the function being called but can someone explain the basics of how to keep safe from seg-faults and what is a soft limit that one can keep in mind.
The number of recursion levels you can do depends on the call-stack size combined with the size of local variables and arguments that are placed on such a stack. Aside from "how the code is written", just like many other memory related things, this is very much dependent on the system you're running on, what compiler you are using, optimisation level [1], and so on. Some embedded systems I've worked on, the stack would be a few hundred bytes, my first home computer had 256 bytes of stack, where modern desktops have megabytes of stack (and you can adjust it, but eventually you will run out)
Doing recursion at unlimited depth is not a good idea, and you should look at changing your code to so that "it doesn't do that". You need to understand the algorithm and understand to what depth it will recurse, and whether that is acceptable in your system. There is unfortunately nothing anyone can do at the time stack runs out (at best your program crashes, at worst it doesn't, but instead causes something ELSE to go wrong, such as the stack or heap of some other application gets messed up!)
On a desktop machine, I'd think it's acceptable to have a recursion depth of a hew hundred to some thousands, but not much more than this - and that is if you have small usage of stack in each call - if each call is using up kilobytes of stack, you should limit the call level even further, or reduce the need for stack-space.
If you need to have more recursion depth than that, you need to re-arrange the code - for example using a software stack to store the state, and a loop in the code itself.
[1] Using g++ -O2 on your posted code, I got to 50 million and counting, and I expect if I leave it long enough, it will restart at zero because it keeps going forever - this since g++ detects that this recursion can be converted into a loop, and does that. Same program compiled with -O0 or -O1 does indeed stop at a little over 200000. With clang++ -O1 it just keeps going. The clang-compiled code is still running as I finished writing the rest of the code, at 185 million "recursions".
There is (AFAIK) no well established limit. (I am answering from a Linux desktop point of view).
On desktops, laptops the default stack size is a few megabytes in 2015. On Linux you could use setrlimit(2) to change it (to a reasonable figure, don't expect to be able to set it to a gigabyte these days) - and you could use getrlimit(2) or parse /proc/self/limits (see proc(5)) to query it . On embedded microcontrollers - or inside the Linux kernel- , the entire stack may be much more limited (to a few kilobytes in total).
When you create a thread using pthread_create(3) you could use an explicit pthread_attr_t and use pthread_attr_setstack(3) to set the stack space.
BTW, with recent GCC, you might compile all your software (including the standard C library) with split stacks (so pass -fsplit-stack to gcc or g++)
At last your example is a tail call, and GCC could optimize that (into a jump with arguments). I checked that if you compile with g++ -O2 (using GCC 4.9.2 on Linux/x86-64/Debian) the recursion would be transformed into a genuine loop and no stack allocation would grow indefinitely (your program run for nearly 40 millions calls to recur in a minute, then I interrupted it) In better languages like Scheme or Ocaml there is a guarantee that tail calls are indeed compiled iteratively (then the tail recursive call becomes the usually -or even the only- looping construct).
CyberSpok is excessive in his comment (hinting to avoid recursions). Recursions are very useful, but you should limit them to a reasonable depth (e.g. a few thousands), and you should take care that call frames on the call stack are small (less than a kilobyte each), so practically allocate and deallocate most of the data in the C heap. The GCC -fstack-usage options is really useful for reporting stack usage of every compiled function. See this and that answers.
Notice that continuation passing style is a canonical way to transform recursions into iterations (then you trade stack frames with dynamically allocated closures).
Some clever algorithms replace a recursion with fancy modifying iterations, e.g. the Deutche-Shorr-Waite graph marking algorithm.
For Linux based applications, we can use getrlimit and setrlimit API's to know various kernel resource limits, like size of core file, cpu time, stack size, nice values, max. no. of processes etc. 'RLIMIT_STACK' is the resource name for stack defined in linux kernel. Below is simple program to retrieve stack size :
#include <iostream>
#include <sys/time.h>
#include <sys/resource.h>
#include <errno.h>
using namespace std;
int main()
{
struct rlimit sl;
int returnVal = getrlimit(RLIMIT_STACK, &sl);
if (returnVal == -1)
{
cout << "Error. errno: " << errno << endl;
}
else if (returnVal == 0)
{
cout << "stackLimit soft - max : " << sl.rlim_cur << " - " << sl.rlim_max << endl;
}
}

Force compiler to not optimize side-effect-less statements

I was reading some old game programming books and as some of you might know, back in that day it was usually faster to do bit hacks than do things the standard way. (Converting float to int, mask sign bit, convert back for absolute value, instead of just calling fabs(), for example)
Nowadays is almost always better to just use the standard library math functions, since these tiny things are hardly the cause of most bottlenecks anyway.
But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:
void float_to_int(float f)
{
int i = static_cast<int>(f); // has no side-effects
}
Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.
The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.
I'm using Visual Studio 2008 / G++ (3.4.4).
Edit
To clarify, I would like to have all optimizations maxed out, to get good profile results. The problem is that with this the statements with no side-effect will be optimized out, hence the situation.
Edit Again
To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.
We all know that the old tricks aren't very useful anymore, I'm merely curious how not useful they are. Just plain curiosity. Sure, life could go on without me knowing just how these old hacks perform against modern day CPU's, but it never hurts to know.
So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.
Premature quoting of Knuth is the root of all annoyance.
Assignment to a volatile variable shold never be optimized away, so this might give you the result you want:
static volatile int i = 0;
void float_to_int(float f)
{
i = static_cast<int>(f); // has no side-effects
}
So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements
You are by definition skewing the results.
Here's how to fix the problem of trying to profile "dummy" code that you wrote just to test: For profiling, save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get any other optimizations it can put in to make the code fast.
In this case I suggest you make the function return the integer value:
int float_to_int(float f)
{
return static_cast<int>(f);
}
Your calling code can then exercise it with a printf to guarantee it won't optimize it out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.
extern int float_to_int(float f)
int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);
Now compare this to an empty function like:
int take_float_return_int(float /* f */)
{
return 1;
}
Which should also be external.
The difference in times should give you an idea of the expense of what you're trying to measure.
What always worked on all compilers I used so far:
extern volatile int writeMe = 0;
void float_to_int(float f)
{
writeMe = static_cast<int>(f);
}
note that this skews results, boith methods should write to writeMe.
volatile tells the compiler "the value may be accessed without your notice", thus the compiler cannot omit the calculation and drop the result. To block propagiation of input constants, you might need to run them through an extern volatile, too:
extern volatile float readMe = 0;
extern volatile int writeMe = 0;
void float_to_int(float f)
{
writeMe = static_cast<int>(f);
}
int main()
{
readMe = 17;
float_to_int(readMe);
}
Still, all optimizations inbetween the read and the write can be applied "with full force". The read and write to the global variable are often good "fenceposts" when inspecting the generated assembly.
Without the extern the compiler may notice that a reference to the variable is never taken, and thus determine it can't be volatile. Technically, with Link Time Code Generation, it might not be enough, but I haven't found a compiler that agressive. (For a compiler that indeed removes the access, the reference would need to be passed to a function in a DLL loaded at runtime)
Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, if the code behaves as if no optimisation takes place. However, you can often trick them into not doing so if you indicate that value might be used later, so I would change your code to:
int float_to_int(float f)
{
return static_cast<int>(f); // has no side-effects
}
As others have suggested, you will need to examine the assemnler output to check that this approach actually works.
You just need to skip to the part where you learn something and read the published Intel CPU optimisation manual.
These quite clearly state that casting between float and int is a really bad idea because it requires a store from the int register to memory followed by a load into a float register. These operations cause a bubble in the pipeline and waste many precious cycles.
a function call incurs quite a bit of overhead, so I would remove this anyway.
adding a dummy += i; is no problem, as long as you keep this same bit of code in the alternate profile too. (So the code you are comparing it against).
Last but not least: generate asm code. Even if you can not code in asm, the generated code is typically understandable since it will have labels and commented C code behind it. So you know (sortoff) what happens, and which bits are kept.
R
p.s. found this too:
inline float pslNegFabs32f(float x){
__asm{
fld x //Push 'x' into st(0) of FPU stack
fabs
fchs //change sign
fstp x //Pop from st(0) of FPU stack
}
return x;
}
supposedly also very fast. You might want to profile this too. (although it is hardly portable code)
Return the value?
int float_to_int(float f)
{
return static_cast<int>(f); // has no side-effects
}
and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.
You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.
If you are using Microsoft's compiler - cl.exe, you can use the following statement to turn optimization on/off on a per-function level [link to doc].
#pragma optimize("" ,{ on |off })
Turn optimizations off for functions defined after the current line:
#pragma optimize("" ,off)
Turn optimizations back on:
#pragma optimize("" ,on)
For example, in the following image, you can notice 3 things.
Compiler optimizations flag is set - /O2, so code will get optimized.
Optimizations are turned off for first function - square(), and turned back on before square2() is defined.
Amount of assembly code generated for 1st function is higher. In second function there is no assembly code generated for int i = num; statement in code.
Thus while 1st function is not optimized, the second function is.
See https://godbolt.org/z/qJTBHg for link to this code on compiler explorer.
A similar directive exists for gcc too - https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html
A micro-benchmark around this statement will not be representative of using this approach in a genuine scenerio; the surrounding instructions and their affect on the pipeline and cache are generally as important as any given statement in itself.
GCC 4 does a lot of micro-optimizations now, that GCC 3.4 has never done. GCC4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin(), fabs(), etc., as well as optimizing such calls to their FPU, SSE or 3D Now! equivalents.
I know the Intel compiler is also extremely good at these kinds of optimizations.
My suggestion is to not worry about micro-optimizations like this - on relatively new hardware (anything built in the last 5 or 6 years), they're almost completely moot.
Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast to int and bit mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus," are completely moot, and as pointed out in another answer, could potentially be slower than instructions on the FPU and in SSE.
All of this is due to the fact that newer CPUs are pipelined - instructions are decoded and dispatched to fast computation units. Instructions no longer run in terms of clock cycles, and are more sensitive to cache misses and inter-instruction dependencies.
Check the AMD and Intel processor programming manuals for all the gritty details.

How to correctly benchmark a [templated] C++ program

< backgound>
I'm at a point where I really need to optimize C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but I then used virtual functions called in nested loops. I had bad feelings about that and the first implementation proved that this was a bad idea. However this was OK for testing the concept.
< /background>
Now I need this feature to be as fast as possible (well without assembly code or GPU calculation, this still has to be C++ and more readable than less).
Now I know a little bit more about templates and class policies (from Alexandrescu's excellent book) and I think that a compile-time code generation may be the solution.
However I need to test the design before doing the huge work of implementing it into the library. The question is about the best way to test the efficiency of this new feature.
Obviously I need to turn optimizations on because without this g++ (and probably other compilers as well) would keep some unnecessary operations in the object code. I also need to make a heavy use of the new feature in the benchmark because a delta of 1e-3 second can make the difference between a good and a bad design (this feature will be called million times in the real program).
The problem is that g++ is sometimes "too smart" while optimizing and can remove a whole loop if it consider that the result of a calculation is never used. I've already seen that once when looking at the output assembly code.
If I add some printing to stdout, the compiler will then be forced to do the calculation in the loop but I will probably mostly benchmark the iostream implementation.
So how can I do a correct benchmark of a little feature extracted from a library ?
Related question: is it a correct approach to do this kind of in vitro tests on a small unit or do I need the whole context ?
Thanks for advices !
There seem to be several strategies, from compiler-specific options allowing fine tuning to more general solutions that should work with every compiler like volatile or extern.
I think I will try all of these.
Thanks a lot for all your answers!
If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.
template<typename T> void sink(T const& t) {
volatile T sinkhole = t;
}
No iostream overhead, just a copy that has to remain in the generated code.
Now, if you're collecting results from a lot of operations, it's best not to discard them one by one. These copies can still add some overhead. Instead, somehow collect all results in a single non-volatile object (so all individual results are needed) and then assign that result object to a volatile. E.g. if your individual operations all produce strings, you can force evaluation by adding all char values together modulo 1<<32. This adds hardly any overhead; the strings will likely be in cache. The result of the addition will subsequently be assigned-to-volatile so each char in each sting must in fact be calculated, no shortcuts allowed.
Unless you have a really aggressive compiler (can happen), I'd suggest calculating a checksum (simply add all the results together) and output the checksum.
Other than that, you might want to look at the generated assembly code before running any benchmarks so you can visually verify that any loops are actually being run.
Compilers are only allowed to eliminate code-branches that can not happen. As long as it cannot rule out that a branch should be executed, it will not eliminate it. As long as there is some data dependency somewhere, the code will be there and will be run. Compilers are not too smart about estimating which aspects of a program will not be run and don't try to, because that's a NP problem and hardly computable. They have some simple checks such as for if (0), but that's about it.
My humble opinion is that you were possibly hit by some other problem earlier on, such as the way C/C++ evaluates boolean expressions.
But anyways, since this is about a test of speed, you can check that things get called for yourself - run it once without, then another time with a test of return values. Or a static variable being incremented. At the end of the test, print out the number generated. The results will be equal.
To answer your question about in-vitro testing: Yes, do that. If your app is so time-critical, do that. On the other hand, your description hints at a different problem: if your deltas are in a timeframe of 1e-3 seconds, then that sounds like a problem of computational complexity, since the method in question must be called very, very often (for few runs, 1e-3 seconds is neglectible).
The problem domain you are modeling sounds VERY complex and the datasets are probably huge. Such things are always an interesting effort. Make sure that you absolutely have the right data structures and algorithms first, though, and micro-optimize all you want after that. So, I'd say look at the whole context first. ;-)
Out of curiosity, what is the problem you are calculating?
You have a lot of control on the optimizations for your compilation. -O1, -O2, and so on are just aliases for a bunch of switches.
From the man pages
-O2 turns on all optimization flags specified by -O. It also turns
on the following optimization flags: -fthread-jumps -falign-func‐
tions -falign-jumps -falign-loops -falign-labels -fcaller-saves
-fcrossjumping -fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks -fexpensive-optimizations -fgcse
-fgcse-lm -foptimize-sibling-calls -fpeephole2 -fregmove -fre‐
order-blocks -freorder-functions -frerun-cse-after-loop
-fsched-interblock -fsched-spec -fschedule-insns -fsched‐
ule-insns2 -fstrict-aliasing -fstrict-overflow -ftree-pre
-ftree-vrp
You can tweak and use this command to help you narrow down which options to investigate.
...
Alternatively you can discover which binary optimizations are
enabled by -O3 by using:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts Φ grep enabled
Once you find the culpret optimization you shouldn't need the cout's.
If this is possible for you, you might try splitting your code into:
the library you want to test compiled with all optimizations turned on
a test program, dinamically linking the library, with optimizations turned off
Otherwise, you might specify a different optimization level (it looks like you're using gcc...) for the test functio n with the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes).
You could create a dummy function in a separate cpp file that does nothing, but takes as argument whatever is the type of your calculation result. Then you can call that function with the results of your calculation, forcing gcc to generate the intermediate code, and the only penalty is the cost of invoking a function (which shouldn't skew your results unless you call it a lot!).
#include <iostream>
// Mark coords as extern.
// Compiler is now NOT allowed to optimise away coords
// This it can not remove the loop where you initialise it.
// This is because the code could be used by another compilation unit
extern double coords[500][3];
double coords[500][3];
int main()
{
//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
std::cout << "hello world !"<< std::endl;
return 0;
}
edit: the easiest thing you can do is simply use the data in some spurious way after the function has run and outside your benchmarks. Like,
StartBenchmarking(); // ie, read a performance counter
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
StopBenchmarking(); // what comes after this won't go into the timer
// this is just to force the compiler to use coords
double foo;
for (int j = 0 ; j < 500 ; ++j )
{
foo += coords[j][0] + coords[j][1] + coords[j][2];
}
cout << foo;
What sometimes works for me in these cases is to hide the in vitro test inside a function and pass the benchmark data sets through volatile pointers. This tells the compiler that it must not collapse subsequent writes to those pointers (because they might be eg memory-mapped I/O). So,
void test1( volatile double *coords )
{
//perform a simple initialization of all coordinates:
for (int i=0; i<1500; i+=3)
{
coords[i+0] = 3.23;
coords[i+1] = 1.345;
coords[i+2] = 123.998;
}
}
For some reason I haven't figured out yet it doesn't always work in MSVC, but it often does -- look at the assembly output to be sure. Also remember that volatile will foil some compiler optimizations (it forbids the compiler from keeping the pointer's contents in register and forces writes to occur in program order) so this is only trustworthy if you're using it for the final write-out of data.
In general in vitro testing like this is very useful so long as you remember that it is not the whole story. I usually test my new math routines in isolation like this so that I can quickly iterate on just the cache and pipeline characteristics of my algorithm on consistent data.
The difference between test-tube profiling like this and running it in "the real world" means you will get wildly varying input data sets (sometimes best case, sometimes worst case, sometimes pathological), the cache will be in some unknown state on entering the function, and you may have other threads banging on the bus; so you should run some benchmarks on this function in vivo as well when you are finished.
I don't know if GCC has a similar feature, but with VC++ you can use:
#pragma optimize
to selectively turn optimizations on/off. If GCC has similar capabilities, you could build with full optimization and just turn it off where necessary to make sure your code gets called.
Just a small example of an unwanted optimization:
#include <vector>
#include <iostream>
using namespace std;
int main()
{
double coords[500][3];
//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
cout << "hello world !"<< endl;
return 0;
}
If you comment the code from "double coords[500][3]" to the end of the for loop it will generate exactly the same assembly code (just tried with g++ 4.3.2). I know this example is far too simple, and I wasn't able to show this behavior with a std::vector of a simple "Coordinates" structure.
However I think this example still shows that some optimizations can introduce errors in the benchmark and I wanted to avoid some surprises of this kind when introducing new code in a library. It's easy to imagine that the new context might prevent some optimizations and lead to a very inefficient library.
The same should also apply with virtual functions (but I don't prove it here). Used in a context where a static link would do the job I'm pretty confident that decent compilers should eliminate the extra indirection call for the virtual function. I can try this call in a loop and conclude that calling a virtual function is not such a big deal.
Then I'll call it hundred of thousand times in a context where the compiler cannot guess what will be the exact type of the pointer and have a 20% increase of running time...
at startup, read from a file. in your code, say if(input == "x") cout<< result_of_benchmark;
The compiler will not be able to eliminate the calculation, and if you ensure the input is not "x", you won't benchmark the iostream.