C++ array to Halide Image (and back)

I'm getting started with Halide, and whilst I've grasped the basic tenets of its design, I'm struggling with the particulars (read: magic) required to efficiently schedule computations.
I've posted below a MWE of using Halide to copy an array from one location to another. I had assumed this would compile down to only a handful of instructions and take less than a microsecond to run. Instead, it produces 4000 lines of assembly and takes 40ms to run! Clearly, therefore, I have a significant hole in my understanding.
What is the canonical way of wrapping an existing array in a Halide::Image?
How should the function copy be scheduled to perform the copy efficiently?
Minimal working example
#include <Halide.h>
using namespace Halide;
void _copy(uint8_t* in_ptr, uint8_t* out_ptr, const int M, const int N) {
Image<uint8_t> in(Buffer(UInt(8), N, M, 0, 0, in_ptr));
Image<uint8_t> out(Buffer(UInt(8), N, M, 0, 0, out_ptr));
Var x,y;
Func copy;
copy(x,y) = in(x,y);
copy.realize(out);
}
int main(void) {
uint8_t in[10000], out[10000];
_copy(in, out, 100, 100);
}
Compilation Flags
clang++ -O3 -march=native -std=c++11 -Iinclude -Lbin -lHalide copy.cpp

Let me start with your second question: _copy takes a long time, because it needs to compile Halide code to x86 machine code. IIRC, Func caches the machine code, but since copy is local to _copy that cache cannot be reused. Anyways, scheduling copy is pretty simple because it's a pointwise operation: First, it would probably make sense to vectorize it. Second, it might make sense to parallelize it (depending on how much data there is). For example:
copy.vectorize(x, 32).parallel(y);
will vectorize along x with a vector size of 32 and parallelize along y. (I am making this up from memory, there might be some confusion about the correct names.) Of course, doing all this might also increase compile times...
There is no recipe for good scheduling. I do it by looking at the output of compile_to_lowered_stmt and profiling the code. I also use the AOT compilation provided by Halide::Generator, this makes sure that I only measure the runtime of the code and not the compile time.
Your other question was how to wrap an existing array in a Halide::Image. I don't do that, mostly because I use AOT compilation. However, internally Halide uses a type called buffer_t for everything image related. There is also a C++ wrapper called Halide::Buffer that makes using buffer_t a little easier; I think it can also be used in Func::realize instead of Halide::Image. The point is: if you understand buffer_t, you can wrap almost everything into something digestible by Halide.

To emphasize the first thing Florian mentioned, which I think is the key point of misunderstanding here: you appear to be timing the compilation of the copy operation ("pipeline," in common Halide terms), not just its execution. Your code size estimate is presumably also for the whole binary resulting from copy.cpp, not just the code in the Halide-generated copy function (which won't actually even appear in the binary you're compiling with clang, since it is only constructed by JITing at runtime in this program).
You can observe the actual cost of your pipeline here by first calling copy.compile_jit() before realize (realize implicitly calls compile_jit the first time it is run, so it's not necessary, but it's valuable to factor apart the runtime from the compile overhead). You would then put your timer exclusively around realize.
If you actually want to pre-compile this (or any other) pipeline for static linking into your ultimate program, which is what it seems you might be expecting, what you really want to do is use Func::compile_to_file in one program to compile and emit the code (as copy.h and copy.o), and then link and call these in another program. Check out tutorial lesson 10 to see this in more detail:
https://github.com/halide/Halide/blob/master/tutorial/lesson_10_aot_compilation_generate.cpp https://github.com/halide/Halide/blob/master/tutorial/lesson_10_aot_compilation_run.cpp

Related

How much do C/C++ compilers optimize conditional statements?

I recently ran into a situation where I wrote the following code:
for(int i = 0; i < (size - 1); i++)
{
// do whatever
}
// Assume 'size' will be constant during the duration of the for loop
When looking at this code, it made me wonder how exactly the for loop condition is evaluated for each loop. Specifically, I'm curious as to whether or not the compiler would 'optimize away' any additional arithmetic that has to be done for each loop. In my case, would this code get compiled such that (size - 1) would have to be evaluated for every loop iteration? Or is the compiler smart enough to realize that the 'size' variable won't change, and thus precalculate it once for all loop iterations?
This then got me thinking about the general case where you have a conditional statement that may specify more operations than necessary.
As an example, how would the following three pieces of code compile:
if(6)
if(1+1+1+1+1+1)
int foo = 1;
if(foo + foo + foo + foo + foo + foo)
How smart is the compiler? Will the 3 cases listed above be converted into the same machine code?
And while I'm at it, why not list another example? What does the compiler do if you are doing an operation within a conditional that won't have any effect on the end result? Example:
if(2*(val))
// Assume val is an int that can take on any value
In this example, the multiplication is completely unnecessary. While this case seems a lot stupider than my original case, the question still stands: will the compiler be able to remove this unnecessary multiplication?
Question:
How much optimization is involved with conditional statements?
Does it vary based on compiler?

Short answer: the compiler is exceptionally clever, and will generally optimise those cases that you have presented (including utterly ignoring irrelevant conditions).
One of the biggest hurdles language newcomers face in terms of truly understanding C++, is that there is not a one-to-one relationship between their code and what the computer executes. The entire purpose of the language is to create an abstraction. You are defining the program's semantics, but the computer has no responsibility to actually follow your C++ code line by line; indeed, if it did so, it would be abhorrently slow as compared to the speed we can expect from modern computers.
Generally speaking, unless you have a reason to micro-optimise (game developers come to mind), it is best to almost completely ignore this facet of programming, and trust your compiler. Write a program that takes the inputs you want, and gives the outputs you want, after performing the calculations you want… and let your compiler do the hard work of figuring out how the physical machine is going to make all that happen.
Are there exceptions? Certainly. Sometimes your requirements are so specific that you do know better than the compiler, and you end up optimising. You generally do this after profiling and determining what your bottlenecks are. And there's also no excuse to write deliberately silly code. After all, if you go out of your way to ask your program to copy a 50MB vector, then it's going to copy a 50MB vector.
But, assuming sensible code that means what it looks like, you really shouldn't spend too much time worrying about this, because modern compilers are so good at optimising that you'd be a fool to try to keep up.

The C++ language specification permits the compiler to make any optimization that results in no observable changes to the expected results.
If the compiler can determine that size is constant and will not change during execution, it can certainly make that particular optimization.
Alternatively, if the compiler can also determine that i is not used in the loop (and its value is not used afterwards), that it is used only as a counter, it might very well rewrite the loop to:
for(int i = 1; i < size; i++)
because that might produce smaller code. Even if this i is used in some fashion, the compiler can still make this change and then adjust all other usage of i so that the observable results are still the same.
To summarize: anything goes. The compiler may or may not make any optimization change as long as the observable results are the same.
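As a rough illustration of the (size - 1) case from the question, the sketch below writes the loop both ways. The function names are made up for this example. Since size is not modified inside the loop, an optimizing compiler is free to treat the two as equivalent, though whether a given compiler actually does so is best confirmed by reading the generated assembly.

```cpp
// Loop bound written inline in the condition, as in the question.
int count_iterations_inline(int size) {
    int iterations = 0;
    for (int i = 0; i < (size - 1); i++) {
        ++iterations;
    }
    return iterations;
}

// Loop bound hoisted by hand into a local constant before the loop.
int count_iterations_hoisted(int size) {
    const int limit = size - 1;
    int iterations = 0;
    for (int i = 0; i < limit; i++) {
        ++iterations;
    }
    return iterations;
}
```

Both functions return the same value for every size, which is the only thing the language requires the compiler to preserve.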

Yes, there is a lot of optimization, and it is very complex.
It varies based on the compiler, and it also varies based on the compiler options.
Check
https://meta.stackexchange.com/questions/25840/can-we-stop-recommending-the-dragon-book-please
for some book recommendations if you really want to understand what a compiler may do. It is a very complex subject.
You can also compile to assembly with the -S option (gcc / g++) to see what the compiler is really doing. Use -O3 / ... / -O0 / -O to experiment with different optimization levels.

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
So, at least for that purpose you are motivated to have something like (edit):
enum : unsigned int { warp_size = 32 };
somewhere in your headers. But now - which should I prefer, and when? : warpSize, or warp_size?
Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.

Let's get a couple of points straight. The warp size isn't a compile time constant and shouldn't be treated as one. It is an architecture specific runtime immediate constant (and its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX, however that changed at least 6 years ago if my memory doesn't fail me.
The value is available:
In CUDA C via warpSize, where it is not a compile time constant (the PTX WARP_SZ variable is emitted by the compiler in such cases).
In PTX assembler via WARP_SZ, where it is a runtime immediate constant
From the runtime API as a device property
Don't declare your own constant for the warp size; that is just asking for trouble. The normal use case for an in-kernel array dimensioned to be some multiple of the warp size would be to use dynamically allocated shared memory. You can read the warp size from the host API at runtime to get it. If you have a statically declared in-kernel array that you need to dimension from the warp size, use templates and select the correct instance at runtime. The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
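A minimal sketch of the "use templates and select the correct instance at runtime" idea, written as plain host-side C++ rather than CUDA so it stands alone. The names scratch_elements and dispatch_on_warp_size are hypothetical, and the template body is only a stand-in for a kernel that needs a statically sized array. In a real program the runtime value would come from the device properties (the warpSize field of cudaDeviceProp).

```cpp
#include <stdexcept>

// Hypothetical stand-in for a kernel: the warp size is a template parameter,
// so it can legally be used as an array dimension.
template <int WarpSize>
int scratch_elements() {
    int scratch[WarpSize * 2] = {};  // statically sized from the warp size
    return static_cast<int>(sizeof(scratch) / sizeof(scratch[0]));
}

// Pick the matching instantiation from a value only known at runtime.
int dispatch_on_warp_size(int runtime_warp_size) {
    switch (runtime_warp_size) {
        case 32: return scratch_elements<32>();
        case 64: return scratch_elements<64>();
        default: throw std::invalid_argument("unsupported warp size");
    }
}
```

Every supported warp size has to be listed in the switch, which is the boilerplate cost of this approach.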

Contrary to talonmies's answer, I find a warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language - on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).
The day we get a different warp size (e.g. 64) many things will change:
The warpSize will have to be adjusted obviously
Many warp-level intrinsics will need their signature adjusted, or a new version produced, e.g. int __ballot, and while int does not need to be 32-bit, it is most commonly so!
Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:
for (int i = 0; i < log2(warpSize); ++i) ...
that would be overly complex in something that is usually a time-critical piece of code.
warpIdx and laneIdx computation out of threadIdx would need to be adjusted. Currently, the most typical code I see for it is:
warpIdx = threadIdx.x/32;
laneIdx = threadIdx.x%32;
which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize this suddenly becomes a quite expensive operation!
At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant.
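To make the shift-and-mask point concrete, here is a small sketch in plain C++ (not CUDA) that computes the warp and lane index both ways. The names kWarpSize, WarpPosition and the two split functions are invented for this example, and the equivalence relies on the assumption that the warp size is the compile-time constant 32, a power of two, applied to an unsigned index.

```cpp
#include <cstdint>

constexpr std::uint32_t kWarpSize = 32;  // assumed fixed, power-of-two warp size

struct WarpPosition {
    std::uint32_t warp;  // index of the warp within the block
    std::uint32_t lane;  // index of the thread within its warp
};

// Division and modulo by a compile-time constant.
WarpPosition split_with_div_mod(std::uint32_t thread) {
    return {thread / kWarpSize, thread % kWarpSize};
}

// The same computation spelled out as a right shift and a mask.
WarpPosition split_with_shift_mask(std::uint32_t thread) {
    return {thread >> 5, thread & (kWarpSize - 1)};
}
```

When the divisor is only known at runtime, the compiler cannot make this substitution on its own and has to emit a general division.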
Also, if the amount of shared memory depends on the warpSize this forces you to use the dynamically allocated shmem (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays -- this forces you to do pointer arithmetic yourself and manually compute the sum of all memory usage.
Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:
deviceFunction<warp_size>(params)
This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
My suggestion would be to have a single header that controls all the model-specific constants, e.g.
#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32;
#endif
Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for newer architecture, you just need to alter this one piece of code.

What makes EXE's grow in size?

My executable was 364KB in size. It did not use a Vector2D class so I implemented one with overloaded operators.
I changed most of my code from
point.x = point2.x;
point.y = point2.y;
to
point = point2;
This resulted in removing nearly 1/3 of my lines of code and yet my exe is still 364KB. What exactly causes it to grow in size?

The compiler probably optimised your operator overload by inlining it. So it effectively compiles to the same code as your original example would. So you may have cut down a lot of lines of code by overloading the assignment operator, but when the compiler inlines, it takes the contents of your assignment operator and sticks it inline at the calling point.
Inlining is one of the ways an executable can grow in size. It's not the only way, as you can see in other answers.

What makes EXE’s grow in size?
External libraries, especially static libraries and debugging information, total size of your code, runtime library. More code, more libraries == larger exe.
To reduce size of exe, you need to process exe with gnu strip utility, get rid of all static libraries, get rid of C/C++ runtime libraries, disable all runtime checks and turn on compiler size optimizations. Working without CRT is a pain, but it is possible. Also there is a wcrt (alternative C runtime) library created for making small applications (by the way, it hasn't been updated/maintained during last 5 years).
The smallest exe that I was able to create with the msvc compiler is somewhere around 16 kilobytes. This was a windows application that displayed a single window and required msvcrt.dll to run. I've modified it a bit, and turned it into a practical joke that wipes out the picture on the monitor.
For impressive exe size reduction techniques, you may want to look at .kkrieger. It is a 3D first person shooter, 96 kilobytes total. The game has a large and detailed level, supports shaders, real-time shadows, etc. I.e. comparable with Sauerbraten (see screenshots). The smallest working windows application (3d demo with music) I ever encountered was 4 kilobytes big, and used compression techniques and (probably) undocumented features (i.e. the fact that a *.com executable could unpack and launch a win32 exe on windows xp).
In most cases, size of *.exe shouldn't really bother you (I haven't seen a diskette for a few years), as long as it is reasonable (below 100 megabytes). For example of "unreasonable" file size see debug build of Qt 4 for mingw.
This resulted in removing nearly 1/3 of my lines of code and yet my exe is still 364KB.
Most likely it is caused by external libraries used by compiler, runtime checks, etc.
Also, this is an assignment operation. If you aren't using custom types for x (with copy constructor), "copy" operation is very likely to result in small number of operations - i.e. removing 1/3 of lines doesn't guarantee that your code will be 1/3 shorter.
If you want to see how much impact your modification made, you could "ask" the compiler to produce an asm listing for both versions of the program, then compare the results (manually or with diff). Or you could disassemble and compare both versions of the executable. But I'm certain that using GNU strip or removing extra libraries will have more effect than removing assignment operators.

What type is point? If it's two floats, then the compiler will implicitly do a member-by-member copy, which is the same thing you did before.
EDIT: Apparently some people in today's crowd didn't understand this answer and compensated by downvoting. So let me elaborate:
Lines of code have NO relation to the executable size. The source code tells the compiler what assembly line to create. One line of code can cause hundreds if not thousands of assembly instructions. This is particularly true in C++, where one line can cause implicit object construction, destruction, copying, etc.
In this particular case, I suppose that "point" is a class with two floats, so using the assignment operator will perform a member-by-member copy, i.e. it takes every member individually and copies it. Which is exactly the same thing he did before, except that now it's done implicitly. The resulting assembly (and thus executable size) is the same.
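A short sketch of that equivalence, assuming (as this answer does) that the type is a plain struct of two floats. The names Point, copy_by_member and copy_by_assignment are illustrative only, not taken from the question's code.

```cpp
// Assumed layout of the point type: two floats and nothing else.
struct Point {
    float x;
    float y;
};

// The original style: copy each member by hand.
void copy_by_member(Point& dst, const Point& src) {
    dst.x = src.x;
    dst.y = src.y;
}

// The new style: the implicitly generated operator= copies x and y in turn.
void copy_by_assignment(Point& dst, const Point& src) {
    dst = src;
}
```

Both functions leave dst in the same state, so there is no reason to expect the machine code, or the executable size, to differ in any meaningful way.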

Executables are most often sized in 'pages' rather than discrete bytes.

I think this is a good example of why one shouldn't worry too much about code being too verbose if you have a good optimizing compiler. Instead, always code clearly so that fellow programmers can read your code, and leave the optimization to the compiler.

Some links to look into
http://www2.research.att.com/~bs/bs_faq.html#Hello-world
GCC C++ "Hello World" program -> .exe is 500kb big when compiled on Windows. How can I reduce its size?
http://www.catch22.net/tuts/minexe
As for Windows, lots of compiler options in VC++ may be activated like RTTI, exception handling, buffer checking, etc. that may add more behind the scenes to the overall size.

When you compile a C or C++ program into an executable, the compiler translates your code into machine code, applying optimizations as it sees fit.
But simply, more code = more machine code to generate = more size to the executable.

Also, check if you have a lot of static/global objects. These substantially increase your exe size if they are not zero initialized.
For example:
int temp[100] = {0};
int main()
{
}
The size of the above program is 9140 bytes on my linux machine.
If I initialize the first element of the temp array to 5, then the size will shoot up by around 400 bytes. The size of the below program on my linux machine is 9588.
int temp[100] = {5};
int main()
{
}
This is because zero-initialized global objects go into the .bss segment, which will be zeroed all at once during program startup, whereas the contents of non-zero-initialized objects are embedded in the exe itself.

Force compiler to not optimize side-effect-less statements

I was reading some old game programming books and as some of you might know, back in that day it was usually faster to do bit hacks than do things the standard way. (Converting float to int, mask sign bit, convert back for absolute value, instead of just calling fabs(), for example)
Nowadays it is almost always better to just use the standard library math functions, since these tiny things are hardly the cause of most bottlenecks anyway.
But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:
void float_to_int(float f)
{
int i = static_cast<int>(f); // has no side-effects
}
Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.
The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.
I'm using Visual Studio 2008 / G++ (3.4.4).
Edit
To clarify, I would like to have all optimizations maxed out, to get good profile results. The problem is that with this the statements with no side-effect will be optimized out, hence the situation.
Edit Again
To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.
We all know that the old tricks aren't very useful anymore, I'm merely curious how not useful they are. Just plain curiosity. Sure, life could go on without me knowing just how these old hacks perform against modern day CPU's, but it never hurts to know.
So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.
Premature quoting of Knuth is the root of all annoyance.

Assignment to a volatile variable should never be optimized away, so this might give you the result you want:
static volatile int i = 0;
void float_to_int(float f)
{
i = static_cast<int>(f); // the store to a volatile counts as a side effect
}

So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements
You are by definition skewing the results.
Here's how to fix the problem of trying to profile "dummy" code that you wrote just to test: For profiling, save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get any other optimizations it can put in to make the code fast.
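A minimal sketch of that idea, under the assumption that the code being measured is the float_to_int conversion from the question. The names kSamples, g_results and run_benchmark_body are hypothetical, and the timing calls and the final print of one element are left to the caller.

```cpp
#include <cstddef>

constexpr std::size_t kSamples = 1000;

// Global storage: every result is kept, so the computation that produced
// it cannot simply be dropped as dead code.
int g_results[kSamples];

int float_to_int(float f) {
    return static_cast<int>(f);
}

// The part to put between the timer start and stop.
void run_benchmark_body() {
    for (std::size_t n = 0; n < kSamples; ++n) {
        g_results[n] = float_to_int(static_cast<float>(n) + 0.5f);
    }
}
```

After the timed region, printing a single element (ideally one whose index is only known at runtime, such as one derived from argc) gives the program an observable dependency on the array contents.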

In this case I suggest you make the function return the integer value:
int float_to_int(float f)
{
return static_cast<int>(f);
}
Your calling code can then exercise it with a printf to guarantee it won't optimize it out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.
extern int float_to_int(float f);
int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);
Now compare this to an empty function like:
int take_float_return_int(float /* f */)
{
return 1;
}
Which should also be external.
The difference in times should give you an idea of the expense of what you're trying to measure.

What always worked on all compilers I used so far:
extern volatile int writeMe = 0;
void float_to_int(float f)
{
writeMe = static_cast<int>(f);
}
Note that this skews results; both methods should write to writeMe.
volatile tells the compiler "the value may be accessed without your notice", thus the compiler cannot omit the calculation and drop the result. To block propagation of input constants, you might need to run them through an extern volatile, too:
extern volatile float readMe = 0;
extern volatile int writeMe = 0;
void float_to_int(float f)
{
writeMe = static_cast<int>(f);
}
int main()
{
readMe = 17;
float_to_int(readMe);
}
Still, all optimizations inbetween the read and the write can be applied "with full force". The read and write to the global variable are often good "fenceposts" when inspecting the generated assembly.
Without the extern the compiler may notice that a reference to the variable is never taken, and thus determine it can't be volatile. Technically, with Link Time Code Generation, it might not be enough, but I haven't found a compiler that aggressive. (For a compiler that indeed removes the access, the reference would need to be passed to a function in a DLL loaded at runtime.)

Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, if the code behaves as if no optimisation takes place. However, you can often trick them into not doing so if you indicate that value might be used later, so I would change your code to:
int float_to_int(float f)
{
return static_cast<int>(f); // has no side-effects
}
As others have suggested, you will need to examine the assembler output to check that this approach actually works.

You just need to skip to the part where you learn something and read the published Intel CPU optimisation manual.
These quite clearly state that casting between float and int is a really bad idea because it requires a store from the int register to memory followed by a load into a float register. These operations cause a bubble in the pipeline and waste many precious cycles.

A function call incurs quite a bit of overhead, so I would remove this anyway.
Adding a dummy += i; is no problem, as long as you keep this same bit of code in the alternate profile too (that is, in the code you are comparing it against).
Last but not least: generate asm code. Even if you cannot code in asm, the generated code is typically understandable since it will have labels and commented C code behind it. So you know (sort of) what happens, and which bits are kept.
p.s. found this too:
inline float pslNegFabs32f(float x){
__asm{
fld x //Push 'x' into st(0) of FPU stack
fabs
fchs //change sign
fstp x //Pop from st(0) of FPU stack
}
return x;
}
supposedly also very fast. You might want to profile this too. (although it is hardly portable code)

Return the value?
int float_to_int(float f)
{
return static_cast<int>(f); // has no side-effects
}
and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.
You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.

If you are using Microsoft's compiler - cl.exe, you can use the following statement to turn optimization on/off on a per-function level [link to doc].
#pragma optimize("" ,{ on |off })
Turn optimizations off for functions defined after the current line:
#pragma optimize("" ,off)
Turn optimizations back on:
#pragma optimize("" ,on)
For example, in the Compiler Explorer listing linked below, you can notice 3 things.
Compiler optimizations flag is set - /O2, so code will get optimized.
Optimizations are turned off for first function - square(), and turned back on before square2() is defined.
Amount of assembly code generated for 1st function is higher. In second function there is no assembly code generated for int i = num; statement in code.
Thus while 1st function is not optimized, the second function is.
See https://godbolt.org/z/qJTBHg for link to this code on compiler explorer.
A similar directive exists for gcc too - https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html

A micro-benchmark around this statement will not be representative of using this approach in a genuine scenario; the surrounding instructions and their effect on the pipeline and cache are generally as important as any given statement in itself.

GCC 4 does a lot of micro-optimizations now, that GCC 3.4 has never done. GCC4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin(), fabs(), etc., as well as optimizing such calls to their FPU, SSE or 3D Now! equivalents.
I know the Intel compiler is also extremely good at these kinds of optimizations.
My suggestion is to not worry about micro-optimizations like this - on relatively new hardware (anything built in the last 5 or 6 years), they're almost completely moot.
Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast to int and bit mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus," are completely moot, and as pointed out in another answer, could potentially be slower than instructions on the FPU and in SSE.
All of this is due to the fact that newer CPUs are pipelined - instructions are decoded and dispatched to fast computation units. Instructions no longer run in terms of clock cycles, and are more sensitive to cache misses and inter-instruction dependencies.
Check the AMD and Intel processor programming manuals for all the gritty details.

How to correctly benchmark a [templated] C++ program

< background>
I'm at a point where I really need to optimize C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but I then used virtual functions called in nested loops. I had bad feelings about that and the first implementation proved that this was a bad idea. However this was OK for testing the concept.
< /background>
Now I need this feature to be as fast as possible (well without assembly code or GPU calculation, this still has to be C++ and more readable than less).
Now I know a little bit more about templates and class policies (from Alexandrescu's excellent book) and I think that a compile-time code generation may be the solution.
However I need to test the design before doing the huge work of implementing it into the library. The question is about the best way to test the efficiency of this new feature.
Obviously I need to turn optimizations on because without this g++ (and probably other compilers as well) would keep some unnecessary operations in the object code. I also need to make heavy use of the new feature in the benchmark because a delta of 1e-3 second can make the difference between a good and a bad design (this feature will be called millions of times in the real program).
The problem is that g++ is sometimes "too smart" while optimizing and can remove a whole loop if it considers that the result of a calculation is never used. I've already seen that once when looking at the output assembly code.
If I add some printing to stdout, the compiler will then be forced to do the calculation in the loop but I will probably mostly benchmark the iostream implementation.
So how can I do a correct benchmark of a little feature extracted from a library ?
Related question: is it a correct approach to do this kind of in vitro tests on a small unit or do I need the whole context ?
Thanks for any advice!
There seem to be several strategies, from compiler-specific options allowing fine tuning to more general solutions that should work with every compiler like volatile or extern.
I think I will try all of these.
Thanks a lot for all your answers!

If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.
template<typename T> void sink(T const& t) {
volatile T sinkhole = t;
}
No iostream overhead, just a copy that has to remain in the generated code.
Now, if you're collecting results from a lot of operations, it's best not to discard them one by one. These copies can still add some overhead. Instead, somehow collect all results in a single non-volatile object (so all individual results are needed) and then assign that result object to a volatile. E.g. if your individual operations all produce strings, you can force evaluation by adding all char values together modulo 2^32 (which is what unsigned 32-bit addition does anyway). This adds hardly any overhead; the strings will likely be in cache. The result of the addition will subsequently be assigned-to-volatile so each char in each string must in fact be calculated, no shortcuts allowed.
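One way this could look, as a sketch: the sink template is the one from this answer, while checksum and its use of std::vector<std::string> are assumptions made up for the illustration. The accumulator is a std::uint32_t, so the addition wraps modulo 2^32 without any extra work.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// The sink from above: copying into a volatile object forces the value
// to actually be produced.
template <typename T>
void sink(T const& t) {
    volatile T sinkhole = t;
}

// Fold every char of every result string into one 32-bit sum.
// Unsigned arithmetic wraps modulo 2^32 by definition.
std::uint32_t checksum(const std::vector<std::string>& results) {
    std::uint32_t sum = 0;
    for (const std::string& s : results) {
        for (unsigned char c : s) {
            sum += c;
        }
    }
    return sum;
}
```

At the end of the benchmark a single call such as sink(checksum(results)) replaces one volatile write per result.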

Unless you have a really aggressive compiler (it can happen), I'd suggest calculating a checksum (simply add all the results together) and outputting the checksum.
Other than that, you might want to look at the generated assembly code before running any benchmarks so you can visually verify that any loops are actually being run.

Compilers are only allowed to eliminate code branches that cannot happen. As long as the compiler cannot rule out that a branch should be executed, it will not eliminate it. As long as there is some data dependency somewhere, the code will be there and will be run. Compilers are not too smart about estimating which aspects of a program will not be run and don't try to, because that's an NP problem and hardly computable. They have some simple checks such as for if (0), but that's about it.
My humble opinion is that you were possibly hit by some other problem earlier on, such as the way C/C++ evaluates boolean expressions.
But anyways, since this is about a test of speed, you can check that things get called for yourself - run it once without, then another time with a test of return values. Or a static variable being incremented. At the end of the test, print out the number generated. The results will be equal.
To answer your question about in-vitro testing: Yes, do that. If your app is so time-critical, do that. On the other hand, your description hints at a different problem: if your deltas are in a timeframe of 1e-3 seconds, then that sounds like a problem of computational complexity, since the method in question must be called very, very often (for a few runs, 1e-3 seconds is negligible).
The problem domain you are modeling sounds VERY complex and the datasets are probably huge. Such things are always an interesting effort. Make sure that you absolutely have the right data structures and algorithms first, though, and micro-optimize all you want after that. So, I'd say look at the whole context first. ;-)
Out of curiosity, what is the problem you are calculating?

You have a lot of control over the optimizations for your compilation. -O1, -O2, and so on are just aliases for a bunch of switches.
From the man pages
-O2 turns on all optimization flags specified by -O. It also turns
on the following optimization flags: -fthread-jumps -falign-functions
-falign-jumps -falign-loops -falign-labels -fcaller-saves
-fcrossjumping -fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks -fexpensive-optimizations -fgcse
-fgcse-lm -foptimize-sibling-calls -fpeephole2 -fregmove
-freorder-blocks -freorder-functions -frerun-cse-after-loop
-fsched-interblock -fsched-spec -fschedule-insns -fschedule-insns2
-fstrict-aliasing -fstrict-overflow -ftree-pre -ftree-vrp
You can tweak and use this command to help you narrow down which options to investigate.
...
Alternatively you can discover which binary optimizations are
enabled by -O3 by using:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
Once you find the culprit optimization you shouldn't need the cout's.

If this is possible for you, you might try splitting your code into:
the library you want to test compiled with all optimizations turned on
a test program, dynamically linking the library, with optimizations turned off
Otherwise, you might specify a different optimization level (it looks like you're using gcc...) for the test function with the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes).
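As a rough sketch of the attribute approach: the macro `BENCH_NO_OPT` and the functions `feature` and `run_feature` are hypothetical names introduced here for illustration. The attribute is GCC-specific, so the macro expands to nothing on other compilers, and the GCC documentation describes it as intended mainly for debugging.

```cpp
// GCC-specific function attribute; other compilers get an empty macro.
#if defined(__GNUC__) && !defined(__clang__)
#define BENCH_NO_OPT __attribute__((optimize("O0")))
#else
#define BENCH_NO_OPT
#endif

// Hypothetical function under test. It is compiled at whatever
// optimization level was given on the command line (e.g. -O3).
double feature(double x) {
    return x * 1.5 + 2.0;
}

// Benchmark driver. Under GCC this one function is compiled as if
// with -O0, so the loop and the calls inside it are kept.
BENCH_NO_OPT double run_feature(int iterations) {
    double last = 0.0;
    for (int i = 0; i < iterations; ++i) {
        last = feature(static_cast<double>(i));
    }
    return last;
}
```

The trade-off is that the unoptimized driver adds loop and call overhead of its own, which gets measured together with the feature.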

You could create a dummy function in a separate cpp file that does nothing, but takes as argument whatever is the type of your calculation result. Then you can call that function with the results of your calculation, forcing gcc to generate the intermediate code, and the only penalty is the cost of invoking a function (which shouldn't skew your results unless you call it a lot!).

#include <iostream>
// Mark coords as extern.
// Compiler is now NOT allowed to optimise away coords
// Thus it cannot remove the loop where you initialise it.
// This is because the code could be used by another compilation unit
extern double coords[500][3];
double coords[500][3];
int main()
{
//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
std::cout << "hello world !"<< std::endl;
return 0;
}

edit: the easiest thing you can do is simply use the data in some spurious way after the function has run and outside your benchmarks. Like,
StartBenchmarking(); // ie, read a performance counter
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
StopBenchmarking(); // what comes after this won't go into the timer
// this is just to force the compiler to use coords
double foo = 0.0; // must be initialised before accumulating into it
for (int j = 0 ; j < 500 ; ++j )
{
foo += coords[j][0] + coords[j][1] + coords[j][2];
}
std::cout << foo;
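For a version that can be compiled as it stands, the placeholder calls StartBenchmarking() and StopBenchmarking() could be swapped for std::chrono::steady_clock. That substitution is an assumption made for this sketch, not something the answer prescribes, and the function names `time_fill` and `consume` are hypothetical.

```cpp
#include <chrono>

// Global array, as in the extern example above.
double coords[500][3];

// Time only the initialization loop; returns elapsed nanoseconds.
long long time_fill() {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 500; ++i) {
        coords[i][0] = 3.23;
        coords[i][1] = 1.345;
        coords[i][2] = 123.998;
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               stop - start).count();
}

// Runs outside the timed region: reads every element so that the
// stores made in time_fill() are actually used.
double consume() {
    double foo = 0.0;
    for (int j = 0; j < 500; ++j) {
        foo += coords[j][0] + coords[j][1] + coords[j][2];
    }
    return foo;
}
```

A caller would record the value from time_fill() and then print the value from consume(), so the printing cost stays out of the measurement.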
What sometimes works for me in these cases is to hide the in vitro test inside a function and pass the benchmark data sets through volatile pointers. This tells the compiler that it must not collapse subsequent writes to those pointers (because they might be, e.g., memory-mapped I/O). So,
void test1( volatile double *coords )
{
//perform a simple initialization of all coordinates:
for (int i=0; i<1500; i+=3)
{
coords[i+0] = 3.23;
coords[i+1] = 1.345;
coords[i+2] = 123.998;
}
}
For some reason I haven't figured out yet it doesn't always work in MSVC, but it often does -- look at the assembly output to be sure. Also remember that volatile will foil some compiler optimizations (it forbids the compiler from keeping the pointer's contents in register and forces writes to occur in program order) so this is only trustworthy if you're using it for the final write-out of data.
In general in vitro testing like this is very useful so long as you remember that it is not the whole story. I usually test my new math routines in isolation like this so that I can quickly iterate on just the cache and pipeline characteristics of my algorithm on consistent data.
The difference between test-tube profiling like this and running it in "the real world" means you will get wildly varying input data sets (sometimes best case, sometimes worst case, sometimes pathological), the cache will be in some unknown state on entering the function, and you may have other threads banging on the bus; so you should run some benchmarks on this function in vivo as well when you are finished.

I don't know if GCC has a similar feature, but with VC++ you can use:
#pragma optimize
to selectively turn optimizations on/off. If GCC has similar capabilities, you could build with full optimization and just turn it off where necessary to make sure your code gets called.

Just a small example of an unwanted optimization:
#include <vector>
#include <iostream>
using namespace std;
int main()
{
double coords[500][3];
//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
cout << "hello world !"<< endl;
return 0;
}
If you comment out the code from "double coords[500][3]" to the end of the for loop, it will generate exactly the same assembly code (just tried with g++ 4.3.2). I know this example is far too simple, and I wasn't able to show this behavior with a std::vector of a simple "Coordinates" structure.
However, I think this example still shows that some optimizations can introduce errors in the benchmark, and I wanted to avoid surprises of this kind when introducing new code in a library. It's easy to imagine that the new context might prevent some optimizations and lead to a very inefficient library.
The same should also apply to virtual functions (though I don't prove it here). When they are used in a context where a static link would do the job, I'm pretty confident that decent compilers will eliminate the extra indirection of the virtual call. I can try this call in a loop and conclude that calling a virtual function is not such a big deal.
Then I'll call it hundreds of thousands of times in a context where the compiler cannot guess the exact type behind the pointer, and get a 20% increase in running time...
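The two contexts described for virtual functions might look like the sketch below. The `Shape` and `Square` types and both function names are hypothetical, and whether a given compiler actually removes the indirection in the first case is not guaranteed; checking the generated assembly is the only way to know.

```cpp
// Hypothetical class hierarchy used only for illustration.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

// Context 1: the object's exact type is visible at the call site,
// so the compiler is free to resolve area() statically or inline it.
double known_type_sum(int n) {
    Square sq(2.0);
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += sq.area();
    }
    return total;
}

// Context 2: only the base type is known here, so each call may have
// to go through the virtual table.
double unknown_type_sum(const Shape& s, int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += s.area();
    }
    return total;
}
```

Benchmarking only the first function could give an optimistic picture of the cost that the second one pays in the real program.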

At startup, read some input from a file. In your code, say if(input == "x") cout << result_of_benchmark;
The compiler will not be able to eliminate the calculation, and if you ensure the input is not "x", you won't benchmark the iostream.
