Eigen: Coding style's effect on performance - c++

From what I've read about Eigen (here), it seems that operator=() acts as a "barrier" of sorts for lazy evaluation -- e.g. it causes Eigen to stop returning expression templates and actually perform the (optimized) computation, storing the result into the left-hand side of the =.
This would seem to mean that one's "coding style" has an impact on performance -- i.e. using named variables to store the result of intermediate computations might have a negative effect on performance by causing some portions of the computation to be evaluated "too early".
To try to verify my intuition, I wrote up an example and was surprised at the results (full code here):
using ArrayXf  = Eigen::Array<float, Eigen::Dynamic, Eigen::Dynamic>;
using ArrayXcf = Eigen::Array<std::complex<float>, Eigen::Dynamic, Eigen::Dynamic>;

float test1( const MatrixXcf & mat )
{
    ArrayXcf arr  = mat.array();
    ArrayXcf conj = arr.conjugate();
    ArrayXcf magc = arr * conj;
    ArrayXf  mag  = magc.real();
    return mag.sum();
}

float test2( const MatrixXcf & mat )
{
    return ( mat.array() * mat.array().conjugate() ).real().sum();
}

float test3( const MatrixXcf & mat )
{
    ArrayXcf magc = ( mat.array() * mat.array().conjugate() );
    ArrayXf  mag  = magc.real();
    return mag.sum();
}
The above gives 3 different ways of computing the coefficient-wise sum of magnitudes in a complex-valued matrix.
test1 sort of takes each portion of the computation "one step at a time."
test2 does the whole computation in one expression.
test3 takes a "blended" approach -- with some amount of intermediate variables.
I sort of expected that since test2 packs the entire computation into one expression, Eigen would be able to take advantage of that and globally optimize the entire computation, providing the best performance.
However, the results were surprising (numbers shown are in total microseconds across 1000 executions of each test):
test1_us: 154994
test2_us: 365231
test3_us: 36613
(This was compiled with g++ -O3 -- see the gist for full details.)
The version I expected to be fastest (test2) was actually slowest. Also, the version that I expected to be slowest (test1) was actually in the middle.
So, my questions are:
Why does test3 perform so much better than the alternatives?
Is there a technique one can use (short of diving into the assembly code) to get some visibility into how Eigen is actually implementing your computations?
Is there a set of guidelines to follow to strike a good tradeoff between performance and readability (use of intermediate variables) in your Eigen code?
In more complex computations, doing everything in one expression could hinder readability, so I'm interested in finding the right way to write code that is both readable and performant.

It looks like a problem with GCC; the Intel compiler gives the expected result.
$ g++ -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 200087
test2_us: 320033
test3_us: 44539
$ icpc -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 214537
test2_us: 23022
test3_us: 42099
Compared to the icpc version, gcc seems to have trouble optimizing your test2.
For more precise results, you may want to turn off the debug assertions with -DNDEBUG as shown here.
EDIT
For question 1
@ggael gives an excellent answer: gcc fails to vectorize the sum loop. My experiment also finds that test2 is as fast as a hand-written naive for-loop, both with gcc and icc, suggesting that vectorization is the reason; and no temporary memory allocation is detected in test2 by the method mentioned below, suggesting that Eigen evaluates the expression correctly.
For question 2
Avoiding intermediate memory is the main reason Eigen uses expression templates. So Eigen provides a macro EIGEN_RUNTIME_NO_MALLOC and a simple function to let you check whether any intermediate memory is allocated while evaluating an expression. You can find sample code here. Note this may only work in debug mode.
EIGEN_RUNTIME_NO_MALLOC - if defined, a new switch is introduced which
can be turned on and off by calling set_is_malloc_allowed(bool). If
malloc is not allowed and Eigen tries to allocate memory dynamically
anyway, an assertion failure results. Not defined by default.
For question 3
There is a way to use intermediate variables and still get the performance benefit of lazy evaluation and expression templates.
The trick is to give the intermediate variables the right type. Instead of Eigen::Matrix/Array, which forces the expression to be evaluated, use the expression type itself (derived from Eigen::MatrixBase/ArrayBase/DenseBase) so that the expression is only buffered, not evaluated. That is, store the expression as the intermediate, rather than the result of the expression, with the caveat that the intermediate should only be used once in the following code.
Since determining the template parameters of the expression type Eigen::MatrixBase/... can be painful, you can use auto instead. You can find some hints on when you should/should not use auto/expression types on this page. Another page also tells you how to pass expressions as function parameters without evaluating them.
According to the instructive experiment with .abs2() in @ggael's answer, I think another guideline is to avoid reinventing the wheel.

What happens is that, because of the .real() step, Eigen won't explicitly vectorize test2. It will thus call the standard complex::operator*, which, unfortunately, is never inlined by gcc. The other versions, on the other hand, use Eigen's own vectorized product implementation for complexes.
In contrast, ICC does inline complex::operator*, making test2 the fastest for ICC. You can also rewrite test2 as:
return mat.array().abs2().sum();
to get even better performance on all compilers:
gcc:
test1_us: 66016
test2_us: 26654
test3_us: 34814
icpc:
test1_us: 87225
test2_us: 8274
test3_us: 44598
clang:
test1_us: 87543
test2_us: 26891
test3_us: 44617
The extremely good score of ICC in this case is due to its clever auto-vectorization engine.
Another way to work around the inlining failure of gcc without modifying test2 is to define your own operator* for complex<float>. For instance, add the following at the top of your file:
namespace std {
    complex<float> operator*(const complex<float> &a, const complex<float> &b) {
        return complex<float>(real(a)*real(b) - imag(a)*imag(b),
                              imag(a)*real(b) + real(a)*imag(b));
    }
}
and then I get:
gcc:
test1_us: 69352
test2_us: 28171
test3_us: 36501
icpc:
test1_us: 93810
test2_us: 11350
test3_us: 51007
clang:
test1_us: 83138
test2_us: 26206
test3_us: 45224
Of course, this trick is not always recommended because, in contrast to the glibc version, it might lead to overflow or numerical cancellation issues, but it is what icpc and the other vectorized versions compute anyway.

One thing I have done before is to make use of the auto keyword a lot. Keeping in mind that most Eigen expressions return special expression datatypes (e.g. CwiseBinaryOp), an assignment back to a Matrix may force the expression to be evaluated (which is what you are seeing). Using auto allows the compiler to deduce the return type as whatever expression type it is, which will avoid evaluation as long as possible:
float test1( const MatrixXcf & mat )
{
    auto arr  = mat.array();
    auto conj = arr.conjugate();
    auto magc = arr * conj;
    auto mag  = magc.real();
    return mag.sum();
}
This should essentially be closer to your second test case. In some cases I have had good performance improvements while keeping readability (you do not want to have to spell out the expression template types). Of course, your mileage may vary, so benchmark carefully :)

I just want to note that you did the profiling in a non-optimal way, so the issue could simply be your profiling method.
Since there are many things like cache locality to take into account, you should do the profiling this way:
int warmUpCycles  = 100;
int profileCycles = 1000;

// TEST 1
for (int i = 0; i < warmUpCycles; i++)
    doTest1();

auto tick = std::chrono::steady_clock::now();
for (int i = 0; i < profileCycles; i++)
    doTest1();
auto tock = std::chrono::steady_clock::now();

test1_us = std::chrono::duration_cast<std::chrono::microseconds>(tock - tick).count();

// TEST 2
// TEST 3
Once you have run the test properly, then you can come to conclusions.
I highly suspect that, since you are profiling one operation at a time, you end up using the cached version in the third test, since operations are likely to be re-ordered by the compiler.
Also try different compilers, to see whether the problem is the unrolling of templates (there is a depth limit when optimizing templates; you may well hit it with a single big expression).
Also, if Eigen supports move semantics, there's no reason one version should be faster, since it is not always guaranteed that expressions can be optimized.
Please try it and let me know; this is interesting. Also be sure to enable optimizations with flags like -O3; profiling without optimization is meaningless.
To prevent the compiler from optimizing everything away, take the initial input from a file or cin and then re-feed it to the functions.

Related

speeding up complex-number multiplication in C++

I have some code which multiplies complex numbers, and have noticed that __mulxc3 (the long double version of __muldc3) is being called frequently: i.e. the complex number multiplications are not being inlined.
I am compiling with g++ version 7.5, with -O3 and -ffast-math.
It is similar to this question, except the problem persists when I compile with -ffast-math. Since I do not require checks for whether the arguments are Inf or NaN, I was considering writing my own very simple complex class without such checks to allow the multiplication to be inlined, but given my lack of C++ proficiency, and having read this article, I suspect that would be counterproductive.
So, is there a way to change either my code or compilation process so that I can keep using std::complex, but inline the multiplication?

Will an operation done several times in sequence be simplified by compiler?

I've had this question for a long time but never knew where to look. If a certain operation is written many times will the compiler simplify it or will it run the exact same operation and get the exact same answer?
For example, in the following C-like pseudo-code, (i%3)*10 is repeated many times.
for (int i = 0; i < 100; i++) {
    array[(i%3)*10] = someFunction((i%3)*10);
    int otherVar = (i%3)*10 + array[(i%3)*10];
    int lastVar = (i%3)*10 - otherVar;
    anotherFunction(lastVar);
}
I understand a variable would be better for readability, but is it also faster? Is (i%3)*10 calculated 5 times per loop iteration?
There are certain cases where I don't know if it's faster to use a variable or just leave the original operation.
Edit: using gcc (MinGW.org GCC-8.2.0-3) 8.2.0 on win 10
Which optimizations are done depends on the compiler, the compiler optimization flag(s) you specify, and the architecture.
Here are a few possible optimizations for your example:
Loop unrolling: this makes the binary larger and is thus a trade-off; for example, you may not want it on a tiny microprocessor with very little memory.
Common subexpression elimination (CSE): you can be pretty sure that your (i % 3) * 10 will only be computed once per loop iteration.
About your concern about visual clarity vs. optimization: When dealing with a 'local situation' like yours, you should focus on code clarity.
Optimization gains are often to be made at a higher level; for example in the algorithm you use.
There's a lot to be said about optimization; the above are just a few opening remarks. It's great that you're interested in how things work, because this is important for a good (C/C++) programmer.
As a matter of course, you should remove the obfuscation present in your code:
for (int i = 0; i < 100; ++i) {
    int i30 = i % 3 * 10;
    int r = someFunction(i30);
    array[i30] = r;
    anotherFunction(-r);
}
Suddenly, it looks quite a lot simpler.
Leave it to the compiler (with appropriate options) to optimize your code unless you find you actually have to take a hand after measuring.
In this case, unrolling three times looks like a good idea for the compiler to pursue. Though inlining might always reveal even better options.
Yes, operations done several times in sequence will be optimized by a compiler.
To go into more detail, all major compilers (GCC, Clang, and MSVC) store the value of (i%3)*10 into a temporary (scratch, junk) register, and then use that whenever an equivalent expression is used again.
This optimization is called common subexpression elimination (CSE); GCC's global variant of the pass is called GCSE (global common subexpression elimination).
This takes a decent chunk out of the time that it takes to compute the loop.

`boost::simd::bitwise_and` and type compatibility

I use boost::simd for my program. Curiously, the whole program actually runs slower with boost::simd than without. I managed to track down the line that causes the overwhelming majority of CPU runtime:
using pack_t = boost::simd::pack<double>;
using logical_pack_t = boost::simd::pack<boost::simd::logical<double>, pack_t::static_size>;
using iters_pack_t = boost::simd::pack<std::uint64_t, pack_t::static_size>;
static_assert(sizeof(double) == sizeof(std::uint64_t), "mismatch of pack sizes");
const iters_pack_t zero(0);
const iters_pack_t one(1);
iters_pack_t increment(1);
logical_pack_t condition = /* ... */;
increment = boost::simd::if_else(condition, one, zero); // bottleneck
increment = boost::simd::bitwise_and(increment, condition); // better version, doesn't compile
As stated in the source code, I assume that bitwise_and should bring a performance boost. However, when trying to compile that variant, my compiler prints pages of cryptic error messages (as always with TMP-based libraries). I suppose that this is due to the fact that increment and condition are not of the same type. This assumption is supported by the fact that the code compiles once I change that line to the nonsensical increment = boost::simd::bitwise_and(increment, increment);.
The documentation states that both operands must solely share the same bit size, which they do in my case. Therefore, I don't understand why my code doesn't compile.
Compiled with -march=native on an Ivy Bridge (AVX support but not AVX2).
Two things:
Are you using the old boost.SIMD living in NT2, or the new standalone version at https://github.com/NumScale/boost.simd? The latter should have better performance.
The best course of action is to use boost::simd::if_inc, which does the correct optimization whenever possible by using the bitwise representation of True (i.e. -1) to do the computation.

Is it good practice to construct long circuit statements?

Question Context: [C++] I want to know what is theoretically the fastest, and what the compiler will do. I don't want to hear about premature optimization is the root of all evil, etc.
I was writing some code like this:
bool b0 = ...;
bool b1 = ...;
if (b0 && b1)
{
...
}
But then I was thinking: the code, as-is, will compile into two TEST instructions, if compiled without optimizations. This means two branches. So I was thinking that it might be better to write:
if (b0 & b1)
Which will produce only one TEST instruction, if no optimization is done by the compiler. But then I feel that this is against my code-style. I usually write && and ||.
Q: What will the compiler do if I turn on optimization flags (-O1, -O2, -O3, -Os and -Ofast). Will the compiler automatically compile it like &, even if I have used a && in the code? And what is theoretically faster? Does the behavior change if I do this:
if (b0 && b1)
{ ... }
else if (b0)
{ ... }
else if (b1)
{ ... }
else
{ ... }
Q: As I could have guessed, this is very depended on the situation, but is it a common trick for a compiler to replace a && with a &?
Q: What will the compiler do if I turn on optimization flags (-O1, -O2, -O3, -Os and -Ofast).
Most likely nothing further; there is little left here for higher optimization levels to improve.
As stated in my comments, you really can't optimize the evaluation any further than:
AND B0 WITH B1 (sets condition flags)
JUMP ZERO TO ...
Although, if you have a lot of simple boolean logic or data operations, some processors may conditionally execute them.
Will the compiler automatically compile it like &, even if I have used a && in the code?
And what is theoretically faster?
In most platforms, there is no difference in evaluation of A & B versus A && B.
In the final evaluation, either a compare or an AND instruction is executed, then a jump based on the status. Two instructions.
Most processors don't have Boolean registers. It's all numbers and bits.
Optimize By Boolean Logic
Your best option is to review the design and set up your algorithms to use Boolean algebra. You can then simplify the Boolean expressions.
Another option is to implement the code so that the compiler can generate conditional assembly instructions, if the platform supports them.
Optimize: Reduce jumps
Processors favor arithmetic and data transfers over jumps.
Many processors are always feeding an instruction pipeline. When it comes to a conditional branch instruction, the processor has to wait (suspend the instruction prefetching) until the condition status is determined. Then it can determine where the next instruction will be fetched.
If you can't remove the jumps, such as in a loop, make the ratio of data processing to jumping bigger in the data side. Search for "Loop Unrolling". Many compilers will perform this when optimization levels are increased.
Optimize: Data Cache
You may notice increased performance by organizing your data for best data cache usage.
For example, instead of 3 large arrays, use one array of a structure containing 3 elements. This allows the elements in use to be close to each other (and reduce the likelihood of accessing data outside of the cache).
Summary
The difference in evaluation of A && B versus A & B as conditional expressions is a micro-optimization. You will achieve improved performance by using Boolean algebra to reduce the quantity of conditional expressions. Jumps, or changes in execution path, slow down instruction execution. Fetching data outside the data cache also slows down execution. You will most likely get better performance by redesigning your code to help the compiler reduce branches and make more effective use of the data cache.
If you care about what's fastest, why do you care what the compiler will do without optimisation?
Q: As I could have guessed, this is very depended on the situation, but is it a common trick for a compiler to replace a && with a &?
This question seems to assume that the compiler transforms C++ code into more C++ code. It doesn't. It transforms your code into machine instructions (including the assembler as part of the compiler for argument's sake). You should not assume there is a one-to-one mapping from a C++ operator like && or & to a particular instruction.
With optimisation the compiler will do whatever it thinks will be faster. If a single instruction would be faster the compiler will generate a single instruction for if (b0 && b1), you don't need to bugger up your code with micro-optimisations to help it make such a simple transformation.
The compiler knows the instruction set it's using, it knows the context the condition is in and whether it can be removed entirely as dead code, or moved elsewhere to help the pipeline, or simplified by constant propagation, etc. etc.
And if you really care about what's fastest, why would you compute b1 until you know it's actually needed? If obtaining the value of b1 has no side effects the compiler could even transform your code to:
bool b0 = ...;
if (b0)
{
    bool b1 = ...;
    if (b1)
    {
Does that mean two if conditions are faster than a &?! Of course not.
In other words, the whole premise of the question is flawed. Do not compromise the readability and simplicity of your code in the misguided pursuit of the "theoretically fastest" micro-optimisation. Spend your time improving the algorithms and data structures used not trying to second guess which instructions the compiler will generate.

How to correctly benchmark a [templated] C++ program

<background>
I'm at a point where I really need to optimize C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but I then used virtual functions called in nested loops. I had bad feelings about that and the first implementation proved that this was a bad idea. However this was OK for testing the concept.
</background>
Now I need this feature to be as fast as possible (well, without resorting to assembly code or GPU computation; this still has to be C++, and rather more readable than less).
Now I know a little bit more about templates and class policies (from Alexandrescu's excellent book) and I think that a compile-time code generation may be the solution.
However I need to test the design before doing the huge work of implementing it into the library. The question is about the best way to test the efficiency of this new feature.
Obviously I need to turn optimizations on because without this g++ (and probably other compilers as well) would keep some unnecessary operations in the object code. I also need to make a heavy use of the new feature in the benchmark because a delta of 1e-3 second can make the difference between a good and a bad design (this feature will be called million times in the real program).
The problem is that g++ is sometimes "too smart" while optimizing and can remove a whole loop if it considers that the result of a calculation is never used. I've already seen that once when looking at the output assembly code.
If I add some printing to stdout, the compiler will then be forced to do the calculation in the loop but I will probably mostly benchmark the iostream implementation.
So how can I do a correct benchmark of a little feature extracted from a library ?
Related question: is it a correct approach to do this kind of in vitro tests on a small unit or do I need the whole context ?
Thanks for the advice!
There seem to be several strategies, from compiler-specific options allowing fine tuning to more general solutions that should work with every compiler like volatile or extern.
I think I will try all of these.
Thanks a lot for all your answers!
If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.
template<typename T> void sink(T const& t) {
    volatile T sinkhole = t;
}
No iostream overhead, just a copy that has to remain in the generated code.
Now, if you're collecting results from a lot of operations, it's best not to discard them one by one; those copies can still add some overhead. Instead, somehow collect all results in a single non-volatile object (so all individual results are needed) and then assign that result object to a volatile. E.g. if your individual operations all produce strings, you can force evaluation by adding all char values together modulo 1<<32. This adds hardly any overhead; the strings will likely be in cache. The result of the addition is subsequently assigned to a volatile, so each char in each string must in fact be calculated, no shortcuts allowed.
Unless you have a really aggressive compiler (it can happen), I'd suggest calculating a checksum (simply add all the results together) and outputting that checksum.
Other than that, you might want to look at the generated assembly code before running any benchmarks so you can visually verify that any loops are actually being run.
Compilers are only allowed to eliminate code branches that cannot happen. As long as it cannot rule out that a branch will be executed, the compiler will not eliminate it. As long as there is some data dependency somewhere, the code will be there and will be run. Compilers are not very ambitious about estimating which parts of a program will never run, because that problem is intractable in general. They have some simple checks such as for if (0), but that's about it.
My humble opinion is that you were possibly hit by some other problem earlier on, such as the way C/C++ evaluates boolean expressions.
But anyways, since this is about a test of speed, you can check that things get called for yourself - run it once without, then another time with a test of return values. Or a static variable being incremented. At the end of the test, print out the number generated. The results will be equal.
To answer your question about in-vitro testing: yes, do that. If your app is so time-critical, do it. On the other hand, your description hints at a different problem: if your deltas are in a timeframe of 1e-3 seconds, then that sounds like a problem of computational complexity, since the method in question must be called very, very often (for few runs, 1e-3 seconds is negligible).
The problem domain you are modeling sounds VERY complex and the datasets are probably huge. Such things are always an interesting effort. Make sure that you absolutely have the right data structures and algorithms first, though, and micro-optimize all you want after that. So, I'd say look at the whole context first. ;-)
Out of curiosity, what is the problem you are calculating?
You have a lot of control over the optimizations for your compilation. -O1, -O2, and so on are just aliases for a bunch of switches.
From the man page:
-O2 turns on all optimization flags specified by -O. It also turns
on the following optimization flags: -fthread-jumps -falign-functions
-falign-jumps -falign-loops -falign-labels -fcaller-saves
-fcrossjumping -fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks -fexpensive-optimizations -fgcse
-fgcse-lm -foptimize-sibling-calls -fpeephole2 -fregmove
-freorder-blocks -freorder-functions -frerun-cse-after-loop
-fsched-interblock -fsched-spec -fschedule-insns -fschedule-insns2
-fstrict-aliasing -fstrict-overflow -ftree-pre -ftree-vrp
You can tweak and use this command to help you narrow down which options to investigate.
...
Alternatively you can discover which binary optimizations are
enabled by -O3 by using:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
Once you find the culprit optimization, you shouldn't need the couts.
If this is possible for you, you might try splitting your code into:
the library you want to test compiled with all optimizations turned on
a test program, dynamically linking the library, with optimizations turned off
Otherwise, you might specify a different optimization level (it looks like you're using gcc...) for the test function with the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes).
You could create a dummy function in a separate cpp file that does nothing, but takes as argument whatever is the type of your calculation result. Then you can call that function with the results of your calculation, forcing gcc to generate the intermediate code, and the only penalty is the cost of invoking a function (which shouldn't skew your results unless you call it a lot!).
#include <iostream>

// Mark coords as extern.
// The compiler is now NOT allowed to optimise away coords,
// thus it can not remove the loop where you initialise it.
// This is because the code could be used by another compilation unit.
extern double coords[500][3];
double coords[500][3];

int main()
{
    //perform a simple initialization of all coordinates:
    for (int i = 0; i < 500; ++i)
    {
        coords[i][0] = 3.23;
        coords[i][1] = 1.345;
        coords[i][2] = 123.998;
    }

    std::cout << "hello world !" << std::endl;
    return 0;
}
edit: the easiest thing you can do is simply use the data in some spurious way after the function has run and outside your benchmarks. Like,
StartBenchmarking(); // i.e., read a performance counter
for (int i = 0; i < 500; ++i)
{
    coords[i][0] = 3.23;
    coords[i][1] = 1.345;
    coords[i][2] = 123.998;
}
StopBenchmarking(); // what comes after this won't go into the timer

// this is just to force the compiler to use coords
double foo = 0.0;
for (int j = 0; j < 500; ++j)
{
    foo += coords[j][0] + coords[j][1] + coords[j][2];
}
cout << foo;
What sometimes works for me in these cases is to hide the in vitro test inside a function and pass the benchmark data sets through volatile pointers. This tells the compiler that it must not collapse subsequent writes to those pointers (because they might be eg memory-mapped I/O). So,
void test1( volatile double *coords )
{
    //perform a simple initialization of all coordinates:
    for (int i = 0; i < 1500; i += 3)
    {
        coords[i+0] = 3.23;
        coords[i+1] = 1.345;
        coords[i+2] = 123.998;
    }
}
For some reason I haven't figured out yet, this doesn't always work in MSVC, but it often does; look at the assembly output to be sure. Also remember that volatile will foil some compiler optimizations (it forbids the compiler from keeping the pointer's contents in a register and forces writes to occur in program order), so this is only trustworthy if you're using it for the final write-out of data.
In general in vitro testing like this is very useful so long as you remember that it is not the whole story. I usually test my new math routines in isolation like this so that I can quickly iterate on just the cache and pipeline characteristics of my algorithm on consistent data.
The difference between test-tube profiling like this and running it in "the real world" means you will get wildly varying input data sets (sometimes best case, sometimes worst case, sometimes pathological), the cache will be in some unknown state on entering the function, and you may have other threads banging on the bus; so you should run some benchmarks on this function in vivo as well when you are finished.
I don't know if GCC has a similar feature, but with VC++ you can use:
#pragma optimize
to selectively turn optimizations on/off. If GCC has similar capabilities, you could build with full optimization and just turn it off where necessary to make sure your code gets called.
Just a small example of an unwanted optimization:
#include <vector>
#include <iostream>
using namespace std;

int main()
{
    double coords[500][3];

    //perform a simple initialization of all coordinates:
    for (int i = 0; i < 500; ++i)
    {
        coords[i][0] = 3.23;
        coords[i][1] = 1.345;
        coords[i][2] = 123.998;
    }

    cout << "hello world !" << endl;
    return 0;
}
If you comment out the code from "double coords[500][3]" to the end of the for loop, it will generate exactly the same assembly code (just tried with g++ 4.3.2). I know this example is far too simple, and I wasn't able to show this behavior with a std::vector of a simple "Coordinates" structure.
However, I think this example still shows that some optimizations can introduce errors in a benchmark, and I wanted to avoid surprises of this kind when introducing new code into a library. It's easy to imagine that the new context might prevent some optimizations and lead to a very inefficient library.
The same should also apply to virtual functions (but I don't prove it here). Used in a context where static linkage would do the job, I'm pretty confident that decent compilers can eliminate the extra indirect call for the virtual function. I could try such a call in a loop and conclude that calling a virtual function is not such a big deal.
Then I'd call it hundreds of thousands of times in a context where the compiler cannot guess the exact type of the pointer, and see a 20% increase in running time...
At startup, read input from a file or cin. Then, in your code, write if (input == "x") cout << result_of_benchmark;
The compiler will not be able to eliminate the calculation, and if you ensure the input is never "x", you won't benchmark the iostream.