C++ Time measurement of functions

I need to measure the running time of a C++ program, especially the overall running time of some recursive functions. There are a lot of function calls inside other functions. My first thought was to implement some time measurement functions in the actual code.
The problem with gprof is that it prints out the time of class operators of a datatype, but I only need the information about my own functions, and "-f func_name prog_name" won't work.
So, what is the most common way in science to measure the running time of a numerical program?
It's something like this:
void function2()
{
}
void function1()
{
    function2();
    function1();   // recursive call (terminates on some condition in the real code)
}
int main()
{
    function1();
}

If you're using the GNU toolchain (i.e. gcc), you can try gprof. Just compile your program with the -g and -pg flags and then run
gprof <your_program_name>
gprof: http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html
EDIT:
In order to increase the level of detail you can run gprof with other flags:
-l (--line) Enable line-by-line profiling, giving you a histogram where hits are charged to individual lines of code instead of to functions.
-a Don't include private (statically declared) functions in the output.
-e <function> Exclude output for the function <function>. Use this when there are functions that won't be changed. For example, some sites have source code that's been approved by a regulatory agency, and no matter how inefficient, the code will remain unchanged.
-E <function> Also exclude the time spent in the function from the percentage tables.
-f <function> The opposite of -e: only track time in <function>.
-F <function> Only use the time in <function> when calculating percentages.
-b Don't print the explanatory text. If you're more experienced, you may appreciate this option.
-s Accumulate samples. By running the program several times, it's possible to get a better picture of where time is spent. For example, a slow routine may not be called for all input values, and therefore you may be misled about where to find performance problems.

If you need higher precision (for functions that take only a few milliseconds or less), you can use std::chrono::high_resolution_clock:
auto beginT = std::chrono::high_resolution_clock::now();
// Your computation here
auto endT = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(endT - beginT).count() << " ms" << std::endl;
The std::chrono::high_resolution_clock can be found in the <chrono> header and is part of the C++11 standard.
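If the same measurement is needed around several different calls, it can be wrapped in a small helper. Below is a minimal sketch; timeCall is my own name rather than a standard facility, and the dummy function1 merely stands in for the recursive function from the question:
#include <chrono>
#include <iostream>
#include <utility>

// Calls f(args...) once and returns the elapsed wall-clock time in milliseconds.
template <typename F, typename... Args>
double timeCall(F&& f, Args&&... args)
{
    const auto start = std::chrono::high_resolution_clock::now();
    std::forward<F>(f)(std::forward<Args>(args)...);
    const auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Stand-in for the recursive function from the question.
void function1()
{
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) x = x + i;
}

int main()
{
    std::cout << "function1 took " << timeCall(function1) << " ms\n";
}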

Related

Tuning the resolution in callgrind

Sorry, I can't create a minimal complete example as the problem only occurs for relatively large programs and I am not sure this is even a 'bug' per se as opposed to a misunderstanding of what callgrind profiling is supposed to accomplish.
I have a large program whose run time is split about 50/50 into two sequential parts. The first part is mostly file reading, and the second mostly computation.
The order of function calls that I would expect is the following:
Calling Scope, Callee
main Part1_main
Part1_main Part1_main_subfunction_1
Part1_main Part1_main_subfunction_2
Part1_main Part1_main_subfunction_3
main Part2_main
Part2_main Part2_main_subfunction_1
Part2_main Part2_main_subfunction_2
..
..
When I run callgrind on the code (and then view the results in kcachegrind on OS X), the results regarding function calls are approximately what you would expect, except for one thing: there is no resolution of function calls within the second part. The profile output is qualitatively the same as:
Function, Pct_time, Self_time
Part1_main 50 4
Part2_main 50 50
Part1_main_subfunction_1 20 4
Part1_main_subfunction_2 15 5
..
..
..
How should I interpret the second function having such a high self time? It seems the profiler thinks it is not calling any other functions. I suppose it is possible, though unlikely, that everything in Part2_main is inlined, so maybe there shouldn't be any further resolution. If that is true, this doesn't yield very interesting profiling results.
If you ever come across this type of thing, how do you force the profiler to show further resolution? Or, if my intuition is wrong, what else could be causing this behaviour?
As per instruction of the callgrind website, I am compiling with the -g flag, and optimisation turned on.
By default kcachegrind hides functions with small weight, but you can customize this. See the answer here:
Make callgrind show all function calls in the kcachegrind callgraph

Removing code portions does not match the profiler's data

I am doing a little proof of concept profile and optimize type of example. However, I ran into something that I can't quite explain, and I'm hoping that someone here can clear this up.
I wrote a very short snippet of code:
#include <stdio.h>

// a, b and c are shared between main and the callbacks.
int a, b, c;

void callbackWasterOne(void);
void callbackWasterTwo(void);

int main (void)
{
    for (int j = 0; j < 1000; j++)
    {
        a = 1;
        b = 2;
        c = 3;
        for (int i = 0; i < 100000; i++)
        {
            callbackWasterOne();
            callbackWasterTwo();
        }
        printf("Two a: %d, b: %d, c: %d.", a, b, c);
    }
    return 0;
}
void callbackWasterOne(void)
{
    a = b * c;
}
void callbackWasterTwo(void)
{
    b = a * c;
}
All it does is call two very basic functions that just multiply numbers together. Since the code is identical, I expect the profiler (oprofile) to report roughly the same numbers for both.
I run this code 10 times per profile, and I got the following values for how much time is spent in each function:
main: average = 5.60%, stdev = 0.10%
callbackWasterOne = 43.78%, stdev = 1.04%
callbackWasterTwo = 50.24%, stdev = 0.98%
rest is in miscellaneous things like printf and no-vmlinux
The difference between the time for callbackWasterOne and callbackWasterTwo is significant enough (to me at least) given that they have the same code, that I switched their order in my code and reran the profiler with the following results now:
main: average = 5.45%, stdev = 0.40%
callbackWasterOne = 50.69%, stdev = 0.49%
callbackWasterTwo = 43.54%, stdev = 0.18%
rest is in miscellaneous things like printf and no-vmlinux
So evidently the profiler samples one more than the other based on the execution order. Not good. Disregarding this, I decided to see the effects of removing some code and I got this for execution times (averages):
Nothing removed: 0.5295s
call to callbackWasterOne() removed from for loop: 0.2075s
call to callbackWasterTwo() removed from for loop: 0.2042s
remove both calls from for loop: 0.1903s
remove both calls and the for loop: 0.0025s
remove contents of callbackWasterOne: 0.379s
remove contents of callbackWasterTwo: 0.378s
remove contents of both: 0.382s
So here is what I'm having trouble understanding:
When I remove just one of the calls from the for loop, the execution time drops by ~60%, which is greater than the time spent by that one function + the main in the first place! How is this possible?
Why is the effect of removing both calls from the loop so small compared to removing just one? I can't figure out this non-linearity. I understand that the for loop is expensive, but in that case (if most of the remaining time can be attributed to the for loop that performs the function calls), why would removing one of the calls cause such a large improvement in the first place?
I looked at the disassembly and the two functions are the same in code. The calls to them are the same, and removing the call simply deletes the one call line.
Other info that might be relevant
I'm using Ubuntu 14.04LTS
The code is compiled by Eclipse with no optimization (-O0)
I time the code by running it in terminal using "time"
I use OProfile with count = 10000 and 10 repetitions.
Here are the results from when I do this with -O1 optimization:
main: avg = 5.89%, stdev = 0.14%
callbackWasterOne: avg = 44.28%, stdev = 2.64%
callbackWasterTwo: avg = 49.66%, stdev = 2.54% (greater than before)
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.522s
Remove callbackWasterOne call: 0.149s (71.47% decrease)
Remove callbackWasterTwo call: 0.123s (76.45% decrease)
Remove both calls: 0.0365s (93.01% decrease) (what I would expect given the profile data just above)
So removing one call now is much better than before, and removing both still carries a benefit (probably because the optimizer understands that nothing happens in the loop). Still, removing one is much more beneficial than I would have anticipated.
Results of the two functions using different variables:
I defined 3 more variables for callbackWasterTwo() to use instead of reusing same ones. Now the results are what I would have expected.
main: avg = 10.87%, stdev = 0.19% (average is greater, but maybe due to those new variables)
callbackWasterOne: avg = 46.08%, stdev = 0.53%
callbackWasterTwo: avg = 42.82%, stdev = 0.62%
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.520s
Remove callbackWasterOne call: 0.292s (43.83% decrease)
Remove callbackWasterTwo call: 0.291s (44.07% decrease)
Remove both calls: 0.065s (87.55% decrease)
So now removing both calls is pretty much equivalent (within stdev) to removing one call + the other.
Since the result of removing either function is pretty much the same (43.83% vs 44.07%), I am going to go out on a limb and say that perhaps the profiler data (46% vs 42%) is still skewed. Perhaps it is the way it samples (going to vary the counter value next and see what happens).
It appears that the success of optimization relates pretty strongly to the code reuse fraction. The only way to achieve "exactly" (you know what I mean) the speedup noted by the profiler is to optimize on completely independent code. Anyway this is all interesting.
I am still looking for some explanation ideas for the 70% decrease in the -O1 case though...
I did this with 10 functions (different formulas in each, but using some combination of 6 different variables, 3 at a time, all multiplication):
These results are disappointing to say the least. I know the functions are identical, and yet the profiler indicates that some take significantly longer. No matter which one I remove (a "fast" or a "slow" one), the results are the same ;) So this leaves me to wonder: how many people are relying on the profiler and being pointed at the wrong areas of code to fix? If I unknowingly saw these results, what could possibly tell me to go fix the 5% function rather than the 20% one (even though they are exactly the same)? What if the 5% one was much easier to fix, with a large potential benefit? And of course, this profiler might just not be very good, but it is popular! People use it!
Here is a screenshot; I don't feel like typing it all in again.
My conclusion: I am overall rather disappointed with Oprofile. I decided to try out callgrind (valgrind) through the command line on the same code and it gave me far more reasonable results. In fact, the results were very reasonable (all functions spent roughly the same amount of time executing). I think Callgrind collects far more data than Oprofile ever did.
Callgrind will still not explain the difference in improvement when a function is removed, but at least it gives the correct baseline information...
Ah, I see you did look at the assembly. This question is indeed interesting in its own right, but in general there's no point profiling unoptimized code, since there's so much boilerplate that could easily be reduced even in -O1.
If it's really only the call that's missing, then that could explain the timing differences. There is a lot of boilerplate in the -O0 stack manipulation code (any caller-saved registers have to be pushed onto the stack, and any arguments too; afterwards the return value has to be handled and the opposite stack manipulation has to be done). This contributes to the time it takes to call the functions, but is not necessarily attributed entirely to the functions themselves by oprofile, since that code is executed before/after the function is actually called.
I suspect the reason the second function seems to always take less time is that there's less (or no) stack juggling that needs to be done -- the parameter values are already on the stack thanks to the previous function call, and so, as you've seen, only the call to the function has to be executed, without any other extra work.

How to get an objective evaluation of the execution time of a C++ code snippet?

I am following this post, How to Calculate Execution Time of a Code Snippet in C++, and a nice solution is given there for calculating the execution time of a code snippet. However, when I use this solution to measure the execution time of my code snippet on Linux, I found that every time I run the program, the execution time given by the solution is different. So my question is how I can get an objective evaluation of the execution time. The objective evaluation is important to me because I use the following scheme to evaluate different implementations of the same task:
// int64 and GetTimeMs64() come from the linked answer.
int main()
{
    int64 begin, end;
    begin = GetTimeMs64();
    execute_my_codes_method1();
    end = GetTimeMs64();
    std::cout << "Execution time is " << end - begin << std::endl;
    return 0;
}
First, I run the above code to get the execution time for the first method. After that, I change the above code to invoke execute_my_codes_method2() and get the execution time for the second method.
int main()
{
    int64 begin, end;
    begin = GetTimeMs64();
    execute_my_codes_method2(); // instead of execute_my_codes_method1();
    end = GetTimeMs64();
    std::cout << "Execution time is " << end - begin << std::endl;
    return 0;
}
By comparing the different execution time I expect to compare the efficiency of these two different implementations.
The reason I change the code and run the different implementations separately is that it is very difficult to call them sequentially in one program. But if running the same program at different times leads to different execution times, then comparing different implementations using the measured execution time is meaningless. Any suggestions on this problem? Thanks.
Measuring a single call's execution time is pretty useless for judging any performance improvements. There are too many factors that influence the actual execution time of a function. If you are measuring timing, you should make many calls to the function, measure the total time, and compute a statistical average of the measured execution times:
int main() {
    int64 begin = 0, end = 0;
    begin = GetTimeMs64();
    for (int i = 0; i < 10000; ++i) {
        execute_my_codes_method1();
    }
    end = GetTimeMs64();
    std::cout << "Average execution time is " << (end - begin) / 10000 << std::endl;
    return 0;
}
Additionally, having unit tests for your functions up front (using a decent testing framework such as Google Test) will make quick judgments like the one you describe a lot quicker and easier.
Not only can you determine how often the test cases should be run (to gather the statistical data for the average time calculation), the unit tests can also prove that the desired/existing functionality and input/output consistency weren't broken by an alternate implementation.
As an extra benefit (since you mentioned difficulties running the two functions in question sequentially), most of those unit test frameworks let you define SetUp() and TearDown() methods that are executed before/after running a test case. Thus you can easily provide a consistent state of preconditions or invariants for every single test case run.
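As a rough illustration of the fixture idea, here is a minimal sketch. It assumes Google Test is available and linked together with gtest_main; the fixture name, the iteration count, and the stub body of execute_my_codes_method1() are placeholders, not anything from your code:
#include <chrono>
#include <iostream>
#include <gtest/gtest.h>

// Stub standing in for the real implementation, for illustration only.
void execute_my_codes_method1() {
    volatile double x = 0;
    for (int i = 0; i < 1000; ++i) x = x + i;
}

class MethodTiming : public ::testing::Test {
protected:
    void SetUp() override    { /* put preconditions/invariants into a known state */ }
    void TearDown() override { /* clean up after each run */ }
};

TEST_F(MethodTiming, Method1AverageTime) {
    const int runs = 10000;
    const auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
        execute_my_codes_method1();
    const auto end = std::chrono::steady_clock::now();
    std::cout << "Average: "
              << std::chrono::duration<double, std::milli>(end - begin).count() / runs
              << " ms per call\n";
}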
As a further option, instead of measuring and gathering the statistical data yourself, you can use profiling tools that work via code instrumentation. A good example is GNU gprof: it gathers information about how often every underlying function was called and how much time the execution took. This data can be analyzed later with the tool to find potential bottlenecks in your implementations.
Additionally, if you decide to provide unit tests in the future, you may want to ensure that all of your code paths for the various input data situations are covered well by your test cases. A very good tool for this is GCC's gcov instrumentation. To analyze the gathered code-coverage information you can use lcov, which visualizes the results quite nicely and comprehensively.

Is there a way to figure out the top callers of a C function?

Say I have a function that is called a LOT from many different places, and I would like to find out who calls this function the most. For example, the top 5 callers, or whoever calls this function more than N times.
I am using AS3 Linux, gcc 3.4.
For now I just put a breakpoint and then stop there after every 300 times, thus brute-forcing it...
Does anyone know of tools that can help me?
Thanks
Compile with the -pg option, run the program for a while, and then use gprof. Running a program compiled with the -pg option will generate a gmon.out file with the execution profile. gprof can read this file and present it in readable form.
I wrote a call logging example just for fun. A macro replaces the function call with an instrumented one.
#include <stdio.h>

int funcA( int a, int b ){ return a+b; }

// instrumentation
void call_log(const char*file,const char*function,const int line,const char*args){
    printf("file:%s line: %i function: %s args: %s\n",file,line,function,args);
}

#define funcA(...) \
    (call_log(__FILE__, __FUNCTION__, __LINE__, "" #__VA_ARGS__), funcA(__VA_ARGS__))

// testing
void funcB(void){
    funcA(7,8);
}

int main(void){
    int x = funcA(1,2)+
            funcA(3,4);
    printf( "x: %i (==10)\n", x );
    funcA(5,6);
    funcB();
    return 0;
}
Output:
file:main.c line: 22 function: main args: 1,2
file:main.c line: 24 function: main args: 3,4
x: 10 (==10)
file:main.c line: 28 function: main args: 5,6
file:main.c line: 17 function: funcB args: 7,8
Profiling helps.
Since you mentioned oprofile in another comment, I'll say that oprofile supports generating callgraphs on profiled programs.
See http://oprofile.sourceforge.net/doc/opreport.html#opreport-callgraph for more details.
It's worth noting this is definitely not as clear as the caller profile you may get from gprof or another profiler, as the numbers it reports are the number of times oprofile collected a sample in which X is the caller for a given function, not the number of times X called a given function. But this should be sufficient to figure out the top callers of a given function.
A somewhat cumbersome method, but not requiring additional tools:
#define COUNTED_CALL( fn, ... ) \
    ( fprintf( call_log_fp, "%s->%s\n", __FUNCTION__, #fn ), \
      (fn)(__VA_ARGS__) )   /* comma expression, so the call's return value is preserved */
Then all calls written like:
int input_available = COUNTED_CALL( scanf, "%s", &instring ) ;
will be logged to the file associated with call_log_fp (a global FILE* which you must have initialised). The log for the above would look like:
main->scanf
You can then process that log file to extract the data you need. You could even write your own code to do the instrumentation which would make it perhaps less cumbersome.
Might be a bit ambiguous for C++ class member functions though. I am not sure if there is a __CLASS__ macro.
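If you'd rather not post-process that log with shell tools, a few lines of C++ are enough. This is just a sketch assuming the "caller->callee" line format produced by the macro above; the program name and output format are my own choices:
#include <algorithm>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Reads lines of the form "caller->callee" and prints them ranked by call count.
int main(int argc, char* argv[])
{
    if (argc < 2) { std::cerr << "usage: top_callers <call.log>\n"; return 1; }

    std::map<std::string, long> counts;   // "caller->callee" -> number of calls
    std::ifstream in(argv[1]);
    for (std::string line; std::getline(in, line); )
        ++counts[line];

    // Sort by descending call count.
    std::vector<std::pair<std::string, long>> sorted(counts.begin(), counts.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });

    for (const auto& entry : sorted)
        std::cout << entry.second << "  " << entry.first << "\n";
    return 0;
}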
In addition to the aforementioned gprof profiler, you may also try the gcov code-coverage tool. Information on compiling for and using both should be included in the gcc manual.
Once again, stack sampling to the rescue! Just take a bunch of "stackshots", as many as you like. Discard any samples where your function (call it F) is not somewhere on the stack. (If you're discarding most of them, then F is not a performance problem.)
On each remaining sample, locate the call to F, and see what function (call it G) that call is in. If F is recursive (it appears more than once on the sample) only use the topmost call.
Rank your Gs by how many stacks each one appears in.
If you don't want to do this by hand, you could make a simple tool or script. You don't need a zillion samples. 20 or so will give you reasonably good information.
By the way, if what you're really trying to do is find performance problems, you don't actually need to do all that discarding and ranking. In fact - don't discard the exact locations of the call instruction inside each G. Those can actually tell you a good bit more than just the fact that they were somewhere inside G.
P.S. This is all based on the assumption that when you say "calls it the most" you mean "spends the most wall clock time in calling it", not "calls it the greatest number of times". If you are interested in performance, fraction of wall clock time is more useful than invocation count.

How to correctly benchmark a [templated] C++ program

<background>
I'm at a point where I really need to optimize C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but I then used virtual functions called in nested loops. I had bad feelings about that and the first implementation proved that this was a bad idea. However this was OK for testing the concept.
</background>
Now I need this feature to be as fast as possible (well without assembly code or GPU calculation, this still has to be C++ and more readable than less).
Now I know a little bit more about templates and class policies (from Alexandrescu's excellent book) and I think that a compile-time code generation may be the solution.
However I need to test the design before doing the huge work of implementing it into the library. The question is about the best way to test the efficiency of this new feature.
Obviously I need to turn optimizations on because without this g++ (and probably other compilers as well) would keep some unnecessary operations in the object code. I also need to make a heavy use of the new feature in the benchmark because a delta of 1e-3 second can make the difference between a good and a bad design (this feature will be called million times in the real program).
The problem is that g++ is sometimes "too smart" while optimizing and can remove a whole loop if it considers that the result of a calculation is never used. I've already seen that once when looking at the output assembly code.
If I add some printing to stdout, the compiler will then be forced to do the calculation in the loop but I will probably mostly benchmark the iostream implementation.
So how can I do a correct benchmark of a little feature extracted from a library?
Related question: is it a correct approach to do this kind of in vitro test on a small unit, or do I need the whole context?
Thanks for the advice!
There seem to be several strategies, from compiler-specific options allowing fine tuning to more general solutions that should work with every compiler, like volatile or extern.
I think I will try all of these.
Thanks a lot for all your answers!
If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.
template<typename T> void sink(T const& t) {
volatile T sinkhole = t;
}
No iostream overhead, just a copy that has to remain in the generated code.
Now, if you're collecting results from a lot of operations, it's best not to discard them one by one, since these copies can still add some overhead. Instead, somehow collect all results in a single non-volatile object (so all individual results are needed) and then assign that result object to a volatile. E.g. if your individual operations all produce strings, you can force evaluation by adding all char values together modulo 1<<32. This adds hardly any overhead; the strings will likely be in cache. The result of the addition will subsequently be assigned to a volatile, so each char in each string must in fact be calculated, no shortcuts allowed.
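A small sketch of that pattern, reusing the sink() shown above; computeSomething is a made-up stand-in for the real operations:
#include <cstdint>
#include <string>

template<typename T> void sink(T const& t) {
    volatile T sinkhole = t;
    (void)sinkhole;   // silence unused-variable warnings
}

// Hypothetical operation under test; stands in for the real library call.
std::string computeSomething(int i) { return std::to_string(i * i); }

int main() {
    std::uint32_t checksum = 0;
    for (int i = 0; i < 100000; ++i) {
        const std::string s = computeSomething(i);
        for (char c : s)
            checksum += static_cast<unsigned char>(c);  // modulo 2^32 by wrap-around
    }
    sink(checksum);  // forces every individual result to actually be computed
}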
Unless you have a really aggressive compiler (can happen), I'd suggest calculating a checksum (simply add all the results together) and output the checksum.
Other than that, you might want to look at the generated assembly code before running any benchmarks so you can visually verify that any loops are actually being run.
Compilers are only allowed to eliminate code branches that cannot happen. As long as the compiler cannot rule out that a branch will be executed, it will not eliminate it. As long as there is some data dependency somewhere, the code will be there and will be run. Compilers are not too smart about estimating which aspects of a program will not be run and don't try to be, because that is, in general, an intractable problem. They have some simple checks such as for if (0), but that's about it.
My humble opinion is that you were possibly hit by some other problem earlier on, such as the way C/C++ evaluates boolean expressions.
But anyway, since this is about a test of speed, you can check for yourself that things get called: run it once without, then another time with a check of the return values, or with a static variable being incremented. At the end of the test, print out the number generated; the results will be equal.
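A minimal sketch of the static-counter check; the function and its body are made up, and the counter costs next to nothing compared to a real workload:
#include <iostream>

static long g_calls = 0;  // incremented on every call, printed at the end

double featureUnderTest(double x)   // hypothetical stand-in for the real routine
{
    ++g_calls;
    return x * 1.0000001 + 0.5;
}

int main()
{
    double acc = 0.0;
    for (int i = 0; i < 1000000; ++i)
        acc += featureUnderTest(i);

    // If the loop had been optimised away, both numbers would be wrong.
    std::cout << "calls: " << g_calls << ", result: " << acc << "\n";
}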
To answer your question about in-vitro testing: yes, do that. If your app is so time-critical, do that. On the other hand, your description hints at a different problem: if your deltas are in a timeframe of 1e-3 seconds, then that sounds like a problem of computational complexity, since the method in question must be called very, very often (for a few runs, 1e-3 seconds is negligible).
The problem domain you are modeling sounds VERY complex and the datasets are probably huge. Such things are always an interesting effort. Make sure that you absolutely have the right data structures and algorithms first, though, and micro-optimize all you want after that. So, I'd say look at the whole context first. ;-)
Out of curiosity, what is the problem you are calculating?
You have a lot of control on the optimizations for your compilation. -O1, -O2, and so on are just aliases for a bunch of switches.
From the man pages
-O2 turns on all optimization flags specified by -O. It also turns on the following optimization flags:
-fthread-jumps -falign-functions -falign-jumps -falign-loops -falign-labels -fcaller-saves
-fcrossjumping -fcse-follow-jumps -fcse-skip-blocks -fdelete-null-pointer-checks
-fexpensive-optimizations -fgcse -fgcse-lm -foptimize-sibling-calls -fpeephole2 -fregmove
-freorder-blocks -freorder-functions -frerun-cse-after-loop -fsched-interblock -fsched-spec
-fschedule-insns -fschedule-insns2 -fstrict-aliasing -fstrict-overflow -ftree-pre -ftree-vrp
You can tweak and use this command to help you narrow down which options to investigate.
...
Alternatively you can discover which binary optimizations are
enabled by -O3 by using:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
Once you find the culprit optimization you shouldn't need the couts.
If this is possible for you, you might try splitting your code into:
the library you want to test compiled with all optimizations turned on
a test program, dynamically linking the library, with optimizations turned off
Otherwise, you might specify a different optimization level (it looks like you're using gcc...) for the test function with the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes).
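With GCC the attribute approach could look roughly like the following sketch; heavy_computation and run_benchmark are placeholder names, and per-function optimize attributes have some caveats, so check the documentation for your GCC version:
#include <iostream>

// Stand-in for the library routine under test (it would normally live in the
// separately compiled, fully optimized library).
double heavy_computation(double x) { return x * x + 1.0; }

// Keep the driver itself un-optimized so the benchmark loop is not folded away.
__attribute__((optimize("O0")))
double run_benchmark()
{
    double sum = 0.0;
    for (int i = 0; i < 1000000; ++i)
        sum += heavy_computation(i);
    return sum;   // returned so the result is observably used
}

int main()
{
    std::cout << run_benchmark() << "\n";
}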
You could create a dummy function in a separate cpp file that does nothing, but takes as argument whatever is the type of your calculation result. Then you can call that function with the results of your calculation, forcing gcc to generate the intermediate code, and the only penalty is the cost of invoking a function (which shouldn't skew your results unless you call it a lot!).
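A minimal sketch of that idea; the file and function names are made up, and what matters is that the definition lives in a separately compiled translation unit, so the compiler cannot see that consume() does nothing:
// consume.cpp -- compiled separately, ideally without link-time optimization
void consume(double) { /* intentionally empty */ }

// benchmark.cpp
void consume(double);   // the compiler only sees the declaration here

int main()
{
    double result = 0.0;
    for (int i = 0; i < 1000000; ++i)
        result += i * 0.5;   // the computation being benchmarked
    consume(result);         // forces the loop's result to be produced
}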
#include <iostream>
// Mark coords as extern.
// The compiler is now NOT allowed to optimise away coords,
// thus it cannot remove the loop where you initialise it.
// This is because the array could be used by another compilation unit.
extern double coords[500][3];
double coords[500][3];
int main()
{
//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
std::cout << "hello world !"<< std::endl;
return 0;
}
edit: the easiest thing you can do is simply use the data in some spurious way after the function has run and outside your benchmarks. Like,
StartBenchmarking(); // ie, read a performance counter
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
StopBenchmarking(); // what comes after this won't go into the timer
// this is just to force the compiler to use coords
double foo = 0.0;
for (int j = 0 ; j < 500 ; ++j )
{
foo += coords[j][0] + coords[j][1] + coords[j][2];
}
cout << foo;
What sometimes works for me in these cases is to hide the in vitro test inside a function and pass the benchmark data sets through volatile pointers. This tells the compiler that it must not collapse subsequent writes to those pointers (because they might be eg memory-mapped I/O). So,
void test1( volatile double *coords )
{
//perform a simple initialization of all coordinates:
for (int i=0; i<1500; i+=3)
{
coords[i+0] = 3.23;
coords[i+1] = 1.345;
coords[i+2] = 123.998;
}
}
For some reason I haven't figured out yet, it doesn't always work in MSVC, but it often does -- look at the assembly output to be sure. Also remember that volatile will foil some compiler optimizations (it forbids the compiler from keeping the pointer's contents in a register and forces writes to occur in program order), so this is only trustworthy if you're using it for the final write-out of data.
In general in vitro testing like this is very useful so long as you remember that it is not the whole story. I usually test my new math routines in isolation like this so that I can quickly iterate on just the cache and pipeline characteristics of my algorithm on consistent data.
The difference between test-tube profiling like this and running it in "the real world" means you will get wildly varying input data sets (sometimes best case, sometimes worst case, sometimes pathological), the cache will be in some unknown state on entering the function, and you may have other threads banging on the bus; so you should run some benchmarks on this function in vivo as well when you are finished.
I don't know if GCC has a similar feature, but with VC++ you can use:
#pragma optimize
to selectively turn optimizations on/off. If GCC has similar capabilities, you could build with full optimization and just turn it off where necessary to make sure your code gets called.
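GCC does in fact have a roughly similar mechanism in newer versions: #pragma GCC optimize, usable together with push_options/pop_options. A minimal sketch, where benchmark_loop is just a placeholder name:
#include <iostream>

#pragma GCC push_options
#pragma GCC optimize ("O0")    // the benchmark driver is compiled without optimization

double benchmark_loop()
{
    double sum = 0.0;
    for (int i = 0; i < 1000000; ++i)
        sum += i * 0.5;
    return sum;
}

#pragma GCC pop_options        // everything after this point is optimized normally

int main()
{
    std::cout << benchmark_loop() << "\n";
}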
Just a small example of an unwanted optimization:
#include <vector>
#include <iostream>
using namespace std;
int main()
{
double coords[500][3];
//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
cout << "hello world !"<< endl;
return 0;
}
If you comment out the code from "double coords[500][3]" to the end of the for loop, it will generate exactly the same assembly code (just tried with g++ 4.3.2). I know this example is far too simple, and I wasn't able to show this behavior with a std::vector of a simple "Coordinates" structure.
However I think this example still shows that some optimizations can introduce errors in the benchmark and I wanted to avoid some surprises of this kind when introducing new code in a library. It's easy to imagine that the new context might prevent some optimizations and lead to a very inefficient library.
The same should also apply to virtual functions (but I don't prove it here). In a context where a static call would do the job, I'm pretty confident that decent compilers can eliminate the extra indirection of the virtual call. I could test such a call in a loop and conclude that calling a virtual function is not such a big deal.
Then I'd call it hundreds of thousands of times in a context where the compiler cannot guess the exact type of the pointer, and get a 20% increase in running time...
At startup, read from a file. In your code, write something like if (input == "x") cout << result_of_benchmark;
The compiler will not be able to eliminate the calculation, and if you ensure the input is not "x", you won't benchmark the iostream.
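A quick sketch of that trick; the file name and variable names are arbitrary:
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream flagFile("flag.txt");
    std::string input;
    flagFile >> input;   // the compiler cannot know what this file will contain

    double result_of_benchmark = 0.0;
    for (int i = 0; i < 1000000; ++i)
        result_of_benchmark += i * 0.5;   // the computation being benchmarked

    // Never true in practice (as long as flag.txt doesn't contain "x"),
    // but the compiler has to keep the calculation alive just in case.
    if (input == "x")
        std::cout << result_of_benchmark;
}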