Do variables affect performance?

I am using C++ with Qt 5.6. I have a simple console application written in two styles, as follows:
//First style
#include <QString>
#include <QTextStream>
QString x = "Hi!";
void func()
{
QTextStream(stdout) << x;
}
int main()
{
while (true)
{
func();
}
}
//Second style
#include <QTextStream>
void func()
{
QTextStream(stdout) << "Hi!";
}
int main()
{
while (true)
{
func();
}
}
Which style will stress the CPU more and therefore perform worse? There might not be a big difference here, but consider applying this at large scale, such as a server where a connection is made every 2 seconds. That creates a situation similar to the loop above, with multiple variables (though not the same variable and data), where slightly lower resource usage could add up to a large performance improvement. So does using a variable give any performance improvement, given that I will use the variable only once in my function but the function will be called repeatedly? Or does using a variable slow the program down, because it has to repeatedly look up where the value of "x" is stored in RAM and then retrieve the data?
Edit 1:
I will not be using the variable again in my code, and we can assume that there are no compiler optimizations. @DrDonut, the answer in the link you gave doesn't say whether $array === (array) $array is faster than is_array($array), i.e. whether it is a micro-optimization. In the same way, I am asking whether the second style is a micro-optimization or whether it harms performance.

Your example is a poor one, both because of possible compiler optimizations and because it is not clear whether you will use this variable in different places or whether it is just test code that will be thrown away.
More generally, you are going about optimization the wrong way. There is no sense in optimizing a single variable or a single function. You should not guess where your program will spend its time; first write your program so that it works and looks OK.
Once the program works, if you find its performance is bad, search for bottlenecks: the places where the program spends a lot of time. They can be found with the help of a profiler or a debugger, not by guessing.
Once you have found them, optimize those critical places.
Read about premature optimization.

Related

Visual studio: how to make C++ consoleApplication use more of CPU power?

I am running a console project, but while the code is running I see in Task Manager that only 5% of the CPU (2.8 GHz) is being used. Of course, I am not exactly sure how the CPU distributes processing power in Windows to begin with. But for future reference, I would like to know: if I had performance-demanding code and needed the answer faster, how would I do that?
Here is the code, if you would like to see it:
#include "stdafx.h"
#include <iostream>
#include <string.h>
using namespace std;
void swap(char *x, char *y)
{
char temp;
temp = *x;
*x = *y;
*y = temp;
}
void permute(char *a, int l, int r)
{
int i;
if (l == r)
cout << a << endl;
else
{
for (i = l; i <= r; i++)
{
swap((a + l), (a + i));
permute(a, l + 1, r);
swap((a + l), (a + i));
}
}
}
int main()
{
char Short[] = "ABCD";
int n1 = strlen(Short);
char Long[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
int n2 = strlen(Long);
while(true)
{
cout << "Would you like to see the permutions of only a) ABCD or b) the whole alphabet?!\n(please enter a or b): ";
char input;
cin >> input;
if (input == 'a')
{
cout << "The permutions of ABCD:\n";
permute(Short, 0, n1 - 1);
cout << "-----------------------------------";
}
else if (input == 'b')
{
cout << "The permutions of Alphabet:\n";
permute(Long, 0, n2 - 1);
cout << "-----------------------------------";
}
else
{
cout << "ERROR! : Enter either a or b.\n";
}
}
}
I found the code on a blog; it shows the permutations of "ABCD" as part of an assignment, but I also used it for the entire alphabet. For that use, is there a way to make the code use more CPU? (It's taking a much longer time than I expected.)

Learning to optimize code efficiently is a major challenge even for experienced coders, and there are volumes of books, articles, and presentations on the topic. As such, a complete treatment is well out of scope for a Stack Overflow question.
That said, here are a few principles:
Focus initially on the algorithm. You can write a messy bubble sort or an efficient one, but in most 'real world' cases quicksort will beat either handily. This is arguably the primary reason the field of computer science exists: the study and selection of algorithms and their theoretical performance.
Related to this, make sure you are comparing your implementation against a 'stock' algorithm when possible. For example, you should see how your implementation performs compared to using std::next_permutation from the <algorithm> header.
Optimize the compiler settings first. Debug builds are never going to be fast, and they aren't supposed to be. Using inline can help, but only happens if the compiler is actually doing inline optimizations. For Visual C++, there are a number of different optimization settings you can try out, but remember that there are tradeoffs so /Ox (maximum optimization) may not always be the right choice, which is why most templates default to /O2 (maximize speed). In some cases, /O1 (minimize space) is actually better.
Always measure performance before and after optimization. Modern out-of-order CPUs are sophisticated systems, and they don't always do what you think they are doing. In many cases, what is a textbook optimization in code actually performs worse than the original code due to various pipelining and microarchitecture effects. The only way to know for sure is to use a good profiler, have solid test cases, and measure the impact of any optimization work. If it's slower on average than before, then revert to the 'unoptimized' version and try something else.
Focus optimization on the hotspots. This is the so-called '80/20' rule. In many applications the vast majority of the code is run rarely, so only a few areas of your application are actually spending enough time running to be worth optimizing.
As a corollary to this rule, having all of your code using extremely inefficient anti-patterns can really hurt the baseline performance of your entire application. For this reason, it's worth knowing how to write good code generally. The point of the 80/20 rule is to spend your limited time optimizing on the areas that will have the most impact rather than what you as the programmer assume matters.
All that said, in your case none of this matters. The vast majority of the CPU time is spent just creating your process and handling the serialized input and output. When dealing with an n of 4 or 26, it doesn't matter how bad or good your algorithm is. In other words, it is highly unlikely permute is your program's 'hotspot' unless you are working with tens of thousands or millions of characters.
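The "always measure before and after" principle above can be sketched with a small timing helper. This is only an illustration that complements a real profiler rather than replacing one; the name average_ms and its parameters are made up for this example.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` `repeats` times and returns the average
// wall-clock time per run in milliseconds.
template <typename F>
double average_ms(F&& work, int repeats) {
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}
```

Calling it once with the original code and once with the "optimized" code, in an optimized build and on the same input, gives a first indication of whether the change helped at all.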

NOTE: Yes I am oversimplifying the topic, but I'm concerned that without this basic understanding, the more advanced topics will actually lead to some disastrous program designs.
Maybe I'm missing something, but there also seems to be a misunderstanding regarding the link between CPU and efficiency in your mind.
Your program has N instructions, and the CPU will process those N instructions at roughly the same speed (a 3.56 GHz clock ticks about 3.56 billion times per second, which is very roughly the instruction rate). That's the same (more or less), whether you're getting "5%" or "25%" use of the CPU from a single program. (I'll explain that percentage in a moment.)
The only way to get "faster" in terms of processor usage, as erip said, is with parallel computing techniques, which in a nutshell employ multiple CPUs to accomplish the task.
If you think of it like an assembly line, your one worker can only process one widget at a time. If your batch of widgets to him takes up 5% of his time, that means that in order to process ALL of your widgets one-by-one, he uses 5% of his time, and the other 95% is not needed for that batch (and he'll probably use it for some other batches other people assigned him.)
He cannot process more than one widget at a time, so that's as fast as he'll get with your batch. You might be able to make things appear faster by having him alternate between two different types of widgets, instead of finishing all of batch A before starting on batch B, but it will still take the same amount of time in the end to process both batches.
MASSIVE EXCEPTION: If he's spending 100% of his time on someone else's batch of widgets, you're literally going to have to cool your heels. That's not something you can do a thing about.
However, if you add another worker to that assembly line, they can process twice (roughly) the widgets in the same amount of time, because you are processing two widgets at once. When we say you have a "quad core processor", that basically means that you have four workers available (literally 4 CPUs). Each one can only process a single instruction at once, but by assigning more than one to the batch of widgets, you get it done faster.
All of this said, one must keep in mind that those CPUs are doing a lot - they run the entire computer. You want to try and keep those percentages down as much as possible, so your program is fast and responsive on any supported computer. Not all of your users will have 3.46 GHz quad-core machines, after all.

Surely the reason this program is not using all available CPU bandwidth is because it's emitting the permutation results to the screen once for each permutation. This will result in blocking I/O within the implementation of cout.
If you want 100% cpu use you'll want to separate computation from I/O. In this case you'd then need to either:
a) store the results for later output, or
b) communicate results across a thread boundary (which will itself have an efficiency cost because of the cost of acquiring mutexes and synchronising cache memory), or
c) a combination of the above (batching results and communicating them across the thread boundary)
For a quick check, you could comment out all the cout calls and see how much CPU use you get (as mentioned, it will be close to 100% divided by the number of CPUs on your computer).
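Option (a) can be sketched roughly as follows. This is an illustration only, not the code from the question: the helper names collect_permutations and print_permutations are made up, and the sketch uses std::next_permutation instead of the recursive permute. All results are appended to one in-memory string, which is then written with a single output call.

```cpp
#include <algorithm>
#include <iostream>
#include <string>

// Hypothetical helper: builds every permutation of `letters` into one buffer
// (one permutation per line) instead of printing each one as it is produced.
std::string collect_permutations(std::string letters) {
    std::sort(letters.begin(), letters.end());  // start from the smallest ordering
    std::string buffer;
    do {
        buffer += letters;
        buffer += '\n';
    } while (std::next_permutation(letters.begin(), letters.end()));
    return buffer;
}

// Writes the whole buffer with a single output call.
void print_permutations(const std::string& letters) {
    std::cout << collect_permutations(letters);
}
```

This only works while all the results fit in memory, so it is reasonable for "ABCD" but not for the full 26-letter alphabet, where batching (option c) would be needed.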

2 while loops vs if else statement in 1 while loop

First a little introduction:
I'm a novice C++ programmer (I'm new to programming) writing a little multiplication tables practising program. The project started as a small program to teach myself the basics of programming and I keep adding new features as I learn more and more about programming. At first it just had basics like ask for input, loops and if-else statements. But now it uses vectors, read and writes to files, creates a directory etc.
You can see the code here: Project on Bitbucket
My program is now going to have 2 modes: practise a single multiplication table that the user can choose himself, or practise all multiplication tables mixed. Both modes work quite differently internally. I developed the mixed mode as a separate program because that eased development: I could just focus on writing the code itself instead of also worrying about how to integrate it into the existing code.
Below the code of the currently separate mixed mode program:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <algorithm>
#include <cstdlib> // srand, rand
#include <ctime> // time
using namespace std;
// Prints the next question and returns its correct answer.
// The vectors are passed by const reference to avoid copying them on every call.
int newquestion(const vector<int>& remaining_multiplication_tables, const vector<int>& multiplication_tables, int table_selecter){
cout << remaining_multiplication_tables[table_selecter] << " * " << multiplication_tables[remaining_multiplication_tables[table_selecter]-1]<< " =" << "\n";
return remaining_multiplication_tables[table_selecter] * multiplication_tables[remaining_multiplication_tables[table_selecter]-1];
}
int main(){
int usersanswer_int;
int cpu_answer;
int table_selecter;
string usersanswer;
vector<int> remaining_multiplication_tables = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
vector<int> multiplication_tables(10, 1);//fill vector with 10 elements that contain the value '1'. This vector will store the "progress" of each multiplication_table.
srand(time(0));
table_selecter = rand() % remaining_multiplication_tables.size();
cpu_answer = newquestion(remaining_multiplication_tables, multiplication_tables, table_selecter);
while(remaining_multiplication_tables.size() != 0){
getline(cin, usersanswer);
stringstream usersanswer_stream(usersanswer);
usersanswer_stream >> usersanswer_int;
if(usersanswer_int == cpu_answer){
cout << "Your answer is correct! :)" << "\n";
if(multiplication_tables[remaining_multiplication_tables[table_selecter]-1] == 10){
remaining_multiplication_tables.erase(remaining_multiplication_tables.begin() + table_selecter);
}
else{
multiplication_tables[remaining_multiplication_tables[table_selecter]-1] +=1;
}
if (remaining_multiplication_tables.size() != 0){
table_selecter = rand() % remaining_multiplication_tables.size();
cpu_answer = newquestion(remaining_multiplication_tables, multiplication_tables, table_selecter);
}
}
else{
cout << "Unfortunately your answer isn't correct! :(" << "\n";
}
}
return 0;
}
As you can see the newquestion function for the mixed mode is quite different. Also the while loop includes other mixed mode specific code.
Now if I want to integrate the mixed multiplication tables mode into the existing main program I have 2 choices:
-I can clutter up the while loop with if-else statements to check each time the loop runs whether mode == 10 (single multiplication table mode) or mode == 100 (mixed multiplication tables mode). And also place an if-else statement in the newquestion() function to check if mode == 10 or mode == 100
-I can let the program check on startup whether the user chose single multiplication table or mixed multiplication tables mode and create 2 while loops and 2 newquestion() functions. That would look like this:
int newquestion_mixed(){
//newquestion function for mixed mode
}
int newquestion_single(){
//newquestion function for single mode
}
//initialization
if (mode == 10) {
//create necessary variables for single mode
while (/* questions remain */) {
//single mode loop
}
}
else {
//create necessary variables for mixed mode
while (/* questions remain */) {
//mixed mode loop
}
}
Now why would I bother creating 2 separate loops and functions? Well isn't it inefficient if the program checks each time the loop runs (each time the user is asked a new question, for example: '5 * 3 =') which mode the user chose? I'm worried about the performance with this option. Now I hear you think: but why would you bother about performance for such a simple, little non-performance critical application with the extremely powerful processors today and the huge amounts of RAM? Well, as I said earlier this program is mainly about teaching myself a good coding style and learning how to program etc. So I want to teach myself the good habits from the beginning.
The option with 2 while loops and 2 functions is much more efficient and will use less CPU, but it takes more space and duplicates code. I don't know if this is good style either.
So basically I'm asking the experts what's the best style/way to handle this kind of things. Also if you spot something bad in my code/bad style please tell me, I'm very open to feedback because I'm still a novice. ;)

First, a fundamental rule of programming is "don't prematurely optimize the code". That is, don't fiddle around with little details before you have the code working correctly, and write code that expresses what you want done as clearly as possible. This is good coding style. To obsess over the details of "which is faster" (in a loop that spends most of its time waiting for the user to input some number) is not good coding style.
Once it's working correctly, analyse (using for example a profiler tool) where the code is spending its time (assuming performance is a major factor in the first place). Once you have located the major "hotspot", then try to make that better in some way. How you go about that depends very much on what that particular hot-spot code does.
Which performs best will depend heavily on the details of the code and the compiler (and which compiler optimizations are chosen). It is quite likely that having an if inside a while-loop will run slower, but modern compilers are quite clever, and I have certainly seen cases where the compiler hoists such a choice out of the loop, in cases where the conditions don't change. Having two while-loops is harder for the compiler to "make better", because it most likely won't see that you are doing the same thing in both loops [because the compiler works from the bottom of the parse-tree up, and it will optimize the inside of the while-loop first, then go out to the if-else side, and at that point it's "lost track" of what's going on inside each loop].
Which is clearer, to have one while loop with an if inside, or an if with two while-loops, that's another good question.
Of course, the object oriented solution is to have two classes - one for mixed, another for single - and just run one loop, that calls the relevant (virtual) member function of the object created based on an if-else statement before the loop.
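The object-oriented approach described above might look roughly like this. The class names (QuizMode, SingleTableMode, MixedMode), the next_question member, and make_mode are invented for this sketch, and the question logic is reduced to a placeholder; it is not the code from the project.

```cpp
#include <memory>
#include <string>

// Hypothetical interface: each mode knows how to produce its own question.
class QuizMode {
public:
    virtual ~QuizMode() = default;
    virtual std::string next_question() = 0;
};

class SingleTableMode : public QuizMode {
public:
    explicit SingleTableMode(int table) : table_(table) {}
    std::string next_question() override {
        ++step_;  // walk through 1 * table, 2 * table, ...
        return std::to_string(step_) + " * " + std::to_string(table_) + " =";
    }
private:
    int table_;
    int step_ = 0;
};

class MixedMode : public QuizMode {
public:
    std::string next_question() override {
        // Placeholder: cycles through tables 1..10 instead of picking randomly.
        table_ = table_ % 10 + 1;
        return std::to_string(table_) + " * " + std::to_string(table_) + " =";
    }
private:
    int table_ = 0;
};

// The mode is chosen once, before the loop; the loop itself has no if-else.
std::unique_ptr<QuizMode> make_mode(int mode, int table) {
    if (mode == 10) return std::make_unique<SingleTableMode>(table);
    return std::make_unique<MixedMode>();
}
```

The main loop then only calls next_question() on whatever object make_mode returned, so the mode-specific code lives in one class per mode rather than in duplicated loops.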

Modern CPU branch predictors are so good that if during the loop the condition never changes, it will probably be as fast as having two while loops in each branch.

Is there a faster alternative to if-else in this case?

while(some_condition){
if(FIRST)
{
do_this;
}
else
{
do_that;
}
}
In my program, the probability of if(FIRST) succeeding is about 1 in 10000. Is there any alternative in C/C++ that avoids checking the condition on every iteration of the while loop, in the hope of better performance in this case?
Ok! Let me add some more detail.
I am writing code for a signal acquisition and tracking scheme where the state of my system will remain in TRACKING mode more often than ACQUISITION mode.
while(signal_present)
{
if(ACQUISITION_SUCCEEDED)
{
do_tracking(); // this function can change the state from TRACKING to ACQUISITION
}
else
{
do_acquisition(); // this function can change the state from ACQUISITION to TRACKING
}
}
So what happens here is that the system usually remains in tracking mode, but it can enter acquisition mode when tracking fails, which is not a common occurrence. (Assume the incoming data is infinite.)

The performance cost of a single branch is not going to be a big deal. The only thing you really can do is put the most likely code first, save on some instruction cache. Maybe. This is really deep into micro-optimization.

There is no particularly good reason to try to optimize this. Almost all modern architectures incorporate branch predictors. These speculate that a branch (an if or else) will be taken essentially the way it has been in the past. In your case, the speculation will succeed almost every time, eliminating nearly all of the overhead. There are non-portable ways to hint that a condition is taken one way or another, but any branch predictor will work just as well.
One thing you might want to do to improve instruction-cache locality is to move do_that out of the while loop (unless it is a function call).

GCC has a __builtin_expect “function” that you can use to indicate to the compiler which branch will likely be taken. Since FIRST is rarely true in your case, you would pass 0 as the expected value:
if(__builtin_expect(FIRST, 0)) …
Is this useful? I have no idea. I have never used it, never seen it used (except allegedly in the Linux kernel). The GCC documentation actually discourages its usage in favour of using profiling information to achieve a more reliable metric.
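For reference, a minimal sketch of how such a hint is commonly wrapped. It assumes GCC or Clang, since the builtin is compiler-specific; the UNLIKELY macro and the count_rare function are names made up for this example, and whether the hint changes anything measurable should be checked by profiling.

```cpp
#include <cstddef>

// Assumption: GCC or Clang. On other compilers the macro falls back to the
// plain condition, so the code still compiles but carries no hint.
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(cond) (__builtin_expect(!!(cond), 0))
#else
#define UNLIKELY(cond) (cond)
#endif

// Hypothetical workload: counts the iterations on which the rare case fires
// (once every 10000 iterations, mirroring the ratio in the question).
std::size_t count_rare(std::size_t iterations) {
    std::size_t rare = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        if (UNLIKELY(i % 10000 == 0)) {
            ++rare;  // rarely taken branch
        }
    }
    return rare;
}
```

The hint does not change what the code computes; it only tells the compiler which path to lay out as the fall-through case.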

On recent x86 processors, final execution speed depends very little on how this is written in the source code.
You can have a look at this page http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/ to see the amount of optimization that occurs inside the processor.

If this test is really consuming significant time compared to the implementation of do_acquisition, then you might get a boost by having a function table:
typedef void (*trackfunc)(void);
trackfunc tracking_action[] = {do_acquisition, do_tracking};
while (signal_present)
{
tracking_action[ACQUISITION_STATE]();
}
The effects of these kinds of manual optimizations are very dependent on the platform, the compiler, and the optimization settings.
You will most likely get a much greater performance gain by spending your time measuring and tuning the do_acquisition and do_tracking algorithms.
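A self-contained version of the function-table idea, as a sketch. The State enum, the counters, the run function, and the stand-in bodies of do_acquisition and do_tracking are assumptions made so the example compiles; the real functions would do the signal processing.

```cpp
#include <array>

// Assumed state encoding: the enum value doubles as the table index.
enum State { ACQUISITION = 0, TRACKING = 1 };

State state = ACQUISITION;
int acquisitions = 0;  // counters exist only to make the sketch observable
int trackings = 0;

// Stand-in: pretend acquisition always succeeds and switches to tracking.
void do_acquisition() { ++acquisitions; state = TRACKING; }

// Stand-in: pretend tracking is lost on every 5th call.
void do_tracking() {
    ++trackings;
    if (trackings % 5 == 0) state = ACQUISITION;
}

using trackfunc = void (*)();
const std::array<trackfunc, 2> tracking_action = {do_acquisition, do_tracking};

// Runs a fixed number of iterations; the table lookup replaces the if-else.
void run(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        tracking_action[state]();
    }
}
```

Note that an indirect call through the table is itself a branch the CPU has to predict, so this is not guaranteed to be faster than the if-else.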

If you don't know when "FIRST" will be true, then no.
The issue is whether FIRST is time consuming or not; maybe you could evaluate FIRST before the loop (or part of it) and just test the boolean.

I'd change moonshadow's code a little bit to
while( some_condition )
{
do_that;
if( FIRST )
{
do_this; // overwrite what you did earlier.
}
}

Based on your new information, I'd say something like the following:
while(some_condition)
{
while(ACQUISITION_SUCCEEDED)
{
do_tracking();
}
if (some_condition)
while(!ACQUISITION_SUCCEEDED)
{
do_acquisition();
}
}
The point is that the ACQUISITION_SUCCEEDED state must include the some_condition information to a certain extent (i.e. it will break out of the inner loops if some_condition is false - hence there is a chance to break out of the outer loop)

This is a classic in optimization. You should avoid putting conditionals within loops if you can. This code:
while(...)
{
if( a )
{
foo();
}
else
{
bar();
}
}
is often better to rewrite as:
if( a )
{
while(...)
{
foo();
}
}
else
{
while(...)
{
bar();
}
}
It's not always possible though, and whenever you try to optimize something you should measure the performance before and after.

There is not much more useful optimizing you can do with your example.
The call / branch to the do_this and do_that may negate any savings you earned by optimizing an if-then-else statement.
One of the rules of performance optimizing is to reduce branches. Most processors prefer to execute sequential code. They can take a chunk of sequential code and haul it into their caches. Branching interrupts this pleasantry and may cause a complete reload of the instruction cache (which loses valuable execution time).
Before you micro-optimize at this level, review your design to see if you can:
Eliminate unnecessary branching.
Split up code so it fits into the cache.
Organize the data to reduce fetches from memory or hard drive.
I'm sure that the above steps will gain you more performance than optimizing your posted loop.

Is Loop Hoisting still a valid manual optimization for C code?

Using the latest gcc compiler, do I still have to think about these types of manual loop optimizations, or will the compiler take care of them for me well enough?

If your profiler tells you there is a problem with a loop, and only then, a thing to watch out for is a memory reference in the loop which you know is invariant across the loop but the compiler does not. Here's a contrived example, bubbling an element out to the end of an array:
for ( ; i < a->length - 1; i++)
swap_elements(a, i, i+1);
You may know that the call to swap_elements does not change the value of a->length, but if the definition of swap_elements is in another source file, it is quite likely that the compiler does not. Hence it can be worthwhile hoisting the computation of a->length out of the loop:
int n = a->length;
for ( ; i < n - 1; i++)
swap_elements(a, i, i+1);
On performance-critical inner loops, my students get measurable speedups with transformations like this one.
Note that there's no need to hoist the computation of n-1; any optimizing compiler is perfectly capable of discovering loop-invariant computations among local variables. It's memory references and function calls that may be more difficult. And the code with n-1 is more manifestly correct.
As others have noted, you have no business doing any of this until you've profiled and have discovered that the loop is a performance bottleneck that actually matters.
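A compilable version of the hoisted loop, as a sketch. The Array struct and the bubble_to_end wrapper are assumptions made for this example (the original only shows a->length and swap_elements), and the loop condition is written as i + 1 < n so that an empty array is handled safely with unsigned indices.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Assumed container type for this sketch.
struct Array {
    std::size_t length;
    std::vector<int> items;
};

// Imagine this lives in another source file, so the compiler cannot see
// that it never modifies a->length.
void swap_elements(Array* a, std::size_t i, std::size_t j) {
    std::swap(a->items[i], a->items[j]);
}

// Bubbles the element at index i to the end, reading a->length only once.
void bubble_to_end(Array* a, std::size_t i) {
    const std::size_t n = a->length;  // hoisted: loaded before the loop
    for (; i + 1 < n; ++i) {
        swap_elements(a, i, i + 1);
    }
}
```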

Write the code, profile it, and only think about optimising it when you have found something that is not fast enough, and you can't think of an alternative algorithm that will reduce/avoid the bottleneck in the first place.
With modern compilers, this advice is even more important - if you write simple clean code, the compiler's optimiser can often do a better job of optimising the code than it can if you try to give it snazzy "pre-optimised" code.

Check the generated assembly and see for yourself. See if the computation for the loop-invariant code is being done inside the loop or outside the loop in the assembly code that your compiler generates. If it's failing to do the loop hoisting, do the hoisting yourself.
But as others have said, you should always profile first to find your bottlenecks. Once you've determined that this is in fact a bottleneck, only then should you check to see if the compiler's performing loop hoisting (aka loop-invariant code motion) in the hot spots. If it's not, help it out.

Compilers generally do an excellent job with this type of optimization, but they do miss some cases. Generally, my advice is: write your code to be as readable as possible (which may mean that you hoist loop invariants -- I prefer to read code written that way), and if the compiler misses optimizations, file bugs to help fix the compiler. Only put the optimization into your source if you have a hard performance requirement that can't wait on a compiler fix, or the compiler writers tell you that they're not going to be able to address the issue.

Where they are likely to be important to performance, you still have to think about them.
Loop hoisting is most beneficial when the value being hoisted takes a lot of work to calculate. If it takes a lot of work to calculate, it's probably a call out of line. If it's a call out of line, the latest version of gcc is much less likely than you are to figure out that it will return the same value every time.
Sometimes people tell you to profile first. They don't really mean it, they just think that if you're smart enough to figure out when it's worth worrying about performance, then you're smart enough to ignore their rule of thumb. Obviously, the following code might as well be "prematurely optimized", whether you have profiled or not:
#include <iostream>
bool isPrime(int p) {
for (int i = 2; i*i <= p; ++i) {
if ((p % i) == 0) return false;
}
return true;
}
int countPrimesLessThan(int max) {
int count = 0;
for (int i = 2; i < max; ++i) {
if (isPrime(i)) ++count;
}
return count;
}
int main() {
for (int i = 0; i < 10; ++i) {
std::cout << "The number of primes less than 1 million is: ";
std::cout << countPrimesLessThan(1000*1000);
std::cout << std::endl;
}
}
It takes a "special" approach to software development not to manually hoist that call to countPrimesLessThan out of the loop, whether you've profiled or not.
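The manual hoist being described could look like this. It is a sketch: the report function and its parameters are invented here so the repeated output can be exercised with a smaller limit, while isPrime and countPrimesLessThan are the functions from the listing above, repeated so the block stands alone.

```cpp
#include <sstream>
#include <string>

bool isPrime(int p) {
    for (int i = 2; i * i <= p; ++i) {
        if ((p % i) == 0) return false;
    }
    return true;
}

int countPrimesLessThan(int max) {
    int count = 0;
    for (int i = 2; i < max; ++i) {
        if (isPrime(i)) ++count;
    }
    return count;
}

// Hypothetical helper: produces the same line `times` times, but computes
// the count once, before the loop, instead of on every iteration.
std::string report(int limit, int times) {
    const int count = countPrimesLessThan(limit);  // hoisted out of the loop
    std::ostringstream out;
    for (int i = 0; i < times; ++i) {
        out << "The number of primes less than " << limit << " is: " << count << '\n';
    }
    return out.str();
}
```

With the call inside the loop, the expensive count would be recomputed on each of the ten iterations; hoisted, it runs once.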

Early optimizations are bad only if other aspects - like readability, clarity of intent, or structure - are negatively affected.
If you have to declare it anyway, loop hoisting can even improve clarity, and it explicitly documents your assumption "this value doesn't change".
As a rule of thumb I wouldn't hoist the count/end iterator for a std::vector, because it's a common scenario that is easily optimized. I wouldn't hoist anything that I can trust my optimizer to hoist, and I wouldn't hoist anything known to be not critical, e.g. when running through a list of a dozen windows to respond to a button click. Even if it takes 50ms, it will still appear "instantaneous" to the user. (But even that is a dangerous assumption: if a new feature requires looping 20 times over this same code, it suddenly is slow.) You should still hoist operations such as opening a file handle to append, etc.
In many cases, and certainly with loop hoisting, it helps a lot to consider relative cost: what is the cost of the hoisted calculation compared to the cost of running through the body?
As for optimizations in general, there are quite a few cases where the profiler doesn't help. Code may have very different behavior depending on the call path. Library writers often don't know their call path or frequency. Isolating a piece of code to make things comparable can already alter the behavior significantly. The profiler may tell you "Loop X is slow", but it won't tell you "Loop X is slow because call Y is thrashing the cache for everyone else". A profiler couldn't tell you "this code is fast because of your snarky CPU, but it will be slow on Steve's computer".

A good rule of thumb is usually that the compiler performs the optimizations it is able to.
Does the optimization require any knowledge about your code that isn't immediately obvious to the compiler? Then it is hard for the compiler to apply the optimization automatically, and you may want to do it yourself.
In most cases, loop hoisting is a fully automatic process requiring no high-level knowledge of the code, just a lot of lifetime and dependency analysis, which is what the compiler excels at in the first place.
It is possible to write code where the compiler is unable to determine whether something can be hoisted out safely though -- and in those cases, you may want to do it yourself, as it is a very efficient optimization.
As an example, take the snippet posted by Steve Jessop:
for (int i = 0; i < 10; ++i) {
std::cout << "The number of primes less than 1 billion is: ";
std::cout << countPrimesLessThan(1000*1000*1000);
std::cout << std::endl;
}
Is it safe to hoist out the call to countPrimesLessThan? That depends on how and where the function is defined. What if it has side effects? It may make an important difference whether it is called once or ten times, as well as when it is called. If we don't know how the function is defined, we can't move it outside the loop. And the same is true if the compiler is to perform the optimization.
Is the function definition visible to the compiler? And is the function short enough that we can trust the compiler to inline it, or at least analyze the function for side effects? If so, then yes, it will hoist it outside the loop.
If the definition is not visible, or if the function is very big and complicated, then the compiler will probably assume that the function call can not be moved safely, and then it won't automatically hoist it out.

Remember the 80-20 rule (80% of execution time is spent in 20% of the code).
There is no point in optimizing code that has no significant effect on the program's overall efficiency.
One should not bother with this kind of local optimization. The best approach is to profile the code to find the critical parts of the program that consume the most CPU cycles and try to optimize those. That kind of optimization makes sense and will result in improved program efficiency.

How to correctly benchmark a [templated] C++ program

< background>
I'm at a point where I really need to optimize C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but I then used virtual functions called in nested loops. I had bad feelings about that and the first implementation proved that this was a bad idea. However this was OK for testing the concept.
< /background>
Now I need this feature to be as fast as possible (well without assembly code or GPU calculation, this still has to be C++ and more readable than less).
Now I know a little bit more about templates and class policies (from Alexandrescu's excellent book) and I think that a compile-time code generation may be the solution.
However I need to test the design before doing the huge work of implementing it into the library. The question is about the best way to test the efficiency of this new feature.
Obviously I need to turn optimizations on, because without them g++ (and probably other compilers as well) would keep some unnecessary operations in the object code. I also need to make heavy use of the new feature in the benchmark, because a delta of 1e-3 seconds can make the difference between a good and a bad design (this feature will be called millions of times in the real program).
The problem is that g++ is sometimes "too smart" while optimizing and can remove a whole loop if it considers that the result of a calculation is never used. I've already seen that once when looking at the output assembly code.
If I add some printing to stdout, the compiler will then be forced to do the calculation in the loop but I will probably mostly benchmark the iostream implementation.
So how can I correctly benchmark a small feature extracted from a library?
Related question: is it a correct approach to do this kind of in vitro test on a small unit, or do I need the whole context?
Thanks for any advice!
There seem to be several strategies, from compiler-specific options allowing fine tuning to more general solutions that should work with every compiler like volatile or extern.
I think I will try all of these.
Thanks a lot for all your answers!

If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.
template<typename T> void sink(T const& t) {
volatile T sinkhole = t;
}
No iostream overhead, just a copy that has to remain in the generated code.
Now, if you're collecting results from a lot of operations, it's best not to discard them one by one, because these copies can still add some overhead. Instead, collect all results in a single non-volatile object (so that every individual result is needed) and then assign that object to a volatile. E.g. if your individual operations all produce strings, you can force evaluation by adding all char values together modulo 1<<32. This adds hardly any overhead; the strings will likely be in cache. The result of the addition is then assigned to a volatile, so each char in each string must in fact be calculated, no shortcuts allowed.
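A minimal sketch of that accumulate-then-sink idea, assuming the benchmarked operation returns a std::string. The names make_result and run_and_checksum are made up for illustration; make_result just stands in for whatever you are really measuring.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the operation being benchmarked.
std::string make_result(int i) {
    return std::string(8, static_cast<char>('a' + i % 26));
}

// Fold every character of every result into one 32-bit sum.
// Unsigned arithmetic wraps around, which gives the "modulo 1<<32" behaviour.
std::uint32_t run_and_checksum(int iterations) {
    std::uint32_t sum = 0;
    for (int i = 0; i < iterations; ++i) {
        const std::string s = make_result(i);
        for (unsigned char c : s) sum += c;
    }
    // A single volatile store at the end keeps the whole sum observable.
    volatile std::uint32_t sinkhole = sum;
    return sinkhole;
}
```

You would call run_and_checksum from inside the timed region. It is still worth checking the generated assembly, since a compiler is free to compute the sum in a cleverer way than you wrote it.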

Unless you have a really aggressive compiler (can happen), I'd suggest calculating a checksum (simply add all the results together) and outputting it.
Other than that, you might want to look at the generated assembly code before running any benchmarks so you can visually verify that any loops are actually being run.

Compilers are only allowed to eliminate code branches that cannot happen. As long as it cannot rule out that a branch should be executed, it will not eliminate it. As long as there is some data dependency somewhere, the code will be there and will be run. Compilers are not too smart about estimating which aspects of a program will not be run and don't try to, because that's an NP problem and hardly computable. They have some simple checks such as for if (0), but that's about it.
My humble opinion is that you were possibly hit by some other problem earlier on, such as the way C/C++ evaluates boolean expressions.
But anyways, since this is about a test of speed, you can check that things get called for yourself - run it once without, then another time with a test of return values. Or a static variable being incremented. At the end of the test, print out the number generated. The results will be equal.
To answer your question about in-vitro testing: Yes, do that. If your app is so time-critical, do that. On the other hand, your description hints at a different problem: if your deltas are in a timeframe of 1e-3 seconds, then that sounds like a problem of computational complexity, since the method in question must be called very, very often (for few runs, 1e-3 seconds is negligible).
The problem domain you are modeling sounds VERY complex and the datasets are probably huge. Such things are always an interesting effort. Make sure that you absolutely have the right data structures and algorithms first, though, and micro-optimize all you want after that. So, I'd say look at the whole context first. ;-)
Out of curiosity, what is the problem you are calculating?

You have a lot of control on the optimizations for your compilation. -O1, -O2, and so on are just aliases for a bunch of switches.
From the man pages
-O2 turns on all optimization flags specified by -O. It also turns
on the following optimization flags: -fthread-jumps
-falign-functions -falign-jumps -falign-loops -falign-labels
-fcaller-saves -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks -fexpensive-optimizations -fgcse
-fgcse-lm -foptimize-sibling-calls -fpeephole2 -fregmove
-freorder-blocks -freorder-functions -frerun-cse-after-loop
-fsched-interblock -fsched-spec -fschedule-insns
-fschedule-insns2 -fstrict-aliasing -fstrict-overflow -ftree-pre
-ftree-vrp
You can tweak and use this command to help you narrow down which options to investigate.
...
Alternatively you can discover which binary optimizations are
enabled by -O3 by using:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
Once you find the culprit optimization you shouldn't need the couts.

If this is possible for you, you might try splitting your code into:
the library you want to test, compiled with all optimizations turned on
a test program, dynamically linking the library, with optimizations turned off
Otherwise, you might specify a different optimization level (it looks like you're using gcc...) for the test function with the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes).
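A rough sketch of the attribute approach, assuming GCC. The names feature, drive_feature and BENCH_NO_OPT are invented for illustration. The macro expands to nothing on other compilers, so there the driver is optimized like everything else. Note that the GCC manual describes the optimize attribute as intended for debugging rather than production code, which is fine for a throwaway benchmark driver.

```cpp
// Hypothetical feature under test; in a real setup this would live in the optimized library.
inline double feature(double x) { return x * 1.0001 + 0.5; }

// Only GCC understands the optimize attribute; elsewhere the macro is empty.
#if defined(__GNUC__) && !defined(__clang__)
#define BENCH_NO_OPT __attribute__((optimize("O0")))
#else
#define BENCH_NO_OPT
#endif

// The driver loop is compiled without optimization, so the calls are not removed.
BENCH_NO_OPT double drive_feature(int iterations) {
    double acc = 0.0;
    for (int i = 0; i < iterations; ++i) acc = feature(acc);
    return acc;
}
```

One thing to keep in mind: the unoptimized driver adds its own loop overhead to every measurement, so this works best when the feature does much more work per call than the loop around it.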

You could create a dummy function in a separate cpp file that does nothing, but takes as argument whatever is the type of your calculation result. Then you can call that function with the results of your calculation, forcing gcc to generate the intermediate code, and the only penalty is the cost of invoking a function (which shouldn't skew your results unless you call it a lot!).

#include <iostream>
// Mark coords as extern.
// The compiler is now NOT allowed to optimise away coords,
// so it cannot remove the loop where you initialise it.
// This is because the data could be used by another compilation unit.
extern double coords[500][3];
double coords[500][3];
int main()
{
//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
std::cout << "hello world !"<< std::endl;
return 0;
}

edit: the easiest thing you can do is simply use the data in some spurious way after the function has run and outside your benchmarks. Like,
StartBenchmarking(); // ie, read a performance counter
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
StopBenchmarking(); // what comes after this won't go into the timer
// this is just to force the compiler to use coords
double foo = 0.0; // must be initialised before accumulating into it
for (int j = 0 ; j < 500 ; ++j )
{
foo += coords[j][0] + coords[j][1] + coords[j][2];
}
cout << foo;
What sometimes works for me in these cases is to hide the in vitro test inside a function and pass the benchmark data sets through volatile pointers. This tells the compiler that it must not collapse subsequent writes to those pointers (because they might be eg memory-mapped I/O). So,
void test1( volatile double *coords )
{
//perform a simple initialization of all coordinates:
for (int i=0; i<1500; i+=3)
{
coords[i+0] = 3.23;
coords[i+1] = 1.345;
coords[i+2] = 123.998;
}
}
For some reason I haven't figured out yet it doesn't always work in MSVC, but it often does -- look at the assembly output to be sure. Also remember that volatile will foil some compiler optimizations (it forbids the compiler from keeping the pointer's contents in a register and forces writes to occur in program order), so this is only trustworthy if you're using it for the final write-out of data.
In general in vitro testing like this is very useful so long as you remember that it is not the whole story. I usually test my new math routines in isolation like this so that I can quickly iterate on just the cache and pipeline characteristics of my algorithm on consistent data.
The difference between test-tube profiling like this and running in "the real world" is that in the real world you will get wildly varying input data sets (sometimes best case, sometimes worst case, sometimes pathological), the cache will be in some unknown state on entering the function, and you may have other threads banging on the bus; so you should run some benchmarks on this function in vivo as well when you are finished.
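For what it's worth, the StartBenchmarking()/StopBenchmarking() placeholders above could be filled in with std::chrono. This is only a sketch: time_init is a made-up name, and coords is kept as a global with external linkage so the stores have to stay. The data is consumed after the timed region, exactly as in the snippet above.

```cpp
#include <chrono>

double coords[500][3];

// Times the initialisation loop and returns the elapsed nanoseconds.
// The checksum is computed after the clock is stopped and handed back
// through checksum_out so that coords is actually used.
long long time_init(double* checksum_out) {
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < 500; ++i) {
        coords[i][0] = 3.23;
        coords[i][1] = 1.345;
        coords[i][2] = 123.998;
    }
    const auto stop = clock::now();

    // Outside the timed region: force the compiler to use coords.
    double foo = 0.0;
    for (int j = 0; j < 500; ++j) {
        foo += coords[j][0] + coords[j][1] + coords[j][2];
    }
    *checksum_out = foo;

    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

A single pass over 500 elements is far too short to time reliably, so in practice you would repeat the loop many times inside the timed region and look at the spread over several runs.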

I don't know if GCC has a similar feature, but with VC++ you can use:
#pragma optimize
to selectively turn optimizations on/off. If GCC has similar capabilities, you could build with full optimization and just turn it off where necessary to make sure your code gets called.
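GCC does have something along those lines: the function-specific option pragmas (#pragma GCC push_options, #pragma GCC optimize and #pragma GCC pop_options). A small sketch, with count_iterations as a made-up example function. The guards make the pragmas GCC-only, so other compilers simply build the function with the normal settings.

```cpp
// Switch optimization off for the functions defined between push and pop (GCC only).
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC push_options
#pragma GCC optimize("O0")
#endif

// Hypothetical benchmark loop that should survive compilation as written.
long count_iterations(long n) {
    long calls = 0;
    for (long i = 0; i < n; ++i) ++calls;
    return calls;
}

// Restore whatever optimization level was given on the command line.
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC pop_options
#endif
```

As with the VC++ pragma, anything inside the unoptimized region is measured unoptimized, so it should only wrap the driver code and not the feature you want to time.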

Just a small example of an unwanted optimization:
#include <vector>
#include <iostream>
using namespace std;
int main()
{
double coords[500][3];
//perform a simple initialization of all coordinates:
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
cout << "hello world !"<< endl;
return 0;
}
If you comment out the code from "double coords[500][3]" to the end of the for loop, it will generate exactly the same assembly code (just tried with g++ 4.3.2). I know this example is far too simple, and I wasn't able to show this behavior with a std::vector of a simple "Coordinates" structure.
However I think this example still shows that some optimizations can introduce errors in the benchmark and I wanted to avoid some surprises of this kind when introducing new code in a library. It's easy to imagine that the new context might prevent some optimizations and lead to a very inefficient library.
The same should also apply to virtual functions (but I don't prove it here). Used in a context where static dispatch would do the job, I'm pretty confident that decent compilers will eliminate the extra indirection of the virtual call. I could try this call in a loop and conclude that calling a virtual function is not such a big deal.
Then I'll call it hundreds of thousands of times in a context where the compiler cannot guess the exact type behind the pointer, and get a 20% increase in running time...
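To make the two contexts concrete, here is a sketch with made-up names (Potential, Harmonic, sum_known_type, sum_unknown_type). In the first function the dynamic type is visible, so the compiler may devirtualize and inline the call. In the second it is hidden behind a base reference, so each call normally goes through the vtable. Whether either actually happens depends on the compiler and flags, so treat this as an illustration, not a measurement.

```cpp
// Hypothetical interface; the names are illustrative only.
struct Potential {
    virtual ~Potential() = default;
    virtual double energy(double r) const = 0;
};

struct Harmonic final : Potential {
    double energy(double r) const override { return 0.5 * r * r; }
};

// The dynamic type is known here, so the virtual call can be resolved statically.
double sum_known_type(int n) {
    Harmonic h;
    double total = 0.0;
    for (int i = 0; i < n; ++i) total += h.energy(i);
    return total;
}

// The dynamic type is hidden behind the base reference.
double sum_unknown_type(const Potential& p, int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) total += p.energy(i);
    return total;
}
```

Both functions return the same value for the same n. A benchmark that only exercises the first form would tell you little about the cost of the second.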

At startup, read a value from a file. In your code, say if (input == "x") cout << result_of_benchmark;
The compiler will not be able to eliminate the calculation, and if you ensure the input is not "x", you won't benchmark the iostream.
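Something like the following sketch, where compute and run_guarded are made-up names and flag is assumed to come from a file, argv or stdin so that its value is unknown at compile time:

```cpp
#include <iostream>
#include <string>

// Hypothetical calculation standing in for the benchmarked feature.
double compute(int n) {
    double acc = 0.0;
    for (int i = 1; i <= n; ++i) acc += 1.0 / i;
    return acc;
}

// Runs the calculation and prints the result only when flag equals "x".
// Returns true if the result was printed.
bool run_guarded(const std::string& flag, int n, std::ostream& out) {
    const double result = compute(n);
    if (flag == "x") {
        out << result << '\n';
        return true;
    }
    return false;
}
```

In principle an optimizer is still allowed to move a side-effect-free calculation into the branch where its result is used, so it is worth confirming in the assembly that the loop runs before the comparison.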
