Bizarrely slow method - C++

I have a method in my project which, when placed into its own program, takes mere seconds to run; inside the project where it belongs, it takes 5 minutes. I have NO idea why. I have tried profiling, taking bits out, changing this and that. I'm stumped.
It populates a vector of integers to be used by another class, but that class is not currently being instantiated. I have checked as much as I can, and it really does seem as if nothing else is happening: the method magically takes longer in one project than it does in another.
The method is run at startup and takes about 5 minutes, or about 3 seconds if run on its own. What could be causing this? Strange project settings? Multithreading stuff that I'm unaware of? (As far as I know there is none in my project anyway, unless it's done automatically.)
There is a link to the project here. If anyone could solve this for me I would be so grateful, and as soon as I can will start a bounty for this.
The method is called PopulatePathVectors and runs in Level.cpp. Commenting out the call to the method (in the level constructor) means the program starts up in seconds. The only other class that uses the lists it generates is Agent, but currently none are being instantiated.
EDIT - As requested, here is the code. Although keep in mind that my question is not 'why is the code slow?' but 'why is it fast in one place and slow in my project?'
//parses the text path vector into the engine
void Level::PopulatePathVectors(string pathTable)
{
    // Read the file line by line.
    ifstream myFile(pathTable);

    for (unsigned int i = 0; i < nodes.size(); i++)
    {
        pathLookupVectors.push_back(vector<vector<int>>());

        for (unsigned int j = 0; j < nodes.size(); j++)
        {
            string line;

            if (getline(myFile, line)) //enter if a line is read successfully
            {
                stringstream ss(line);
                istream_iterator<int> begin(ss), end;
                pathLookupVectors[i].push_back(vector<int>(begin, end));
            }
        }
    }
}
EDIT - I understand that the code is not the best it could be, but that isn't the point here. It runs quickly on its own, in about 3 seconds, and that's fine for me. The problem I'm trying to solve is why it takes so much longer inside the project.
EDIT - I commented out all of the game code apart from the main game loop. I placed the method into the initialize section of the code, which is run once on startup. Apart from a few methods setting up a window, it's now pretty much the same as the program with ONLY the method in, yet it STILL takes about 5 minutes to run. Now I know it has nothing to do with dependencies on pathLookupVectors.
Also, I know it's not a memory thing where the computer starts writing to the hard drive, because while the slow program is chugging away running the method, I can open another instance of VS and run the single-method program at the same time, which completes in seconds. I realise that the problem might be some basic setting, but I'm not experienced, so apologies if this does disappointingly end up being the reason why. I still don't have a clue why it's taking so much longer.
It would be great if this didn't take so long in debug mode, as it means waiting 5 minutes every time I make a change. There MUST be a reason why it's so slow here. These are the other headers included in the cut-down project:
#include <d3d10.h>
#include <d3dx10.h>
#include "direct3D.h"
#include "game.h"
#include "Mesh.h"
#include "Camera.h"
#include "Level.h"
#include <vector>
using namespace std;
EDIT - This is a much smaller, self-contained project with only a tiny bit of code where the problem still happens.
This is also a very small project with the same code where it runs very fast.

I ran this code in MSVC10 (same compiler you are using) and duplicated your results with the projects you provided. However I was unable to profile with this compiler due to using the express version.
I ran this code in the MSVC9 compiler, and it ran 5 times faster! I also profiled it, and got these results:
Initialize (97.43%)
    std::vector::_Assign (29.82%)
        std::vector::erase (12.86%)
        std::vector::_Make_iter (3.71%)
        std::_Vector_const_iterator (3.14%)
        std::_Iterator_base (3.71%)
        std::~_Ranit (3.64%)
    std::getline (27.74%)
        std::basic_string::append (12.16%)
        std::basic_string::_Grow (3.66%)
        std::basic_string::_Eos (3.43%)
        std::basic_streambuf::snextc (5.61%)
    std::operator<<(std::string) (13.04%)
        std::basic_streambuf::sputc (5.84%)
    std::vector::push_back (11.84%)
        std::_Uninit_move::?? (3.32%)
    std::basic_istream::operator>>(int) (7.77%)
        std::num_get::get (4.6%)
            std::num_get::do_get (4.55%)
The "fast" version got these results: (scaled to match other times):
Initialize (97.42%)
    std::_Uninit_copy (31.72%)
        std::_Construct (18.58%)
        std::_Vector_const_iterator::operator++ (6.34%)
        std::_Vector_const_iterator::operator!= (3.62%)
    std::getline (25.37%)
        std::getline (13.14%)
        std::basic_ios::widen (12.23%)
    std::_Construct (18.58%)
        std::vector::vector (14.05%)
    std::_Destroy (14.95%)
        std::vector::~vector (11.33%)
std::vector::_Tidy (23.46%)
    std::_Destroy (19.89%)
        std::vector::~vector (12.23%)
[ntdll.dll] (3.62%)
After studying these results and considering Michael Price's comments many times, it dawned on me to make sure the input files were the same size. When it did, I realized the profile for the "fast" version shows neither std::operator<<(std::string) nor std::vector::push_back at all, which seemed suspicious. I checked the MethodTest project, and found that it did not have a WepTestLevelPathTable.txt, causing the getline to fail and the entire function to do almost nothing at all, except allocate a bunch of empty vectors. When I copied the WepTestLevelPathTable.txt to the MethodTest project, it is exactly the same speed as the "slow" version. Case solved. Use a smaller file for debug builds.
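In hindsight, a cheap guard would have surfaced the mismatch immediately. Here is a minimal sketch (my addition, not the asker's code):

void Level::PopulatePathVectors(string pathTable)
{
    ifstream myFile(pathTable);
    if (!myFile)
    {
        // A missing table made the "fast" project silently build
        // nothing but empty vectors; fail loudly instead.
        cerr << "Cannot open path table: " << pathTable << endl;
        return;
    }
    // ... read line by line as before ...
}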

Here are a few methods that I believe could slow down the start-up process:
Level::GenerateGridNodes():
void Level::GenerateGridNodes()
{
    int gridHeight = gridSize;
    int gridWidth = gridSize;

    // ADD THIS STATEMENT:
    nodes.reserve(gridHeight * gridWidth);

    for (int h = 2; h <= gridHeight; h++)
    {
        for (int w = 2; w <= gridWidth; w++)
        {
            nodes.push_back(Node(D3DXVECTOR3((float)w, (float)h, 0)));
        }
    }
}//end GenerateGridNodes()
Level::CullInvalidNodes(): For std::vectors, use the remove-erase idiom to make erasing elements faster. You also need to rethink how this function should work, because it seems to do a lot of redundant erasing and adding of nodes. Would it make sense in the code to, instead of erasing, simply assign the value you push_back() right after deletion to the element you're erasing? So instead of v.erase(itr) followed by v.push_back(new_element), you could simply do *itr = new_element. DISCLAIMER: I haven't looked at what the function actually does. Honestly, I don't have the time for that. I'm just pointing you to a possibility.
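For reference, a minimal sketch of the remove-erase idiom (IsInvalid is a hypothetical predicate standing in for whatever test CullInvalidNodes actually applies):

#include <algorithm>
#include <vector>

// Hypothetical predicate; the real test lives in CullInvalidNodes.
bool IsInvalid(const Node& n);

void Level::CullInvalidNodes()
{
    // remove_if compacts surviving nodes to the front in one pass,
    // then a single erase trims the tail: O(n) overall, instead of
    // O(n^2) from erasing elements one at a time.
    nodes.erase(std::remove_if(nodes.begin(), nodes.end(), IsInvalid),
                nodes.end());
}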
In Level::LinkNodes():
void Level::LinkNodes()
{
    //generates a vector for every node
    // ADD THIS BEFORE THE FOR LOOP
    nodeAdjacencyVectors.reserve(nodes.size());

    for (unsigned int i = 0; ....)
        //... Rest of the code
}//end LinkNodes()
In short, you still have room for a lot of improvement. I believe the main hog is the Level class's functions. You should have another look through them and probably rethink how each function is implemented, especially those invoked within the constructor of the Level class.

It contained a loop within a loop, and the inner loop reads a line from a data file of over 30MB on every iteration. Of course it's going to be slow. I would say it's slow by design.

Related

Checking parallel_for loop syntax

I've been asked by a friend to help them improve the performance of their code. I'm not very well versed in C++, with most of my training in Java. The program is designed to find optimal values for a set of data, and uses a fairly blunt "try everything" method, which obviously is fairly slow to run. My solution was to replace the outermost loop of the main process.
However: I am very unfamiliar with this tool, as I've never used it before, and can't tell if I've successfully replicated the original loop's behavior with the new one I've written. Or even if my parameters are correctly formatted. I've searched a good few times but can't seem to find a concise explanation for how to format these as opposed to "normal" for loops.
Additional inclusions (obviously):
include "tbb/parallel_for.h"
The original loop:
for(alpha=30; alpha < 101; alpha= alpha+4) {
The new one:
parallel_for (alpha(30); 101; & {
The example I'm working from:
for (size_t i = 0; i < size; i++)
which becomes:
parallel_for (size_t(0), size, [&](size_t i)
Sadly, I can't simply test the code, since I don't have the entire program, merely the .cpp file.
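For reference, a hedged sketch of how the strided loop could map onto tbb::parallel_for: iterate over the step count and reconstruct alpha inside the lambda. Here process is a hypothetical stand-in for the loop body, and this assumes each iteration touches only its own data, which must be verified before parallelizing.

#include "tbb/parallel_for.h"

void process(int alpha);  // hypothetical stand-in for the loop body

void run_parallel()
{
    // alpha = 30, 34, ..., 98: 18 steps of 4 starting at 30,
    // matching for (alpha = 30; alpha < 101; alpha = alpha + 4).
    tbb::parallel_for(size_t(0), size_t(18), [&](size_t step) {
        int alpha = 30 + static_cast<int>(step) * 4;
        process(alpha);
    });
}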

Removing code portions does not match the profiler's data

I am doing a little proof of concept profile and optimize type of example. However, I ran into something that I can't quite explain, and I'm hoping that someone here can clear this up.
I wrote a very short snippet of code:
#include <stdio.h>

int a, b, c;                  // globals shared by both callbacks

void callbackWasterOne(void);
void callbackWasterTwo(void);

int main (void)
{
    for (int j = 0; j < 1000; j++)
    {
        a = 1;
        b = 2;
        c = 3;
        for (int i = 0; i < 100000; i++)
        {
            callbackWasterOne();
            callbackWasterTwo();
        }
        printf("Two a: %d, b: %d, c: %d.", a, b, c);
    }
    return 0;
}

void callbackWasterOne(void)
{
    a = b * c;
}

void callbackWasterTwo(void)
{
    b = a * c;
}
All it does is call two very basic functions that just multiply numbers together. Since the code is identical, I expect the profiler (oprofile) to report roughly the same numbers for both functions.
I run this code 10 times per profile, and I got the following values for how long time is spent on each function:
main: average = 5.60%, stdev = 0.10%
callbackWasterOne = 43.78%, stdev = 1.04%
callbackWasterTwo = 50.24%, stdev = 0.98%
rest is in miscellaneous things like printf and no-vmlinux
The difference between the time for callbackWasterOne and callbackWasterTwo is significant enough (to me at least) given that they have the same code, that I switched their order in my code and reran the profiler with the following results now:
main: average = 5.45%, stdev = 0.40%
callbackWasterOne = 50.69%, stdev = 0.49%
callbackWasterTwo = 43.54%, stdev = 0.18%
rest is in miscellaneous things like printf and no-vmlinux
So evidently the profiler samples one more than the other based on the execution order. Not good. Disregarding this, I decided to see the effects of removing some code and I got this for execution times (averages):
Nothing removed: 0.5295s
call to callbackWasterOne() removed from for loop: 0.2075s
call to callbackWasterTwo() removed from for loop: 0.2042s
remove both calls from for loop: 0.1903s
remove both calls and the for loop: 0.0025s
remove contents of callbackWasterOne: 0.379s
remove contents of callbackWasterTwo: 0.378s
remove contents of both: 0.382s
So here is what I'm having trouble understanding:
When I remove just one of the calls from the for loop, the execution time drops by ~60%, which is greater than the time spent by that one function + the main in the first place! How is this possible?
Why is the effect of removing both calls from the loop so small compared to removing just one? I can't figure out this non-linearity. I understand that the for loop is expensive, but in that case (if most of the remaining time can be attributed to the for loop that performs the function calls), why would removing one of the calls cause such a large improvement in the first place?
I looked at the disassembly and the two functions are the same in code. The calls to them are the same, and removing the call simply deletes the one call line.
Other info that might be relevant
I'm using Ubuntu 14.04LTS
The code is complied by Eclipse with no optimization (O0)
I time the code by running it in terminal using "time"
I use OProfile with count = 10000 and 10 repetitions.
Here are the results from when I do this with -O1 optimization:
main: avg = 5.89%, stdev = 0.14%
callbackWasterOne: avg = 44.28%, stdev = 2.64%
callbackWasterTwo: avg = 49.66%, stdev = 2.54% (greater than before)
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.522s
Remove callbackWasterOne call: 0.149s (71.47% decrease)
Remove callbackWasterTwo call: 0.123s (76.45% decrease)
Remove both calls: 0.0365s (93.01% decrease) (what I would expect given the profile data just above)
So removing one call now is much better than before, and removing both still carries a benefit (probably because the optimizer understands that nothing happens in the loop). Still, removing one is much more beneficial than I would have anticipated.
Results of the two functions using different variables:
I defined 3 more variables for callbackWasterTwo() to use instead of reusing same ones. Now the results are what I would have expected.
main: avg = 10.87%, stdev = 0.19% (average is greater, but maybe due to those new variables)
callbackWasterOne: avg = 46.08%, stdev = 0.53%
callbackWasterTwo: avg = 42.82%, stdev = 0.62%
Rest is miscellaneous
Results of removing various bits (execution time averages):
Nothing removed: 0.520s
Remove callbackWasterOne call: 0.292s (43.83% decrease)
Remove callbackWasterTwo call: 0.291s (44.07% decrease)
Remove both calls: 0.065s (87.55% decrease)
So now removing both calls is pretty much equivalent (within stdev) to removing one call + the other.
Since the result of removing either function is pretty much the same (43.83% vs 44.07%), I am going to go out on a limb and say that perhaps the profiler data (46% vs 42%) is still skewed. Perhaps it is the way it samples (going to vary the counter value next and see what happens).
It appears that the success of optimization relates pretty strongly to the code reuse fraction. The only way to achieve "exactly" (you know what I mean) the speedup noted by the profiler is to optimize on completely independent code. Anyway this is all interesting.
I am still looking for some explanation ideas for the 70% decrease in the -O1 case though...
I did this with 10 functions (different formulas in each, but using some combination of 6 different variables, 3 at a time, all multiplication):
These results are disappointing to say the least. I know the functions are identical, and yet the profiler indicates that some take significantly longer. No matter which one I remove (a "fast" or a "slow" one), the results are the same ;) So this leaves me to wonder: how many people are incorrectly relying on the profiler to point them at the wrong areas of code to fix? If I unknowingly saw these results, what could possibly tell me to go fix the 5% function rather than the 20% one (even though they are exactly the same)? What if the 5% one was much easier to fix, with a large potential benefit? And of course, this profiler might just not be very good, but it is popular! People use it!
Here is a screenshot. I don't feel like typing it in again:
My conclusion: I am overall rather disappointed with Oprofile. I decided to try out callgrind (valgrind) through command line on the same function and it gave me far more reasonable results. In fact, the results were very reasonable (all functions spent ~ the same amount of time executing). I think Callgrind samples far more than Oprofile ever did.
Callgrind will still not explain the difference in improvement when a function is removed, but at least it gives the correct baseline information...
Ah, I see you did look at the assembly. This question is indeed interesting in its own right, but in general there's no point profiling unoptimized code, since there's so much boilerplate that could easily be reduced even in -O1.
If it's really only the call that's missing, then that could explain the timing differences. There's lots of boilerplate in the -O0 stack manipulation code (any caller-saved registers have to be pushed onto the stack, and any arguments too; afterwards any return value has to be handled and the opposite stack manipulation has to be done) which contributes to the time it takes to call the functions, but is not necessarily attributed to the functions themselves by oprofile, since that code is executed before/after the function is actually called.
I suspect the reason the second function seems to always take less time is that there's less (or no) stack juggling that needs to be done -- the parameter values are already on the stack thanks to the previous function call, and so, as you've seen, only the call to the function has to be executed, without any other extra work.

Orienteering map game using C++

I solved a programming puzzle using a brute force method and without dynamic programming, and it worked fine. Here is the puzzle:
An orienteering map is to be given in the following format.
" ##### "
" #...# "
" #S#G# "
" ##### "
Calculate the minimum distance from the start to the goal with passing all the checkpoints.
A map consists of the following 5 characters. You can assume that the map does not contain any invalid characters and that the map has exactly one start symbol 'S' and exactly one goal symbol 'G'.
'S' means the orienteering start.
'G' means the orienteering goal.
'@' means an orienteering checkpoint.
'.' means an opened-block that players can pass.
'#' means a closed-block that players cannot pass.
It is allowed to move only by one step vertically or horizontally (up, down, left, or right) to the next block. Other types of movements, such as moving diagonally (left up, right up, left down and right down) and skipping one or more blocks, are NOT permitted.
You MUST NOT get out of the map.
Distance is to be defined as the number of movements to the different blocks.
You CAN pass opened-blocks, checkpoints, the start, and the goal more than once if necessary.
You can assume that parameters satisfy following conditions.
1 <= width <= 100
1 <= height <= 100
The maximum number of checkpoints is 18.
Then I found a much faster solution, which I don't understand some things about:
#include <iostream>
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cstring>
#include <map>
#include <queue>
#include <stack>
#include <string>
#include <cstdlib>
#include <ctime>
#include <set>
#include <math.h>
using namespace std;
typedef long long LL;
const int maxn = 1e2 + 10;
#define rep(i,a,b) for(int i=(a);i<=(b);i++)
#define pb push_back
std::vector<int> path;
const int INF = 1 << 20;
struct Point
{
    int x, y;
    bool operator < (const Point &a) const
    {
        return x < a.x || (x == a.x) && y < a.y;
    }
};
std::vector<Point> P;
char mat[maxn][maxn];
int vis[maxn][maxn];
int w, h, s, e;
int d[1 << 20][20];
int dx[] = {-1, 0, 0, 1};
int dy[] = {0, -1, 1, 0};
int dist[25][25];

int main() {
    ios_base::sync_with_stdio(false);
    cin.tie(0);
    while (cin >> w >> h) {
        map<Point, int> id;
        P.clear();
        path.clear();
        memset(d, 100, sizeof d);
        memset(dist, 100, sizeof dist);
        for (int i = 0; i < h; i++) {
            scanf("%s", mat[i]);
            for (int j = 0; mat[i][j]; ++j) {
                char &c = mat[i][j];
                if (c == 'S' || c == 'G' || c == '@') {
                    P.pb((Point){i, j});
                    int sz = P.size();
                    id[P[sz - 1]] = sz;
                    if (c == 'S') s = sz - 1;
                    else if (c == 'G') e = sz - 1;
                    path.pb(sz - 1);
                }
            }
        }
        for (int i = 0; i < path.size(); i++) {
            Point now = P[path[i]];
            int x = path[i];
            //cout << "x " << x << endl;
            dist[x][x] = 0;
            memset(vis, 0, sizeof vis);
            vis[now.x][now.y] = 1;
            queue<Point> q;
            q.push(now);
            //cout << "Bfs" << endl;
            while (!q.empty()) {
                now = q.front(); q.pop();
                for (int i = 0; i < 4; i++) {
                    int nx = now.x + dx[i], ny = now.y + dy[i];
                    if (nx >= 0 && nx < h && ny >= 0 && ny < w && mat[nx][ny] != '#' && !vis[nx][ny]) {
                        Point tp = (Point){nx, ny};
                        q.push(tp);
                        vis[nx][ny] = vis[now.x][now.y] + 1;
                        if (id[tp]) {
                            dist[x][id[tp] - 1] = vis[now.x][now.y];
                        }
                    }
                }
            }
        }
        d[1 << s][s] = 0;
        int M = path.size();
        for (int i = 0; i < (1 << M); ++i) {
            for (int j = 0; j < M; j++) {
                int p = path[j];
                for (int k = 0; 1 << k <= i; k++) {
                    if (i & (1 << k)) {
                        d[i | (1 << p)][p] = min(d[i | (1 << p)][p], d[i][k] + dist[k][p]);
                    }
                }
            }
        }
        cout << d[(1 << M) - 1][e] << endl;
    }
    return 0;
}
Here are 3 specific questions I have about it:
What is the use of the constant INF? It isn't used anywhere in the program. I understand that programmers very often leave things in their programs which may not seem to be of any use presently, but which would be useful for future modifications. Does INF serve that same purpose, to be used if the program is ever modified to be more efficient or to use a different method?
The use of the left-shift operator inside the array dimensions. For example, int d[1<<20][20]. What purpose does the left-shift operator accomplish with regard to this program? There are various other instances where the left-shift operator has been used inside array dimensions, and I can't understand why.
The overloading of the less-than operator. In the Point structure, the less-than operator is overloaded. But I can't seem to find out where in the program it has been called. It needs a Point object to call it, but I can’t find any place where any Point object calls that member function.
Your questions aren't invalid, but do not need all the context to ask them. They could each be separate questions, and I've provided a link for each showing that the essence of the question has been asked before more succinctly. If you isolate your questions and separate them out of the specific body of code you are looking at, that's better--they can be triaged more easily as duplicates.
What is the use of the constant INF? It isn't used anywhere in the program. I understand that programmers very often leave things in their programs which may not seem to be of any use presently, but which would be useful for future modifications. Does INF serve that same purpose, to be used if the program is ever modified to be more efficient or to use a different method?
If you delete the line declaring INF, does it still compile and work? Does it get slower? If so, it is a magic incantation that makes programs faster, known only in C++ secret societies. :-) If not, it's just a leftover definition as you suspect...perhaps used at some time, or perhaps never was.
See:
How do I detect unused macro definitions & typedefs?
The use of the left-shift operator inside the array dimensions. For example, int d[1<<20][20]. What purpose does the left-shift operator accomplish with regard to this program? There are various other instances where the left-shift operator has been used inside array dimensions, and I can't understand why.
In binary math, shifting 1 left by some number of bits is the same as raising 2 to that power. So 1 << 20 is 2^20, or 1048576. It's faster to bit-shift than to call a power function, although with a power function optimized enough to special-case a base of 2, the difference may not be that much:
are 2^n exponent calculations really less efficient than bit-shifts?
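Applied to this program, the shifts size and index a bitmask DP table over the special points. A small illustration (my annotation, not from the original code):

// d[mask][j]: each of the 20 low bits of 'mask' records whether one
// of the (up to 20) special points has been visited, so the table
// needs 2^20 rows. d[1 << s][s] = 0 then means "only the start s
// visited, currently standing at s".
static_assert((1 << 20) == 1048576, "1 << 20 is 2^20");
int d[1 << 20][20];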
The overloading of the less-than operator. In the Point structure, the less-than operator is overloaded. But I can’t seem to find out where in the program it has been called. It needs a Point object to call it, but I can’t find any place where any Point object calls that member function.
One might think that if you want to test if a method is ever called or a definition used, you can delete it and see if it still compiles. But in C++ that doesn't always work; some definitions are overloads. If you delete them, the program still compiles but just falls through to more basic behavior. Even preprocessor macros can be funny because one file might detect if it had been defined elsewhere, and do something different if not...
There are other approaches, like just throwing an exception or asserting if it's ever called in the course of running. People offer some other thoughts here:
Find out if a function is called within a C++ project?
As @BrianSchlenker points out, the less-than operator is definitely used despite the lack of explicit calls in the code shown. It's used to order the elements of map<Point,int> id;. The C++ std::map type imposes ordering on its contents, and defaults to using operator< to achieve this ordering...though you may override this. If you print something out inside the less-than function, you'll see it called every time the id map is interacted with.
(Note: If you want an unordered map you have to use std::unordered_map, but that requires your datatype to support std::hash, as well as a test for equality.)
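A quick sketch of that print-inside-the-comparator idea (my addition, not from the original code):

#include <iostream>
#include <map>

struct Point {
    int x, y;
    bool operator < (const Point &a) const {
        std::cout << "operator< called\n";  // temporary instrumentation
        return x < a.x || (x == a.x && y < a.y);
    }
};

int main() {
    std::map<Point, int> id;
    id[Point{1, 2}] = 5;          // prints: the map orders keys via operator<
    (void)id.count(Point{1, 2});  // prints again during the lookup
}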
In general: this code is not stylized in a maintainable or readable manner. I'd suggest that if you want to learn methods for increasing C++ program performance, you avoid the tarpit of reading any piece of obfuscated code you find...just because it happened to catch your attention.
Can you learn from it? I guess, but de-obfuscating it and commenting it will be your first step. Not a great idea, especially if you have to go asking others to help you do it, because even if they know how...they probably don't want to. Better would be to work through steps to improve your own implementation in a stable logical way, where you don't step too far outside of your sphere of understanding in any one step.
(Though if you can find the original author of such things, you might be able to engage them in a conversation about it and comment it for you. If they don't have the interest, why would random people on the Internet?)

2 while loops vs if else statement in 1 while loop

First a little introduction:
I'm a novice C++ programmer (I'm new to programming) writing a little multiplication tables practising program. The project started as a small program to teach myself the basics of programming, and I keep adding new features as I learn more and more. At first it just had basics like asking for input, loops and if-else statements, but now it uses vectors, reads and writes files, creates a directory, etc.
You can see the code here: Project on Bitbucket
My program now is going to have 2 modes: practise a single multiplication table that the user can choose himself, or practise all multiplication tables mixed. The two modes work quite differently internally. I developed the mixed mode as a separate program, as that would ease development: I could focus on writing the code itself instead of also bothering with how to integrate it into the existing code.
Below the code of the currently separate mixed mode program:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <algorithm>
#include <time.h>
using namespace std;
using std::string;
int newquestion(vector<int> remaining_multiplication_tables, vector<int> multiplication_tables, int table_selecter) {
    cout << remaining_multiplication_tables[table_selecter] << " * " << multiplication_tables[remaining_multiplication_tables[table_selecter] - 1] << " =" << "\n";
    return remaining_multiplication_tables[table_selecter] * multiplication_tables[remaining_multiplication_tables[table_selecter] - 1];
}

int main() {
    int usersanswer_int;
    int cpu_answer;
    int table_selecter;
    string usersanswer;
    vector<int> remaining_multiplication_tables = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    vector<int> multiplication_tables(10, 1); //fill vector with 10 elements that contain the value '1'. This vector will store the "progress" of each multiplication table.
    srand(time(0));
    table_selecter = rand() % remaining_multiplication_tables.size();
    cpu_answer = newquestion(remaining_multiplication_tables, multiplication_tables, table_selecter);
    while (remaining_multiplication_tables.size() != 0) {
        getline(cin, usersanswer);
        stringstream usersanswer_stream(usersanswer);
        usersanswer_stream >> usersanswer_int;
        if (usersanswer_int == cpu_answer) {
            cout << "Your answer is correct! :)" << "\n";
            if (multiplication_tables[remaining_multiplication_tables[table_selecter] - 1] == 10) {
                remaining_multiplication_tables.erase(remaining_multiplication_tables.begin() + table_selecter);
            }
            else {
                multiplication_tables[remaining_multiplication_tables[table_selecter] - 1] += 1;
            }
            if (remaining_multiplication_tables.size() != 0) {
                table_selecter = rand() % remaining_multiplication_tables.size();
                cpu_answer = newquestion(remaining_multiplication_tables, multiplication_tables, table_selecter);
            }
        }
        else {
            cout << "Unfortunately your answer isn't correct! :(" << "\n";
        }
    }
    return 0;
}
As you can see the newquestion function for the mixed mode is quite different. Also the while loop includes other mixed mode specific code.
Now if I want to integrate the mixed multiplication tables mode into the existing main program I have 2 choices:
- I can clutter up the while loop with if-else statements that check, each time the loop runs, whether mode == 10 (single multiplication table mode) or mode == 100 (mixed multiplication tables mode), and also place an if-else statement in the newquestion() function to check whether mode == 10 or mode == 100.
- I can let the program check on startup whether the user chose single or mixed multiplication tables mode, and create 2 while loops and 2 newquestion() functions. That would look like this:
int newquestion_mixed() {
    //newquestion function for mixed mode
}

int newquestion_single() {
    //newquestion function for single mode
}

//initialization
if (mode == 10) {
    //create necessary variables for single mode
    while (...) {
        //single mode loop
    }
}
else {
    //create necessary variables for mixed mode
    while (...) {
        //mixed mode loop
    }
}
Now why would I bother creating 2 separate loops and functions? Well, isn't it inefficient if the program checks which mode the user chose each time the loop runs (each time the user is asked a new question, for example '5 * 3 =')? I'm worried about performance with this option. Now I hear you think: why would you bother about performance for such a simple, little, non-performance-critical application, given today's extremely powerful processors and huge amounts of RAM? Well, as I said earlier, this program is mainly about teaching myself good coding style and learning how to program, so I want to teach myself good habits from the beginning.
The 2-while-loops option is more efficient and will use less CPU, but it takes more space and involves duplicated code. I don't know if that is good style either.
So basically I'm asking the experts what's the best style/way to handle this kind of things. Also if you spot something bad in my code/bad style please tell me, I'm very open to feedback because I'm still a novice. ;)
First, a fundamental rule of programming is "don't prematurely optimize the code": don't fiddle around with little details before you have the code working correctly, and write code that expresses what you want done as clearly as possible. This is good coding style. Obsessing over the details of "which is faster" (in a loop that spends most of its time waiting for the user to input some number) is not good coding style.
Once it's working correctly, analyse (using for example a profiler tool) where the code is spending its time, assuming performance is a major factor in the first place. Once you have located the major hotspot, try to make it better in some way; how you go about that depends very much on what that particular hotspot code does.
Which performs best will depend highly on the details of the code and the compiler (and which compiler optimizations are chosen). It is quite likely that having an if inside a while loop will run slower, but modern compilers are quite clever, and I have certainly seen cases where the compiler hoists such a choice out of the loop when the condition doesn't change. Having two while loops is harder for the compiler to "make better", because it most likely won't see that you are doing the same thing in both loops [the compiler works from the bottom of the parse tree up, so it optimizes the inside of the while loop first, then works outward to the if-else side, and at that point it has "lost track" of what's going on inside each loop].
Which is clearer: one while loop with an if inside, or an if with two while loops? That's another good question.
Of course, the object oriented solution is to have two classes - one for mixed, another for single - and just run one loop, that calls the relevant (virtual) member function of the object created based on an if-else statement before the loop.
Modern CPU branch predictors are so good that if during the loop the condition never changes, it will probably be as fast as having two while loops in each branch.
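A minimal sketch of that object-oriented layout (the names are mine and the question bodies are stubs):

#include <iostream>
#include <memory>

// One class per mode; the loop makes a virtual call instead of
// re-testing the mode flag on every iteration.
struct QuestionMode {
    virtual ~QuestionMode() {}
    virtual int newquestion() = 0;  // prints a question, returns its answer
};

struct SingleMode : QuestionMode {
    int newquestion() { std::cout << "5 * 3 =\n"; return 15; }  // stub
};

struct MixedMode : QuestionMode {
    int newquestion() { std::cout << "7 * 8 =\n"; return 56; }  // stub
};

int main() {
    int mode = 10;  // chosen once at startup
    std::unique_ptr<QuestionMode> quiz(
        mode == 10 ? static_cast<QuestionMode*>(new SingleMode)
                   : static_cast<QuestionMode*>(new MixedMode));
    // one loop serves both modes:
    int cpu_answer = quiz->newquestion();
    // ... read the user's answer, compare with cpu_answer, repeat ...
}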

How to correctly benchmark a [templated] C++ program

<background>
I'm at a point where I really need to optimize C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but I then used virtual functions called in nested loops. I had bad feelings about that and the first implementation proved that this was a bad idea. However this was OK for testing the concept.
</background>
Now I need this feature to be as fast as possible (well, without assembly code or GPU calculation; this still has to be C++, and more readable rather than less).
Now I know a little bit more about templates and class policies (from Alexandrescu's excellent book) and I think that a compile-time code generation may be the solution.
However I need to test the design before doing the huge work of implementing it into the library. The question is about the best way to test the efficiency of this new feature.
Obviously I need to turn optimizations on, because without them g++ (and probably other compilers as well) would keep some unnecessary operations in the object code. I also need to make heavy use of the new feature in the benchmark, because a delta of 1e-3 seconds can make the difference between a good and a bad design (this feature will be called millions of times in the real program).
The problem is that g++ is sometimes "too smart" while optimizing and can remove a whole loop if it considers that the result of a calculation is never used. I've already seen that once when looking at the output assembly code.
If I add some printing to stdout, the compiler will then be forced to do the calculation in the loop but I will probably mostly benchmark the iostream implementation.
So how can I do a correct benchmark of a little feature extracted from a library?
Related question: is it a correct approach to do this kind of in vitro tests on a small unit or do I need the whole context ?
Thanks for the advice!
There seem to be several strategies, from compiler-specific options allowing fine tuning to more general solutions that should work with every compiler like volatile or extern.
I think I will try all of these.
Thanks a lot for all your answers!
If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.
template<typename T> void sink(T const& t) {
    volatile T sinkhole = t;
}
No iostream overhead, just a copy that has to remain in the generated code.
Now, if you're collecting results from a lot of operations, it's best not to discard them one by one. These copies can still add some overhead. Instead, somehow collect all results in a single non-volatile object (so all individual results are needed) and then assign that result object to a volatile. E.g. if your individual operations all produce strings, you can force evaluation by adding all char values together modulo 1<<32. This adds hardly any overhead; the strings will likely be in cache. The result of the addition will subsequently be assigned to a volatile, so each char in each string must in fact be calculated, no shortcuts allowed.
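A sketch of that collect-then-sink pattern (assuming the benchmarked operations produce strings):

#include <string>
#include <vector>

volatile unsigned sinkhole;  // the one volatile write the compiler must keep

void consume(const std::vector<std::string>& results)
{
    unsigned sum = 0;  // unsigned arithmetic wraps, i.e. works modulo 2^32
    for (const std::string& s : results)
        for (char c : s)
            sum += static_cast<unsigned char>(c);
    sinkhole = sum;  // forces every char of every result to be computed
}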
Unless you have a really aggressive compiler (can happen), I'd suggest calculating a checksum (simply add all the results together) and output the checksum.
Other than that, you might want to look at the generated assembly code before running any benchmarks so you can visually verify that any loops are actually being run.
A compiler is only allowed to eliminate code branches that cannot happen. As long as it cannot rule out that a branch will be executed, it will not eliminate it. As long as there is some data dependency somewhere, the code will be there and will be run. Compilers are not too smart about estimating which aspects of a program will never run, and they don't try to be, because that is undecidable in general. They have some simple checks, such as for if (0), but that's about it.
My humble opinion is that you were possibly hit by some other problem earlier on, such as the way C/C++ evaluates boolean expressions.
But anyways, since this is about a test of speed, you can check that things get called for yourself - run it once without, then another time with a test of return values. Or a static variable being incremented. At the end of the test, print out the number generated. The results will be equal.
To answer your question about in-vitro testing: yes, do that. If your app is so time-critical, do that. On the other hand, your description hints at a different problem: if your deltas are in a timeframe of 1e-3 seconds, then that sounds like a problem of computational complexity, since the method in question must be called very, very often (for a few runs, 1e-3 seconds is negligible).
The problem domain you are modeling sounds VERY complex and the datasets are probably huge. Such things are always an interesting effort. Make sure that you absolutely have the right data structures and algorithms first, though, and micro-optimize all you want after that. So, I'd say look at the whole context first. ;-)
Out of curiosity, what is the problem you are calculating?
You have a lot of control over the optimizations for your compilation. -O1, -O2, and so on are just aliases for a bunch of switches.
From the man pages
-O2 turns on all optimization flags specified by -O. It also turns
on the following optimization flags: -fthread-jumps -falign-functions
-falign-jumps -falign-loops -falign-labels -fcaller-saves
-fcrossjumping -fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks -fexpensive-optimizations -fgcse
-fgcse-lm -foptimize-sibling-calls -fpeephole2 -fregmove
-freorder-blocks -freorder-functions -frerun-cse-after-loop
-fsched-interblock -fsched-spec -fschedule-insns -fschedule-insns2
-fstrict-aliasing -fstrict-overflow -ftree-pre -ftree-vrp
You can tweak and use this command to help you narrow down which options to investigate.
...
Alternatively you can discover which binary optimizations are
enabled by -O3 by using:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
Once you find the culprit optimization, you shouldn't need the couts.
If this is possible for you, you might try splitting your code into:
the library you want to test compiled with all optimizations turned on
a test program, dynamically linking the library, with optimizations turned off
Otherwise, you might specify a different optimization level (it looks like you're using gcc...) for the test function with the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes).
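For example, a sketch using that attribute (GCC-specific, and the function name is mine):

// Compile only this test harness at -O0; the library code it calls
// keeps the optimization level it was built with.
__attribute__((optimize("O0")))
void run_test_function()
{
    // ... exercise the feature under test ...
}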
You could create a dummy function in a separate cpp file that does nothing, but takes as argument whatever is the type of your calculation result. Then you can call that function with the results of your calculation, forcing gcc to generate the intermediate code, and the only penalty is the cost of invoking a function (which shouldn't skew your results unless you call it a lot!).
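A sketch of that idea (all names are mine; Result stands for whatever type your calculation produces):

// result.h (shared)
struct Result { double value; };

// sink.cpp: a separate translation unit, so without link-time
// optimization the compiler cannot see that the body does nothing.
void sink(const Result&) {}

// bench.cpp
void sink(const Result&);  // declaration only; the body is opaque here

void benchmark_loop(int n)
{
    for (int i = 0; i < n; ++i)
    {
        Result r = { 3.14 };  // stand-in for the real calculation
        sink(r);              // this call cannot be elided
    }
}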
#include <iostream>

// Mark coords as extern.
// The compiler is now NOT allowed to optimise away coords,
// thus it cannot remove the loop where you initialise it.
// This is because the data could be used by another compilation unit.
extern double coords[500][3];
double coords[500][3];

int main()
{
    //perform a simple initialization of all coordinates:
    for (int i = 0; i < 500; ++i)
    {
        coords[i][0] = 3.23;
        coords[i][1] = 1.345;
        coords[i][2] = 123.998;
    }

    std::cout << "hello world !" << std::endl;
    return 0;
}
edit: the easiest thing you can do is simply use the data in some spurious way after the function has run and outside your benchmarks. Like,
StartBenchmarking(); // ie, read a performance counter
for (int i=0; i<500; ++i)
{
coords[i][0] = 3.23;
coords[i][1] = 1.345;
coords[i][2] = 123.998;
}
StopBenchmarking(); // what comes after this won't go into the timer
// this is just to force the compiler to use coords
double foo = 0;  // initialise so the accumulation below is well-defined
for (int j = 0; j < 500; ++j)
{
    foo += coords[j][0] + coords[j][1] + coords[j][2];
}
cout << foo;
What sometimes works for me in these cases is to hide the in vitro test inside a function and pass the benchmark data sets through volatile pointers. This tells the compiler that it must not collapse subsequent writes to those pointers (because they might be eg memory-mapped I/O). So,
void test1(volatile double *coords)
{
    //perform a simple initialization of all coordinates:
    for (int i = 0; i < 1500; i += 3)
    {
        coords[i + 0] = 3.23;
        coords[i + 1] = 1.345;
        coords[i + 2] = 123.998;
    }
}
For some reason I haven't figured out yet, it doesn't always work in MSVC, but it often does; look at the assembly output to be sure. Also remember that volatile will foil some compiler optimizations (it forbids the compiler from keeping the pointer's contents in a register and forces writes to occur in program order), so this is only trustworthy if you're using it for the final write-out of data.
In general in vitro testing like this is very useful so long as you remember that it is not the whole story. I usually test my new math routines in isolation like this so that I can quickly iterate on just the cache and pipeline characteristics of my algorithm on consistent data.
The difference between test-tube profiling like this and running it in "the real world" means you will get wildly varying input data sets (sometimes best case, sometimes worst case, sometimes pathological), the cache will be in some unknown state on entering the function, and you may have other threads banging on the bus; so you should run some benchmarks on this function in vivo as well when you are finished.
I don't know if GCC has a similar feature, but with VC++ you can use:
#pragma optimize
to selectively turn optimizations on/off. If GCC has similar capabilities, you could build with full optimization and just turn it off where necessary to make sure your code gets called.
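In MSVC that looks like this (sketch):

#pragma optimize("", off)  // disable optimizations from here on...
void benchmark_target()
{
    // ... code whose effects must not be optimized away ...
}
#pragma optimize("", on)   // ...and restore the command-line settings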
Just a small example of an unwanted optimization:
#include <vector>
#include <iostream>
using namespace std;

int main()
{
    double coords[500][3];

    //perform a simple initialization of all coordinates:
    for (int i = 0; i < 500; ++i)
    {
        coords[i][0] = 3.23;
        coords[i][1] = 1.345;
        coords[i][2] = 123.998;
    }

    cout << "hello world !" << endl;
    return 0;
}
If you comment out the code from "double coords[500][3]" to the end of the for loop, it generates exactly the same assembly code (just tried with g++ 4.3.2). I know this example is far too simple, and I wasn't able to show this behavior with a std::vector of a simple "Coordinates" structure.
However I think this example still shows that some optimizations can introduce errors in the benchmark and I wanted to avoid some surprises of this kind when introducing new code in a library. It's easy to imagine that the new context might prevent some optimizations and lead to a very inefficient library.
The same should also apply to virtual functions (but I don't prove it here). Used in a context where a static link would do the job, I'm pretty confident that decent compilers should eliminate the extra indirect call for the virtual function. I can try this call in a loop and conclude that calling a virtual function is not such a big deal.
Then I'll call it hundreds of thousands of times in a context where the compiler cannot guess the exact type of the pointer, and see a 20% increase in running time...
At startup, read from a file. In your code, say if (input == "x") cout << result_of_benchmark;
The compiler will not be able to eliminate the calculation, and if you ensure the input is not "x", you won't benchmark the iostream.
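A self-contained sketch of that trick (run_benchmark is a hypothetical stand-in for the code under test):

#include <iostream>
#include <string>

double run_benchmark()
{
    double sum = 0;  // stand-in for the real calculation
    for (int i = 0; i < 1000000; ++i)
        sum += i * 0.5;
    return sum;
}

int main()
{
    std::string input;
    std::getline(std::cin, input);  // or read from a file at startup

    double result = run_benchmark();

    // The compiler cannot prove this branch is dead, so the calculation
    // must happen; feed anything but "x" to avoid timing the iostream.
    if (input == "x")
        std::cout << result;
}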