Visual studio: how to make C++ consoleApplication use more of CPU power? - c++

So I am running a console project, but while the code is running I see in Task Manager that only 5% (2.8 GHz) of the CPU is being used. Of course, I am not exactly sure how Windows distributes processing power to begin with, but for future reference I would like to know: if I had performance-demanding code and needed the answer faster, how would I do that?
Here is the code, if you would like to see it:
#include "stdafx.h"
#include <iostream>
#include <string.h>
using namespace std;
void swap(char *x, char *y)
{
char temp;
temp = *x;
*x = *y;
*y = temp;
}
void permute(char *a, int l, int r)
{
int i;
if (l == r)
cout << a << endl;
else
{
for (i = l; i <= r; i++)
{
swap((a + l), (a + i));
permute(a, l + 1, r);
swap((a + l), (a + i));
}
}
}
int main()
{
    char Short[] = "ABCD";
    int n1 = strlen(Short);
    char Long[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    int n2 = strlen(Long);
    while (true)
    {
        cout << "Would you like to see the permutations of only a) ABCD or b) the whole alphabet?!\n(please enter a or b): ";
        char input;
        cin >> input;
        if (input == 'a')
        {
            cout << "The permutations of ABCD:\n";
            permute(Short, 0, n1 - 1);
            cout << "-----------------------------------";
        }
        else if (input == 'b')
        {
            cout << "The permutations of the alphabet:\n";
            permute(Long, 0, n2 - 1);
            cout << "-----------------------------------";
        }
        else
        {
            cout << "ERROR! : Enter either a or b.\n";
        }
    }
}
I found the code on a blog showing the permutations of "ABCD" as part of an assignment, but I also used it for the entire alphabet, and I wanted to know: for that use, is there a way to make the code use more CPU? (It's taking much longer than I expected.)

Learning to optimize code efficiently is a major challenge even for experienced coders, and there are volumes of books, articles, and presentations on the topic. As such, a complete treatment is well out of scope for a Stack Overflow question.
That said, here are a few principles:
Focus initially on the algorithm. You can write a messy bubble sort or an efficient one, but in most 'real world' cases quicksort will beat either handily. This is arguably the primary reason the field of computer science exists: the study and selection of algorithms and their theoretical performance.
Related to this, make sure you are comparing your implementation against a 'stock' algorithm when possible. For example, you should see how your implementation performs compared to using std::random_shuffle (or, since C++11, std::shuffle) in the <algorithm> header.
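For the permutation code in the question specifically, a closer stock comparison is std::next_permutation, also in <algorithm>. A minimal sketch (it counts rather than prints, so the comparison measures the algorithm rather than the console; the count variable is illustrative):

#include <algorithm>
#include <iostream>
#include <string>

int main()
{
    std::string s = "ABCD";
    std::sort(s.begin(), s.end());   // must start sorted to visit every permutation exactly once
    long long count = 0;
    do {
        ++count;                     // or print s, as the original program does
    } while (std::next_permutation(s.begin(), s.end()));
    std::cout << count << " permutations\n";   // 24 for "ABCD"
}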
Optimize the compiler settings before hand-tuning the code. Debug builds are never going to be fast, and they aren't supposed to be. Marking functions inline can help, but only if the compiler actually performs inlining optimizations. For Visual C++, there are a number of different optimization settings you can try out, but remember that there are tradeoffs, so /Ox (full optimization) may not always be the right choice, which is why most project templates default to /O2 (maximize speed). In some cases, /O1 (minimize size) is actually better.
Always measure performance before and after optimization. Modern out-of-order CPUs are sophisticated systems, and they don't always do what you think they are doing. In many cases, what is a textbook optimization in code actually performs worse than the original code due to various pipelining and microarchitecture effects. The only way to know for sure is to use a good profiler, have solid test cases, and measure the impact of any optimization work. If it's slower on average than before, then revert to the 'unoptimized' version and try something else.
Focus optimization on the hotspots. This is the so-called '80/20' rule. In many applications the vast majority of the code is run rarely, so only a few areas of your application are actually spending enough time running to be worth optimizing.
As a corollary to this rule, having all of your code using extremely inefficient anti-patterns can really hurt the baseline performance of your entire application. For this reason, it's worth knowing how to write good code generally. The point of the 80/20 rule is to spend your limited time optimizing on the areas that will have the most impact rather than what you as the programmer assume matters.
All that said, in your case none of this matters. The vast majority of the CPU time is spent just creating your process and handling the serialized input and output. With an n of 4 or 26, it doesn't matter how good or bad your algorithm is. In other words, it is highly unlikely that permute is your program's 'hotspot' unless you are working with tens of thousands to millions of characters.

NOTE: Yes I am oversimplifying the topic, but I'm concerned that
without this basic understanding, the more advanced topics will
actually lead to some disastrous program designs.
Maybe I'm missing something, but there also seems to be a misunderstanding regarding the link between CPU and efficiency in your mind.
Your program has N instructions, and the CPU will process those N instructions at relatively the same speed (3.56 GHz is about 3.56 billion instructions per second). That's the same (more or less), whether you're getting "5%" or "25%" use of the CPU from a single program. (I'll explain that percentage in a moment.)
The only way to get "faster" in terms of processor usage is, as erip said, with parallel computing techniques, which in a nutshell employ multiple CPUs to accomplish the task.
If you think of it like an assembly line, your one worker can only process one widget at a time. If your batch of widgets takes up 5% of his time, that means that processing ALL of your widgets one by one uses 5% of his time, and the other 95% is not needed for that batch (he'll probably use it for other batches other people have assigned him).
He cannot process more than one widget at a time, so that's as fast as he'll get with your batch. You might be able to make things appear faster by having him alternate between two different types of widgets, instead of finishing all of batch A before starting on batch B, but it will still take the same amount of time in the end to process both batches.
MASSIVE EXCEPTION: If he's spending 100% of his time on someone else's batch of widgets, you're literally going to have to cool your heels. That's not something you can do a thing about.
However, if you add another worker to that assembly line, they can process twice (roughly) the widgets in the same amount of time, because you are processing two widgets at once. When we say you have a "quad core processor", that basically means that you have four workers available (literally 4 CPUs). Each one can only process a single instruction at once, but by assigning more than one to the batch of widgets, you get it done faster.
All of this said, one must keep in mind that those CPUs are doing a lot - they run the entire computer. You want to try and keep those percentages down as much as possible, so your program is fast and responsive on any supported computer. Not all of your users will have 3.46 GHz quad-core machines, after all.
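To make the multi-worker picture concrete for the permutation program above, here is a rough sketch using std::thread: each thread fixes a different first character and permutes the rest. The function and variable names are illustrative, and the sketch counts results instead of printing them, because synchronized console output would serialize the threads again.

#include <cstring>
#include <iostream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Same recursion as the question's permute, but counting instead of printing.
static void permuteCount(char* a, int l, int r, long long& count)
{
    if (l == r) { ++count; return; }
    for (int i = l; i <= r; ++i)
    {
        std::swap(a[l], a[i]);
        permuteCount(a, l + 1, r, count);
        std::swap(a[l], a[i]);
    }
}

int main()
{
    const char* base = "ABCDEFGHIJ";   // 10 characters keeps the run short
    const int n = static_cast<int>(std::strlen(base));

    std::vector<std::thread> workers;
    std::vector<long long> counts(n, 0);

    for (int first = 0; first < n; ++first)
    {
        workers.emplace_back([&, first] {
            std::string work(base);
            std::swap(work[0], work[first]);                // fix the first character
            permuteCount(&work[0], 1, n - 1, counts[first]); // permute the remainder
        });
    }
    for (auto& t : workers) t.join();

    long long total = 0;
    for (long long c : counts) total += c;
    std::cout << "Total permutations: " << total << "\n";   // 10! = 3,628,800
}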

Surely the reason this program is not using all available CPU bandwidth is because it's emitting the permutation results to the screen once for each permutation. This will result in blocking I/O within the implementation of cout.
If you want 100% cpu use you'll want to separate computation from I/O. In this case you'd then need to either:
a) store the results for later output, or
b) communicate results across a thread boundary (which will itself have an efficiency cost because of acquiring mutexes and synchronising cache memory), or
c) a combination of the above (batching results and communicating them across the thread boundary)
For a quick check, you could comment out all the cout calls and see how much CPU use you get (as mentioned, it will be close to 100% divided by the number of CPUs on your computer).
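As a rough sketch of options (a)/(c) for the permutation program, the results can be accumulated in memory and written out in one go; the buffer-taking permute overload below is an assumption, not the original code.

#include <iostream>
#include <sstream>
#include <utility>

// Same recursion as the question, but writing into an in-memory buffer.
void permute(char* a, int l, int r, std::ostringstream& out)
{
    if (l == r) { out << a << '\n'; return; }
    for (int i = l; i <= r; ++i)
    {
        std::swap(a[l], a[i]);
        permute(a, l + 1, r, out);
        std::swap(a[l], a[i]);
    }
}

int main()
{
    char word[] = "ABCDEFGH";
    std::ostringstream buffer;
    permute(word, 0, 7, buffer);   // compute first...
    std::cout << buffer.str();     // ...then emit with a single write
}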

Related

Dealing with heavy profiling of execution times in C++

I'm currently working on a scientific computing project involving huge data and complex algorithms, so I need to do a lot of code profiling. I'm currently relying on <ctime> and clock_t to time the execution of my code. I'm perfectly happy with this solution... except that I'm basically timing everything and thus for every line of real code I have to call start_time_function123 = clock(), end_time_function123 = clock() and cout << "function123 execution time: " << (end_time_function123-start_time_function123) / CLOCKS_PER_SEC << endl. This leads to heavy code bloating and quickly makes my code unreadable. How would you deal with that?
The only solution I can think of would be to find an IDE allowing me to mark portions of my code (at different locations, even in different files) and to toggle hide/show all marked code with one button. This would allow me to hide the part of my code related to profiling most of the time and display it only whenever I want to.
Have a RAII type that marks code as timed.
#include <ctime>
#include <iostream>

struct timed {
    char const* name;
    clock_t start;

    timed(char const* name_to_record) :
        name(name_to_record),
        start(clock())
    {}

    ~timed() {
        auto end = clock();
        std::cout << name << " execution time: "
                  << double(end - start) / CLOCKS_PER_SEC << " sec" << std::endl;
    }
};
Then use it:
void foo(){
    timed timer(__func__);
    // code
}
Far less noise.
You can augment this with non-scope-based finish operations. When doing heavy profiling I sometimes like to include unique IDs. Using cout, especially with endl, could end up dominating the timing; fast recording into a large buffer that is dumped out asynchronously may be optimal. If you need to time at the millisecond level, even allocation, locks and string manipulation should be avoided.
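A sketch of that "record fast, report later" idea (the asynchronous dump is omitted; names such as ScopedTimer and g_samples are illustrative, and the buffer is pre-reserved so measurement itself doesn't trigger allocations):

#include <chrono>
#include <cstdio>
#include <vector>

struct Sample { const char* name; std::chrono::nanoseconds elapsed; };

std::vector<Sample> g_samples;

struct ScopedTimer
{
    const char* name;
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    explicit ScopedTimer(const char* n) : name(n) {}
    ~ScopedTimer()
    {
        // Only an append to a preallocated buffer happens while measuring.
        g_samples.push_back({ name,
            std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::steady_clock::now() - start) });
    }
};

void report()   // formatting and I/O happen only after the measured section
{
    for (const Sample& s : g_samples)
        std::printf("%s: %lld ns\n", s.name, static_cast<long long>(s.elapsed.count()));
}

int main()
{
    g_samples.reserve(1024);   // avoid reallocation while measuring
    { ScopedTimer t("work"); /* code under test */ }
    report();
}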
You don't say so explicitly, but I assume you are looking for possible speedups - ways to reduce the time it takes.
You think you need to do this by measuring how much time different parts of it take. If you're interested, there's an orthogonal way to approach it.
Just get it running under a debugger (using a non-optimized debug build).
Manually interrupt it at random, by Ctrl-C, Ctrl-Break, or the IDE's "pause" button.
Display the call stack and carefully examine what the program is doing, at all levels.
Do this with a suspicion that whatever it's doing could be something wasteful that you could find a better way to do.
Then if you start it up again, and halt it again, and see it doing the same thing or something similar, you know you will get a substantial speedup if you fix it.
The fewer samples you took to see that thing twice, the more speedup you will get.
That's the random pausing technique, and the statistical rationale is here.
The reason you do it on a debug build is here.
After you've cut out the fat using this method, you can switch to an optimized build and get the extra margin it gives you.

Correct QueryPerformanceCounter Function implementation / Time changes everytime

I have to create a sorting algorithm function that returns number of comparisons, number of copies and number of MICROSECONDS it uses to finish its sorting.
I have seen that to get microseconds I have to use QueryPerformanceCounter, as it's accurate (P.S. I know it isn't portable between OSes).
So I've done this:
#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency

void scambio(int &a, int &b);  // swaps two integers (defined elsewhere)

void Exchange_sort(int vect[], int dim, int &countconf, int &countcopy, double &time)
{
    LARGE_INTEGER a, b, oh, freq;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&a);
    QueryPerformanceCounter(&b);
    oh.QuadPart = b.QuadPart - a.QuadPart;   // overhead of the timing calls themselves

    QueryPerformanceCounter(&a);
    int i = 0, j = 0;                        // the sorting algorithm starts
    for (i = 0; i < dim - 1; i++)
    {
        for (j = i + 1; j < dim; j++)
        {
            countconf++;                     // +1 comparison
            if (vect[i] > vect[j])
            {
                scambio(vect[i], vect[j]);   // swap the two elements
                countcopy = countcopy + 3;   // +3 copies
            }
        }
    }
    QueryPerformanceCounter(&b);             // end timer
    time = (((double)(b.QuadPart - a.QuadPart - oh.QuadPart)) / freq.QuadPart)
           * 1000000;                        // microseconds
}
The *1000000 is to give microseconds...
I think it should work like this, but every time I call the function with the same array size it returns a different time... How can I solve that?
Thank you very much, and sorry for my bad coding.
Firstly, the performance counter frequency might not be that great. It's usually several hundred thousand or more, which gives a microsecond or tens of microseconds resolution, but you should be aware that it can be even worse.
Secondly, if your array size is small, your sort might finish in nanoseconds or microseconds, and you would not be able to measure that accurately with QueryPerformanceCounter.
Thirdly, when your benchmark process is running, Windows might take the CPU away from it for a (relatively) long time, milliseconds or maybe even hundreds of milliseconds. This will lead to highly irregular and seemingly erratic timings.
I have two suggestions that you might pursue independently of each other:
I suggest you investigate using the RDTSC instruction (via inline assembly, compiler intrinsics, or an existing library), which will most likely give you better resolution with far less overhead. But I have to warn you that it has its own bag of problems.
For this type of benchmark, you have to run your sort routine with the exact same input many times (tens or hundreds) and then take the smallest time measurement. The reason that you should adopt this strategy is that there are a few phenomena that will interfere with your timing and make it longer, but there is nothing that can make your sort go faster than it would on paper. Therefore, you need to run the test many many times and hope to all your gods that the fastest time you've measured is the actual running time with no interference or noise.
UPDATE: Reading through the comments on the question, it seems that you are trying to time a very short-running piece of code with a timer that doesn't have enough resolution. Either increase your input size, or use RDTSC.
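A minimal sketch of the "run it many times and keep the smallest measurement" strategy described above (run_sort is a stand-in for the sort under test; the input size and run count are arbitrary):

#include <windows.h>
#include <algorithm>
#include <iostream>
#include <vector>

void run_sort(std::vector<int> v)           // taken by copy: every run gets identical input
{
    std::sort(v.begin(), v.end());
}

int main()
{
    std::vector<int> input(10000);
    for (int i = 0; i < 10000; ++i) input[i] = 10000 - i;   // worst-case-ish input

    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    double best_us = 1e300;
    for (int run = 0; run < 100; ++run)
    {
        LARGE_INTEGER a, b;
        QueryPerformanceCounter(&a);
        run_sort(input);
        QueryPerformanceCounter(&b);
        double us = (b.QuadPart - a.QuadPart) * 1e6 / freq.QuadPart;
        if (us < best_us) best_us = us;     // keep the least-disturbed run
    }
    std::cout << "best of 100 runs: " << best_us << " microseconds\n";
}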
The short answer to your question is that it is not possible to measure exactly the same time for all calls of the same function.
The fact that you are receiving different times is expected because your operating system is not a perfect Real-Time System, but a general purpose OS with multiple processes running at the same time and competing to be scheduled by the kernel to get its own CPU cycles.
Also consider that, each time you execute your program or function, some of its instructions might be in RAM and some might be in the CPU's L1 or L2 cache, and this will probably change from one execution to the next. So there are lots of variables to consider when evaluating the elapsed time of function calls with a high level of precision.

2 while loops vs if else statement in 1 while loop

First a little introduction:
I'm a novice C++ programmer (I'm new to programming) writing a little multiplication tables practising program. The project started as a small program to teach myself the basics of programming and I keep adding new features as I learn more and more about programming. At first it just had basics like ask for input, loops and if-else statements. But now it uses vectors, read and writes to files, creates a directory etc.
You can see the code here: Project on Bitbucket
My program is now going to have 2 modes: practise a single multiplication table that the user can choose himself, or practise all multiplication tables mixed. The two modes work quite differently internally, and I developed the mixed mode as a separate program, since that would ease development: I could just focus on writing the code itself instead of also worrying about how to integrate it into the existing code.
Below the code of the currently separate mixed mode program:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <algorithm>
#include <time.h>
using namespace std;
using std::string;

int newquestion(vector<int> remaining_multiplication_tables, vector<int> multiplication_tables, int table_selecter){
    cout << remaining_multiplication_tables[table_selecter] << " * "
         << multiplication_tables[remaining_multiplication_tables[table_selecter]-1] << " =" << "\n";
    return remaining_multiplication_tables[table_selecter] * multiplication_tables[remaining_multiplication_tables[table_selecter]-1];
}

int main(){
    int usersanswer_int;
    int cpu_answer;
    int table_selecter;
    string usersanswer;
    vector<int> remaining_multiplication_tables = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    vector<int> multiplication_tables(10, 1); // fill vector with 10 elements that contain the value '1'. This vector will store the "progress" of each multiplication_table.
    srand(time(0));
    table_selecter = rand() % remaining_multiplication_tables.size();
    cpu_answer = newquestion(remaining_multiplication_tables, multiplication_tables, table_selecter);
    while(remaining_multiplication_tables.size() != 0){
        getline(cin, usersanswer);
        stringstream usersanswer_stream(usersanswer);
        usersanswer_stream >> usersanswer_int;
        if(usersanswer_int == cpu_answer){
            cout << "Your answer is correct! :)" << "\n";
            if(multiplication_tables[remaining_multiplication_tables[table_selecter]-1] == 10){
                remaining_multiplication_tables.erase(remaining_multiplication_tables.begin() + table_selecter);
            }
            else{
                multiplication_tables[remaining_multiplication_tables[table_selecter]-1] += 1;
            }
            if (remaining_multiplication_tables.size() != 0){
                table_selecter = rand() % remaining_multiplication_tables.size();
                cpu_answer = newquestion(remaining_multiplication_tables, multiplication_tables, table_selecter);
            }
        }
        else{
            cout << "Unfortunately your answer isn't correct! :(" << "\n";
        }
    }
    return 0;
}
As you can see the newquestion function for the mixed mode is quite different. Also the while loop includes other mixed mode specific code.
Now if I want to integrate the mixed multiplication tables mode into the existing main program I have 2 choices:
- I can clutter up the while loop with if-else statements that check, each time the loop runs, whether mode == 10 (single multiplication table mode) or mode == 100 (mixed multiplication tables mode), and also place an if-else statement in the newquestion() function to check whether mode == 10 or mode == 100.
- I can let the program check on startup whether the user chose the single multiplication table or the mixed multiplication tables mode, and create 2 while loops and 2 newquestion() functions. That would look like this:
int newquestion_mixed(){
    // newquestion function for mixed mode
}

int newquestion_single(){
    // newquestion function for single mode
}

// initialization
if (mode == 10){
    // create necessary variables for single mode
    while (...){
        // single mode loop
    }
}
else{
    // create necessary variables for mixed mode
    while (...){
        // mixed mode loop
    }
}
Now why would I bother creating 2 separate loops and functions? Well, isn't it inefficient if the program checks which mode the user chose each time the loop runs (each time the user is asked a new question, for example '5 * 3 =')? I'm worried about the performance of that option. Now I hear you think: why would you bother about performance for such a simple, little, non-performance-critical application, given today's extremely powerful processors and huge amounts of RAM? Well, as I said earlier, this program is mainly about teaching myself good coding style and learning how to program, so I want to teach myself good habits from the beginning.
The option with 2 while loops and 2 functions is much more efficient and will use less CPU, but it takes more space and involves duplicating code. I don't know if that is good style either.
So basically I'm asking the experts what the best style/way to handle this kind of thing is. Also, if you spot something bad in my code or bad style, please tell me; I'm very open to feedback because I'm still a novice. ;)
First, a fundamental rule of programming is "don't prematurely optimize the code": don't fiddle around with little details before you have the code working correctly, and write code that expresses what you want done as clearly as possible. This is good coding style. To obsess over the details of "which is faster" (in a loop that spends most of its time waiting for the user to input a number) is not good coding style.
Once it's working correctly, analyse (using, for example, a profiler tool) where the code is spending its time, assuming performance is a major factor in the first place. Once you have located the major hotspot, try to make it better in some way; how you go about that depends very much on what that particular hotspot code does.
As to which performs best, that will depend highly on the details of the code and the compiler (and on which compiler optimizations are chosen). It is quite likely that having an if inside a while loop will run slower, but modern compilers are quite clever, and I have certainly seen cases where the compiler hoists such a choice out of the loop when the condition doesn't change. Having two while loops is harder for the compiler to "make better", because it most likely won't see that you are doing the same thing in both loops [the compiler works from the bottom of the parse tree up, so it optimizes the inside of each while loop first, then moves out to the if-else, and at that point it has "lost track" of what's going on inside each loop].
Which is clearer, to have one while loop with an if inside, or an if with two while-loops, that's another good question.
Of course, the object oriented solution is to have two classes - one for mixed, another for single - and just run one loop, that calls the relevant (virtual) member function of the object created based on an if-else statement before the loop.
Modern CPU branch predictors are so good that if during the loop the condition never changes, it will probably be as fast as having two while loops in each branch.
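A rough sketch of that object-oriented arrangement (class and member names are illustrative, not taken from the project): the mode decision happens once, and the single loop calls a virtual newquestion() without re-checking the mode on every pass.

#include <cstdlib>
#include <iostream>
#include <memory>

struct QuestionMode
{
    virtual ~QuestionMode() = default;
    virtual int newquestion() = 0;            // prints a question, returns the expected answer
};

struct SingleMode : QuestionMode
{
    int table;
    explicit SingleMode(int t) : table(t) {}
    int newquestion() override
    {
        int factor = 1 + std::rand() % 10;
        std::cout << table << " * " << factor << " =\n";
        return table * factor;
    }
};

struct MixedMode : QuestionMode
{
    int newquestion() override
    {
        int a = 1 + std::rand() % 10, b = 1 + std::rand() % 10;
        std::cout << a << " * " << b << " =\n";
        return a * b;
    }
};

int main()
{
    // The if-else runs once, before the loop:
    std::unique_ptr<QuestionMode> mode = std::make_unique<MixedMode>();
    // ...or: std::unique_ptr<QuestionMode> mode = std::make_unique<SingleMode>(7);

    for (int asked = 0; asked < 3; ++asked)   // one loop body serves both modes
    {
        int expected = mode->newquestion();
        int answer;
        std::cin >> answer;
        std::cout << (answer == expected ? "Correct!\n" : "Not correct!\n");
    }
}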

Best way to test code speed in C++ without profiler, or does it not make sense to try?

On SO, there are quite a few questions about performance profiling, but I don't seem to find the whole picture. There are quite a few issues involved and most Q & A ignore all but a few at a time, or don't justify their proposals.
What I'm wondering about: if I have two functions that do the same thing and I'm curious about the difference in speed, does it make sense to test this without external tools, with timers, or will this compiled-in testing affect the results too much?
I ask this because, if it is sensible, as a C++ programmer I want to know how it should best be done, since timers are much simpler than external tools. If it makes sense, let's proceed with all the possible pitfalls:
Consider this example. The following code shows 2 ways of doing the same thing:
#include <algorithm>
#include <ctime>
#include <iostream>

typedef unsigned char byte;

inline
void
swapBytes( void* in, size_t n )
{
    byte* p = static_cast<byte*>( in );   // treat the buffer as bytes; void* cannot be indexed
    for( size_t lo=0, hi=n-1; hi>lo; ++lo, --hi )
        p[lo] ^= p[hi]
        , p[hi] ^= p[lo]
        , p[lo] ^= p[hi] ;
}

int
main()
{
    byte arr[9] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
    const int iterations = 100000000;

    clock_t begin = clock();
    for( int i=iterations; i!=0; --i )
        swapBytes( arr, 8 );
    clock_t middle = clock();
    for( int i=iterations; i!=0; --i )
        std::reverse( arr, arr+8 );
    clock_t end = clock();

    double secSwap = (double) ( middle-begin ) / CLOCKS_PER_SEC;
    double secReve = (double) ( end-middle ) / CLOCKS_PER_SEC;

    std::cout << "swapBytes, for: " << iterations << " times takes: " << middle-begin
              << " clock ticks, which is: " << secSwap << "sec." << std::endl;
    std::cout << "std::reverse, for: " << iterations << " times takes: " << end-middle
              << " clock ticks, which is: " << secReve << "sec." << std::endl;
    std::cin.get();
    return 0;
}
// Output:
// Release:
// swapBytes, for: 100000000 times takes: 3000 clock ticks, which is: 3sec.
// std::reverse, for: 100000000 times takes: 1437 clock ticks, which is: 1.437sec.
// Debug:
// swapBytes, for: 10000000 times takes: 1781 clock ticks, which is: 1.781sec.
// std::reverse, for: 10000000 times takes: 12781 clock ticks, which is: 12.781sec.
The issues:
Which timers to use and how get the cpu time actually consumed by the code under question?
What are the effects of compiler optimization (since these functions just swap bytes back and forth, the most efficient thing is obviously to do nothing at all)?
Considering the results presented here, do you think they are accurate (I can assure you that multiple runs give very similar results)? If yes, can you explain how std::reverse gets to be so fast, considering the simplicity of the custom function? I don't have the source code of the VC++ version that I used for this test, but here is the implementation from GNU. It boils down to the function iter_swap, which is completely incomprehensible to me. Would this also be expected to run twice as fast as the custom function, and if so, why?
Contemplations:
It seems two high precision timers are being proposed: clock() and QueryPerformanceCounter (on windows). Obviously we would like to measure the cpu time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements. This page on the gnu c library seems to contradict that, but when I put a breakpoint in vc++, the debugged process gets a lot of clock ticks even though it was suspended (I have not tested under gnu). Am I missing alternative counters for this, or do we need at least special libraries or classes for this? If not, is clock good enough in this example or would there be a reason to use the QueryPerformanceCounter?
What can we know for certain without debugging, disassembling and profiling tools? Is anything actually happening? Is the function call being inlined or not? When checking in the debugger, the bytes do actually get swapped, but I'd rather know from theory why, than from testing.
Thanks for any directions.
update
Thanks to a hint from tojas, the swapBytes function now runs as fast as std::reverse. I had failed to realize that the temporary copy, in the case of a byte, only needs a register and is therefore very fast. Elegance can blind you.
inline
void
swapBytes( byte* in, size_t n )
{
    byte t;
    for( int i=0; i<7-i; ++i )
    {
        t = in[i];
        in[i] = in[7-i];
        in[7-i] = t;
    }
}
Thanks to a tip from ChrisW I have found that on Windows you can get the actual CPU time consumed by a (read: your) process through Windows Management Instrumentation. This definitely looks more interesting than the high-precision counter.
Obviously we would like to measure the cpu time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements.
I do two things, to ensure that wall-clock time and CPU time are approximately the same thing:
Test for a significant length of time, i.e. several seconds (e.g. by testing a loop of however many thousands of iterations)
Test when the machine is more or less idle, except for whatever I'm testing.
Alternatively if you want to measure only/more exactly the CPU time per thread, that's available as a performance counter (see e.g. perfmon.exe).
What can we know for certain without debugging, disassembling and profiling tools?
Nearly nothing (except that I/O tends to be relatively slow).
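For what it's worth, on Windows the CPU time actually charged to your own process can be read directly with GetProcessTimes; a hedged sketch (FILETIME values are in 100-nanosecond units):

#include <windows.h>
#include <iostream>

int main()
{
    // ... run the code under test here ...

    FILETIME creationTime, exitTime, kernelTime, userTime;
    if (GetProcessTimes(GetCurrentProcess(), &creationTime, &exitTime, &kernelTime, &userTime))
    {
        ULARGE_INTEGER k, u;
        k.LowPart = kernelTime.dwLowDateTime;  k.HighPart = kernelTime.dwHighDateTime;
        u.LowPart = userTime.dwLowDateTime;    u.HighPart = userTime.dwHighDateTime;
        double cpu_ms = (k.QuadPart + u.QuadPart) / 10000.0;  // 100 ns units -> milliseconds
        std::cout << "CPU time consumed: " << cpu_ms << " ms\n";
    }
}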
To answer your main question: the "reverse" algorithm just swaps elements of the array and does not operate on the elements of the array.
Use QueryPerformanceCounter on Windows if you need high-resolution timing. The counter's accuracy depends on the CPU, but it can go up to per-clock-pulse resolution. However, profiling real-world operations is always a better idea.
Is it safe to say you're asking two questions?
Which one is faster, and by how much?
And why is it faster?
For the first, you don't need high precision timers. All you need to do is run them "long enough" and measure with low precision timers. (I'm old-fashioned, my wristwatch has a stop-watch function, and it is entirely good enough.)
For the second, surely you can run the code under a debugger and single-step it at the instruction level. Since the basic operations are so simple, you will be able to easily see roughly how many instructions are required for the basic cycle.
Think simple. Performance is not a hard subject. Usually, people are trying to find problems, for which this is a simple approach.
(This answer is specific to Windows XP and the 32-bit VC++ compiler.)
The easiest thing for timing little bits of code is the time-stamp counter of the CPU. This is a 64-bit value, a count of the number of CPU cycles run so far, which is about as fine a resolution as you're going to get. The actual numbers you get aren't especially useful as they stand, but if you average out several runs of various competing approaches then you can compare them that way. The results are a bit noisy, but still valid for comparison purposes.
To read the time-stamp counter, use code like the following:
LARGE_INTEGER tsc;
__asm {
cpuid
rdtsc
mov tsc.LowPart,eax
mov tsc.HighPart,edx
}
(The cpuid instruction is there to ensure that there aren't any incomplete instructions waiting to complete.)
There are four things worth noting about this approach.
Firstly, because of the inline assembly language, it won't work as-is on MS's x64 compiler. (You'll have to create a .ASM file with a function in it; an exercise for the reader, as I don't know the details. Alternatively, see the intrinsic-based sketch after this list.)
Secondly, to avoid problems with cycle counters not being in sync across different cores/threads/what have you, you may find it necessary to set your process's affinity so that it only runs on one specific execution unit. (Then again... you may not.)
Thirdly, you'll definitely want to check the generated assembly language to ensure that the compiler is generating roughly the code you expect. Watch out for bits of code being removed, functions being inlined, that sort of thing.
Finally, the results are rather noisy. The cycle counters count cycles spent on everything, including waiting for caches, time spent on running other processes, time spent in the OS itself, etc. Unfortunately, it's not possible (under Windows, at least) to time just your process. So, I suggest running the code under test a lot of times (several tens of thousands) and working out the average. This isn't very cunning, but it seems to have produced useful results for me at any rate.
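As noted in the first point above, the inline-assembly version won't build with MSVC's x64 compiler; the __cpuid and __rdtsc intrinsics from <intrin.h> give roughly the same reading and work on both x86 and x64. A minimal sketch:

#include <intrin.h>
#include <iostream>

unsigned long long read_tsc()
{
    int regs[4];
    __cpuid(regs, 0);   // serialize, as the cpuid in the asm version does
    return __rdtsc();   // read the 64-bit time-stamp counter
}

int main()
{
    unsigned long long before = read_tsc();
    // ... code under test ...
    unsigned long long after = read_tsc();
    std::cout << "cycles (approx): " << (after - before) << "\n";
}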
I would suppose that anyone competent enough to answer all your questions is going to be far too busy to answer all your questions. In practice it is probably more effective to ask a single, well-defined question. That way you may hope to get well-defined answers which you can collect and be on your way to wisdom.
So, anyway, perhaps I can answer your question about which clock to use on Windows.
clock() is not considered a high-precision clock. If you look at the value of CLOCKS_PER_SEC you will see it has a resolution of 1 millisecond. This is only adequate if you are timing very long routines, or a loop with tens of thousands of iterations. As you point out, if you try to repeat a simple method tens of thousands of times in order to get a time that can be measured with clock(), the compiler is liable to step in and optimize the whole thing away.
So, really, the only clock to use is QueryPerformanceCounter()
Is there something you have against profilers? They help a ton. Since you are on WinXP, you should really give the VTune trial a try. Try a call-graph sampling test and look at self time and total time of the functions being called. There's no better way to tune your program so that it's the fastest possible without being an assembly genius (and a truly exceptional one).
Some people just seem to be allergic to profilers. I used to be one of those and thought I knew best about where my hotspots were. I was often correct about obvious algorithmic inefficiencies, but practically always incorrect about more micro-optimization cases. Just rewriting a function without changing any of the logic (ex: reordering things, putting exceptional case code in a separate, non-inlined function, etc) can make functions a dozen times faster and even the best disassembly experts usually can't predict that without the profiler.
As for relying on simplistic timing tests alone, they are extremely problematic. That current test is not so bad but it's a very common mistake to write timing tests in ways in which the optimizer will optimize out dead code and end up testing the time it takes to do essentially a nop or even nothing at all. You should have some knowledge to interpret the disassembly to make sure the compiler isn't doing this.
Also timing tests like this have a tendency to bias the results significantly since a lot of them just involve running your code over and over in the same loop, which tends to simply test the effect of your code when all the memory in the cache with all the branch prediction working perfectly for it. It's often just showing you best case scenarios without showing you the average, real-world case.
Depending on real world timing tests is a little bit better; something closer to what your application will be doing at a high level. It won't give you specifics about what is taking what amount of time, but that's precisely what the profiler is meant to do.
Wha? How to measure speed without a profiler? The very act of measuring speed is profiling! The question amounts to, "how can I write my own profiler?" And the answer is clearly, "don't".
Besides, you should be using std::swap in the first place, which completely invalidates this whole pointless pursuit.
-1 for pointlessness.

Performance of comparisons in C++ ( foo >= 0 vs. foo != 0 )

I've been working on a piece of code recently where performance is very important, and essentially I have the following situation:
int len = some_very_big_number;
int counter = some_rather_small_number;

for( int i = len; i >= 0; --i ){
    while( counter > 0 && costly other stuff here ){
        /* do stuff */
        --counter;
    }
    /* do more stuff */
}
So here I have a loop that runs very often, and for a certain number of runs the while block will be executed as well, until the variable counter is reduced to zero; after that the while body will not be entered, because the first expression will be false.
The question is now, if there is a difference in performance between using
counter > 0 and counter != 0?
I suspect there would be; does anyone know the specifics?
To measure is to know.
Do you think that will solve your problem? :D
if(x >= 0)
00CA1011 cmp dword ptr [esp],0
00CA1015 jl main+2Ch (0CA102Ch) <----
...
if(x != 0)
00CA1026 cmp dword ptr [esp],0
00CA102A je main+3Bh (0CA103Bh) <----
In programming, the following statement is the sign designating the road to Hell:
I've been working on a piece of code recently where performance is very important
Write your code in the cleanest, most easy to understand way. Period.
Once that is done, you can measure its runtime. If it takes too long, measure the bottlenecks, and speed up the biggest ones. Keep doing that until it is fast enough.
The list of projects that failed or suffered catastrophic loss due to a misguided emphasis on blind optimization is large and tragic. Don't join them.
I think you're spending time optimizing the wrong thing. "costly other stuff here", "do stuff" and "do more stuff" are more important to look at. That is where you'll make the big performance improvements I bet.
There will be a huge difference if the counter starts with a negative number. Otherwise, on every platform I'm familiar with, there won't be a difference.
Is there a difference between counter > 0 and counter != 0? It depends on the platform.
A very common type of CPU is the Intel family we have in our PCs. Both comparisons will map to a single instruction on that CPU, and I assume they will execute at the same speed. However, to be certain you will have to run your own benchmark.
As Jim said, when in doubt see for yourself:
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>
using namespace boost::posix_time;
using namespace std;

int main()
{
    ptime Before = microsec_clock::universal_time(); // UTC now
    // do stuff here
    ptime After = microsec_clock::universal_time();  // UTC now

    time_duration delta_t = After - Before;          // how much time has passed?
    cout << delta_t.total_seconds() << endl;         // how many seconds in total?
    cout << delta_t.fractional_seconds() << endl;    // fractional part, in microseconds
}
Here's a pretty nifty way of measuring time. Hope that helps.
OK, you can measure this, sure. However, these sorts of comparisons are so fast that you are probably going to see more variation from process swapping and scheduling than from this single line of code.
This smells of unnecessary, and premature, optimization. Write your program, optimize what you see. If you need more, profile, and then go from there.
I would add that the overall performance of this code on modern CPUs will be dominated not by the comparison instruction but by whether the comparison is well predicted, since any mispredict will waste far more cycles than any integral operation.
As such loop unrolling will probably be the biggest winner but measure, measure, measure.
Thinking that the type of comparison is going to make a difference, without knowing it, is the definition of guessing.
Don't guess.
In general, they should be equivalent (both are usually implemented in single-cycle instructions/micro-ops). Your compiler may do some strange special-case optimization that is difficult to reason about from the source level, which may make either one slightly faster. Also, equality testing is more energy-efficient than inequality testing (>), though the system-level effect is so small as to not merit discussion.
There may be no difference. You could try examining the assembly output for each.
That being said, the only way to tell if any difference is significant is to try it both ways and measure. I'd bet that the change makes no difference whatsoever with optimizations on.
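A rough sketch of "try it both ways and measure" using std::chrono (the volatile sink keeps the optimizer from deleting the loops entirely; expect the two timings to be essentially identical with optimizations on):

#include <chrono>
#include <iostream>

int main()
{
    const int iterations = 200000000;
    volatile long long sink = 0;   // forces the loop bodies to have an observable effect

    auto t0 = std::chrono::steady_clock::now();
    for (int counter = iterations; counter > 0; --counter) sink = sink + counter;
    auto t1 = std::chrono::steady_clock::now();
    for (int counter = iterations; counter != 0; --counter) sink = sink + counter;
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "counter > 0 : " << ms(t1 - t0).count() << " ms\n";
    std::cout << "counter != 0: " << ms(t2 - t1).count() << " ms\n";
}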
Assuming you are developing for the x86 architecture, when you look at the assembly output it will come down to jns vs jne. jns checks the sign flag and jne checks the zero flag. Both operations should, as far as I know, be equally costly.
Clearly the solution is to use the correct data type.
Make counter an unsigned int. Then it can't be less than zero. Your compiler will obviously know this and be forced to choose the optimal solution.
Or you could just measure it.
You could also think about how it would be implemented...(here we go on a tangent)...
less than zero: the sign bit would be set, so need to check 1 bit
equal to zero : the whole value would be zero, so need to check all the bits
Of course, computers are funny things, and it may take longer to check a single bit than the whole value (however many bytes it is on your platform).
You could just measure it...
And you could find out that one is more optimal than the other (under the conditions you measured it). But your program will still run like a dog, because you spent all your time optimising the wrong part of your code.
The best solution is to do what many large software companies do: blame the hardware for not running fast enough and encourage your customer to upgrade their equipment (which is clearly inferior, since your product doesn't run fast enough).
< /rant>
I stumbled across this question just now, 3 years after it was asked, so I am not sure how useful the answer will still be... Still, I am surprised not to see it clearly stated that answering your question requires knowing two, and only two, things:
which processor you target
which compiler you work with
To the first point, each processor has different instructions for tests. On a given processor, two similar comparisons may turn out to take a different number of cycles. For example, you may have a 1-cycle instruction to do a gt (>), eq (==), or le (<=), but no 1-cycle instruction for other comparisons such as ge (>=). Following a test, you may decide to execute conditional instructions or, more often, as in your code example, take a jump. There again, jumps take a variable number of cycles on most high-end processors, depending on whether the conditional jump is taken or not taken, predicted or not predicted. When you write code in assembly and your code is time-critical, you can actually take quite a bit of time figuring out how best to arrange your code to minimize the overall cycle count, and you may end up with a solution that has to be tuned based on the number of times a given comparison returns true or false.
Which leads me to the second point: compilers, like human coders, try to arrange the code to take into account the instructions available and their latencies. Their job is harder because some assumptions an assembly coder could rely on, such as "counter is small", are hard (though not impossible) for them to know. For trivial cases like a loop counter, most modern compilers can at least recognize that the counter will always be positive and that != will behave the same as >, and generate the best code accordingly. But, as many have mentioned in the posts, you will only know that if you either run measurements or inspect your assembly code and convince yourself this is the best you could do in assembly. And when you upgrade to a new compiler, you may then get a different answer.