Measuring CPU clock speed - C++

I am trying to measure the speed of the CPU. I am not sure how accurate my method is. Basically, I tried an empty for loop with values like UINT_MAX, but the program terminated too quickly, so I tried UINT_MAX * 3 and so on...
Then I realized that the compiler was optimizing away the loop, so I added a volatile variable to prevent the optimization. The following program takes approximately 1.5 seconds to finish. I want to know how accurate this algorithm is for measuring the clock speed. Also, how do I know how many cores are involved in the process?
#include <iostream>
#include <limits.h>
#include <stdint.h>   // for UINT32_MAX
#include <time.h>
using namespace std;

int main(void)
{
    volatile int v_obj = 0;
    unsigned long A, B = 0, C = UINT32_MAX;
    clock_t t1, t2;

    t1 = clock();
    for (A = 0; A < C; A++) {
        (void)v_obj;
    }
    t2 = clock();

    std::cout << (double)(t2 - t1) / CLOCKS_PER_SEC << std::endl;
    double t = (double)(t2 - t1) / CLOCKS_PER_SEC;
    unsigned long clock_speed = (unsigned long)(C / t);
    std::cout << "Clock speed : " << clock_speed << std::endl;
    return 0;
}

This doesn't measure clock speed at all, it measures how many loop iterations can be done per second. There's no rule that says one iteration will run per clock cycle. It may be the case, and you may have actually found it to be the case - certainly with optimized code and a reasonable CPU, a useless loop shouldn't run much slower than that. It could run at half speed, though: some processors are not able to retire more than 1 taken branch every 2 cycles. And on esoteric targets, all bets are off.
So no, this doesn't measure clock cycles, except accidentally. In general it's extremely hard to get an empirical clock speed (you can ask your OS what it thinks the maximum clock speed and current clock speed are, see below), because
If you measure how much wall-clock time a loop takes, you must know (at least approximately) the number of cycles per iteration. That's a bad enough problem in assembly, requiring fairly detailed knowledge of the expected microarchitectures (maybe a long chain of dependent instructions that each could only reasonably take 1 cycle, like add eax, 1? a chain long enough that differences in test/branch throughput become small enough to ignore), so obviously anything you do there is not portable and will have assumptions built into it that may become false (there is actually another answer on SO that does this and assumes addps has a latency of 3, which it no longer does on Skylake, and didn't have on old AMDs). In C? Give up now. The compiler might as well be rolling a random code generator, and relying on it to be reasonable is like doing the same with a bear. Guessing the number of cycles per iteration of code you neither control nor even know is just folly. If it's just on your own machine you can check the code, but then you could just check the clock speed manually too, so...
If you measure the number of clock cycles elapsed in a given amount of wall clock time.. but this is tricky. Because rdtsc doesn't measure clock cycles (not anymore), and nothing else gets any closer. You can measure something, but with frequency scaling and turbo, it generally won't be actual clock cycles. You can get actual clock cycles from a performance counter, but you can't do that from user mode. Obviously any way you try to do this is not portable, because you can't portably ask for the number of elapsed clock cycles.
So if you're doing this for actual information and not just to mess around, you should probably just ask the OS. For Windows, query WMI for CurrentClockSpeed or MaxClockSpeed, whichever one you want. On Linux there's stuff in /proc/cpuinfo. Still not portable, but then, no solution is.
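On Linux, for example, a minimal sketch of the "ask the OS" route is to read /proc/cpuinfo and pull out the "cpu MHz" lines. This is illustrative only: the field name is Linux-specific, and the value reflects the current (possibly scaled) frequency of each logical CPU.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    while (std::getline(cpuinfo, line))
    {
        if (line.compare(0, 7, "cpu MHz") == 0)   // one such line per logical CPU
            std::cout << line << '\n';
    }
}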
As for
how do I know how many core's are being involved in the process?
One. Your thread may of course migrate between cores, but since you have only one thread, it runs on only one core at any given time.
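If what you actually want is how many hardware threads the machine has available (as opposed to how many your single-threaded program uses), a small sketch using std::thread::hardware_concurrency from C++11 answers that; note it may return 0 when the value cannot be determined.
#include <iostream>
#include <thread>

int main()
{
    // Number of hardware threads (logical CPUs), not the number your process is using.
    std::cout << "Hardware threads: " << std::thread::hardware_concurrency() << '\n';
}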

A good optimizer may remove the loop, since
for (A = 0; A < C; A++) {
(void)v_obj;
}
has the same effect on the program state as:
A = C;
So the optimizer is entirely free to remove your loop.
So you cannot measure CPU speed this way: it depends on the compiler as much as it does on the computer (not to mention the variable clock speed and multicore architecture already discussed).

Related

Correct QueryPerformanceCounter function implementation / Time changes every time

I have to create a sorting algorithm function that returns the number of comparisons, the number of copies and the number of MICROSECONDS it takes to finish its sorting.
I have seen that to get microseconds I have to use QueryPerformanceCounter, as it's accurate (P.S. I know it isn't portable between OSes).
So I've done this:
void Exchange_sort(int vect[], int dim, int &countconf, int &countcopy, double &time)
{
    LARGE_INTEGER a, b, oh, freq;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&a);
    QueryPerformanceCounter(&b);
    oh.QuadPart = b.QuadPart - a.QuadPart;   // Saves the timer's own overhead in oh

    QueryPerformanceCounter(&a);
    int i = 0, j = 0;                        // The sorting algorithm starts
    for (i = 0; i < dim - 1; i++)
    {
        for (j = i + 1; j < dim; j++)
        {
            countconf++;                     // +1 comparison
            if (vect[i] > vect[j])
            {
                scambio(vect[i], vect[j]);   // A function that swaps 2 integers
                countcopy = countcopy + 3;   // +3 copies
            }
        }
    }
    QueryPerformanceCounter(&b);             // Ends timer

    time = (((double)(b.QuadPart - a.QuadPart - oh.QuadPart) / freq.QuadPart)
            * 1000000);
}
The *1000000 is there to give microseconds...
I think it should work like this, but every time I call the function with the same array size, it returns a different time... How can I solve that?
Thank you very much, and sorry for my bad coding.
Firstly, the performance counter frequency might not be that great. It's usually several hundred thousand or more, which gives a microsecond or tens of microseconds resolution, but you should be aware that it can be even worse.
Secondly, if your array size is small, your sort might finish in nanoseconds or microseconds, and you would not be able to measure that accurately with QueryPerformanceCounter.
Thirdly, when your benchmark process is running, Windows might take the CPU away from it for a (relatively) long time, milliseconds or maybe even hundreds of milliseconds. This will lead to highly irregular and seemingly erratic timings.
I have two suggestions that you might pursue independently of each other:
I suggest you investigate using the RDTSC instruction (via inline assembly, compiler intrinsics, or an existing library), which will most likely give you better resolution with far less overhead. But I have to warn you that it has its own bag of problems.
For this type of benchmark, you have to run your sort routine with the exact same input many times (tens or hundreds) and then take the smallest time measurement. The reason you should adopt this strategy is that there are a few phenomena that will interfere with your timing and make it longer, but there is nothing that can make your sort go faster than it would on paper. Therefore, you need to run the test many, many times and hope to all your gods that the fastest time you've measured is the actual running time with no interference or noise (a sketch combining both suggestions follows below).
UPDATE: Reading through the comments on the question, it seems that you are trying to time a very short-running piece of code with a timer that doesn't have enough resolution. Either increase your input size, or use RDTSC.
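A minimal sketch combining both suggestions, assuming the Exchange_sort from the question and the __rdtsc compiler intrinsic (declared in <intrin.h> on MSVC, <x86intrin.h> on GCC/Clang); the repetition count here is arbitrary:
#include <intrin.h>      // __rdtsc on MSVC (use <x86intrin.h> on GCC/Clang)
#include <algorithm>
#include <cstdint>
#include <vector>

// From the question: sorts vect and reports comparisons, copies and time.
void Exchange_sort(int vect[], int dim, int &countconf, int &countcopy, double &time);

uint64_t best_sort_cycles(const std::vector<int> &input, int &countconf, int &countcopy)
{
    const int REPS = 100;                // repeat many times, keep the best run
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < REPS; ++r)
    {
        std::vector<int> work = input;   // identical input for every run
        double unused = 0.0;
        uint64_t t0 = __rdtsc();
        Exchange_sort(work.data(), (int)work.size(), countconf, countcopy, unused);
        uint64_t t1 = __rdtsc();
        best = std::min(best, t1 - t0);  // interference only ever makes runs slower
    }
    return best;                         // smallest cycle count observed
}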
The short answer to your question is that it is not possible to measure exactly the same time for every call of the same function.
The fact that you are getting different times is expected, because your operating system is not a perfect real-time system but a general-purpose OS with multiple processes running at the same time and competing to be scheduled by the kernel for their own CPU cycles.
Also, consider that each time you execute your program or function, some of its instructions might be located in RAM and some might be available in the CPU's L1 or L2 cache, and this will probably change from one execution to another. So there are lots of variables to consider when evaluating the elapsed time of function calls with a high level of precision.

Why is the CPU time different with every execution of this program?

I have a hard time understanding processor time. The result of this program:
#include <iostream>
#include <chrono>
// the function f() does some time-consuming work
void f()
{
volatile long double d;
int size = 10000;
for(int n=0; n<size; ++n)
for(int m=0; m<size; ++m)
d = n*m;
}
int main()
{
std::clock_t start = std::clock();
f();
std::clock_t end = std::clock();
std::cout << "CPU time used: "
<< (end - start)
<< "\n";
}
Seems to randomly fluctuate between 210 000, 220 000 and 230 000. At first I was puzzled by these discrete values. Then I found out that std::clock() returns only an approximate processor time. So the value returned by std::clock() is probably rounded to a multiple of 10 000. This would also explain why the maximum difference between the CPU times is 20 000 (10 000 == rounding error from the first call to std::clock() and 10 000 from the second).
But if I change to int size = 40000; in the body of f(), I get fluctuations in the ranges of 3 400 000 to 3 500 000 which cannot be explained by rounding.
From what I read about the clock rate, on Wikipedia:
The CPU requires a fixed number of clock ticks (or clock cycles) to
execute each instruction. The faster the clock, the more instructions
the CPU can execute per second.
That is, if the program is deterministic (which I hope mine is), the CPU time needed to finish should be:
Always the same
Slightly higher than the number of instructions carried out
My experiments show neither, since my program needs to carry out at least 3 * size * size instructions. Could you please explain what I am doing wrong?
First, the statement you quote from Wikipedia is simply false. It might have been true 20 years ago (but not always, even then), but it is totally false today. There are many things which can affect your timings:
The first: if you're running on Windows, clock is broken, and totally unreliable. It returns the difference in elapsed time, not CPU time. And elapsed time depends on all sorts of other things the processor might be doing.
Beyond that: things like cache misses have a very significant impact on time. And whether a particular piece of data is in the cache or not can depend on whether your program was interrupted between the last access and this one.
In general, anything less than 10% can easily be due to caching issues. And I've seen differences of a factor of 10 under Windows, depending on whether there was a build running or not.
You don't state what hardware you're running the binary on.
Does it have an interrupt driven CPU ?
Is it a multitasking operating system ?
You're confusing the cycle time of the CPU (the CPU clock that Wikipedia refers to) with the time it takes to execute a particular piece of code from start to end, alongside all the other things the poor CPU has to do at the same time.
Also ... is all your executing code in level 1 cache, or is some in level 2 or in main memory, or on disk ... what about the next time you run it ?
Your program is not deterministic, because it uses library and system functions which are not deterministic.
As a particular example, when you allocate memory this is virtual memory, which must be mapped to physical memory. Although this is a system call, running kernel code, it takes place on your thread and will count against your clock time. How long it takes to do this will depend on what the overall memory allocation situation is.
The CPU time is indeed "fixed" for a given set of circumstances. However, in a modern computer, there are other things happening in the system which interfere with the execution of your code. It may be that caches are being wiped out when your email software wakes up to check if there are any new emails for you, or when the HP printer software checks for updates, or when the antivirus software decides to run for a little bit to check if your memory contains any viruses, etc, etc, etc.
Part of this is also caused by the fact that CPU time accounting in any system is not 100% accurate - it works on "clock-ticks" and similar things, so the time used by, for example, an interrupt servicing an incoming network packet, a hard disk interrupt, or the timer interrupt saying "another millisecond ticked by" all gets accounted to "the currently running process". Assuming this is Windows, there is a further "feature": for historical and other reasons, std::clock() simply returns the time now, not the time actually used by your process. So for example:
t = clock();
cin >> x;
t = clock() - t;
would leave t with a time of 10 seconds if it took ten seconds to input the value of x, even though 9.999 of those ten seconds were spent in the idle process, not your program.
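If you actually want the CPU time charged to your process on Windows rather than wall-clock time, a minimal sketch using GetProcessTimes (user plus kernel time, reported in 100-nanosecond FILETIME units) looks like this; it illustrates the API rather than being a drop-in replacement for clock():
#include <windows.h>
#include <iostream>

double process_cpu_seconds()
{
    FILETIME creation, exitTime, kernel, user;
    if (!GetProcessTimes(GetCurrentProcess(), &creation, &exitTime, &kernel, &user))
        return -1.0;
    ULARGE_INTEGER k, u;
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;
    return (double)(k.QuadPart + u.QuadPart) * 100e-9;   // 100 ns ticks -> seconds
}

int main()
{
    std::cout << "CPU time used so far: " << process_cpu_seconds() << " s\n";
}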

C++, Timer, Milliseconds

#include <iostream>
#include <conio.h>
#include <ctime>
using namespace std;
double diffclock(clock_t clock1, clock_t clock2)
{
    double diffticks = clock1 - clock2;
    double diffms = (diffticks) / (CLOCKS_PER_SEC / 1000);
    return diffms;
}

int main()
{
    clock_t start = clock();
    for (int i = 0; ; i++)
    {
        if (i == 10000) break;
    }
    clock_t end = clock();
    cout << diffclock(start, end) << endl;
    getch();
    return 0;
}
So my problem comes down to it returning 0. Well, to be straight, I want to check how much time my program takes to run...
I found tons of stuff over the internet, but mostly it comes down to the same point: getting a 0 because the start and the end are the same.
This problem is in C++, remember :<
There are a few problems here. The first is that you obviously switched the start and stop times when passing them to the diffclock() function. The second problem is optimization. Any reasonably smart compiler with optimizations enabled would simply throw the entire loop away, as it does not have any side effects. But even if you fix the above problems, the program would most likely still print 0. If you imagine doing billions of operations per second, and throw in sophisticated out-of-order execution, prediction and tons of other technologies employed by modern CPUs, even the CPU itself may effectively optimize your loop away. And even if it doesn't, you'd need a lot more than 10K iterations to make it run longer. You'd probably need your program to run for a second or two in order for clock() to reflect anything.
But the most important problem is clock() itself. That function is not suitable for any kind of performance measurement whatsoever. What it does is give you an approximation of the processor time used by the program. Aside from the vague nature of the approximation method that might be used by any given implementation (since the standard doesn't require anything specific), the POSIX standard also requires CLOCKS_PER_SEC to be equal to 1000000, independent of the actual resolution. In other words - it doesn't matter how precise the clock is, and it doesn't matter at what frequency your CPU is running. To put it simply - it is a totally useless number and therefore a totally useless function. It probably only still exists for historical reasons. So, please do not use it.
To achieve what you are looking for, people used to read the CPU time-stamp counter, also known as "RDTSC" after the name of the corresponding CPU instruction used to read it. These days, however, this is also mostly useless because:
Modern operating systems can easily migrate the program from one CPU to another. You can imagine that reading the time stamp on one CPU after running for a second on another doesn't make a lot of sense. Only in the latest Intel CPUs is the counter synchronized across CPU cores. All in all, it is still possible to do this, but a lot of extra care must be taken (i.e. one can set up the affinity for the process, etc.).
Measuring CPU instructions of the program oftentimes does not give an accurate picture of how much time it is actually using. This is because in real programs there can be system calls where the work is performed by the OS kernel on behalf of the process. In that case, that time is not included.
It can also happen that the OS suspends the execution of the process for a long time. And even though it took only a few instructions to execute, to the user it seemed like a second. So such a performance measurement may be useless.
So what to do?
When it comes to profiling, a tool like perf should be used. It can track the number of CPU clocks, cache misses, branches taken, branches missed, the number of times the process was moved from one CPU to another, and so on. It can be used as a standalone tool, or embedded into your application (with something like PAPI).
And if the question is about the actual time spent, people use a wall clock. Preferably a high-precision one that is also not subject to NTP adjustments (i.e. monotonic). That shows exactly how much time elapsed, no matter what was going on. For that purpose clock_gettime() can be used; it is part of SUSv2 and the POSIX.1-2001 standard. Given that you use getch() to keep the terminal open, I'd assume you are using Windows. There, unfortunately, you don't have clock_gettime(), and the closest thing is the performance counters API:
BOOL QueryPerformanceFrequency(LARGE_INTEGER *lpFrequency);
BOOL QueryPerformanceCounter(LARGE_INTEGER *lpPerformanceCount);
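(For completeness, on POSIX systems a minimal sketch of the clock_gettime route mentioned above, using the monotonic clock, looks like this; older glibc versions need -lrt:)
#include <time.h>
#include <iostream>

int main()
{
    timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    // ... code under test ...
    clock_gettime(CLOCK_MONOTONIC, &end);
    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    std::cout << "elapsed: " << ns << " ns\n";
}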
For a portable solution, the best bet is std::chrono. It was introduced in C++11, but is supported by most industrial-grade compilers (GCC, Clang, MSVC).
Below is an example of how to use it (with steady_clock, which is monotonic). Please note that since I know my CPU will do 10000 increments of an integer much faster than a millisecond, I have changed the output to microseconds. I've also declared the counter as volatile in the hope that the compiler won't optimize it away.
#include <ctime>
#include <chrono>
#include <iostream>
int main()
{
    volatile int i = 0; // "volatile" is to ask compiler not to optimize the loop away.
    auto start = std::chrono::steady_clock::now();
    while (i < 10000) {
        ++i;
    }
    auto end = std::chrono::steady_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "It took me " << elapsed.count() << " microseconds." << std::endl;
}
When I compile and run it, it prints:
$ g++ -std=c++11 -Wall -o test ./test.cpp && ./test
It took me 23 microseconds.
Hope it helps. Good Luck!
At a glance, it seems like you are subtracting the larger value from the smaller value. You call:
diffclock( start, end );
But then diffclock is defined as:
double diffclock( clock_t clock1, clock_t clock2 ) {
    double diffticks = clock1 - clock2;
    double diffms = diffticks / ( CLOCKS_PER_SEC / 1000 );
    return diffms;
}
Apart from that, it may have something to do with the way you are converting units. The use of 1000 to convert to milliseconds differs from what is shown on this page:
http://en.cppreference.com/w/cpp/chrono/c/clock
The problem appears to be that the loop is just too short. I tried it on my system and it gave 0 ticks. I checked what diffticks was and it was 0. I increased the loop size to 100000000 so there was a noticeable time lag, and got -290 as output (a bug: diffticks should be clock2 - clock1, so we should get 290, not -290). I also tried changing "1000" to "1000.0" in the division, and that didn't help.
Compiling with optimization removes the loop, so you either have to avoid optimization or make the loop "do something", e.g. increment a counter other than the loop counter in the loop body. At least that's what GCC does.
Note: this is available since C++11.
You can use the std::chrono library.
std::chrono has two distinct kinds of object: time points and durations. A time point represents a point in time, and a duration, as the term suggests, represents an interval or span of time.
This C++ library allows us to subtract two time points to get the duration of time passed in the interval. So you can set a starting point and a stopping point. Using functions, you can also convert them into the appropriate units.
Example using high_resolution_clock (which is one of the three clocks this library provides):
#include <chrono>
using namespace std::chrono;
//before running function
auto start = high_resolution_clock::now();
//after calling function
auto stop = high_resolution_clock::now();
Subtract stop and start timepoints and cast it into required units using the duration_cast() function. Predefined units are nanoseconds, microseconds, milliseconds, seconds, minutes, and hours.
auto duration = duration_cast<microseconds>(stop - start);
cout << duration.count() << endl;
First of all, you should subtract end - start, not vice versa.
The documentation says that if the value is not available, clock() returns -1; did you check that?
What optimization level do you use when compiling your program? If optimization is enabled, the compiler can effectively eliminate your loop entirely.

Measuring time to perform simple instruction

I am trying to measure the number of cycles it takes my CPU to perform a specific instruction (one that should take one CPU cycle), and the output must be in cycle-lengths (The time it takes the CPU to complete one cycle).
So first of all, my CPU is 2.1GHz, so that means that one cycle-length unit on my computer is 1/2100, right?
Also - I am using getTimeOfDay to measure time in microseconds and I calculate the average of 1,000,000 iterations.
So if I'm not mistaken my desired output must be result*2100 (in order to get it in cycle lengths). Am I right?
Thanks!
P.S. Don't know if it matters, but I'm writing in C++.
I believe you have been misinformed about a few things.
In modern terms clock speed is an indication of speed, not an actual measure of speed - so there is no reasonable way to estimate how long a single instruction may take.
Your question is based on the assumption that all instructions are equal - they most certainly aren't, some CPU instructions get interpreted as sequences of micro-instructions on some architectures, and on others the timings may change.
Additionally, you can't safely assume that on modern architectures that a repeated instruction will perform the same way, this depends on data and instruction caches, pipelines and branch prediction.
The resolution of getTimeOfDay isn't anything like accurate enough to estimate the length of time required to measure single instructions, even CPU clock cycle counters (TSC on x86) are not sufficient.
Furthermore, your operating system is a major source of error in estimating such timings; context switches, power management, machine load and interrupts all have a huge impact. But even on a true hard real-time operating system (QNX or VxWorks), such measurements are still difficult, and require time and tools, as well as the expertise to interpret the results. On a general-purpose operating system (Windows or a basic Linux), you've got little to no hope of getting accurate measurements.
The computational overhead and error of reading and storing the CPU cycle counts will also tend to dwarf the time required for one instruction. At minimum, I suggest you consider grouping several hundred or thousand instructions together.
On deterministic architectures (1 cycle = 1 instruction) with no caches, like a PIC chip, you can do exactly what you suggest by using the clock multiplier, but even then to validate your measurements you would probably need a logic analyser (ie. you need to do this in hardware).
In short, this is an extremely hard problem.
The CPU contains a cycle counter which you can read with a bit of inline assembly:
static inline uint64_t get_cycles()
{
    // Read edx:eax explicitly; the old "=A" constraint only forms a 64-bit value on 32-bit x86.
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
If you measure the cycle count for 1, 2 and 3 million iterations of your operation, you ought to be able to interpolate the cost of one, but be sure to also measure "empty" loops to remove the cost of the looping:
{
    uint64_t m = get_cycles();
    for (unsigned int i = 0; i != 1000000; ++i)
    {
        __asm__ __volatile__ ("" ::: "memory");   // compiler barrier so the empty loop isn't removed
    }
    uint64_t n = get_cycles();
    // cost of loop: n - m
}
{
    uint64_t m = get_cycles();
    for (unsigned int i = 0; i != 1000000; ++i)
    {
        my_operation();
    }
    uint64_t n = get_cycles();
    // cost of 1000000 operations: n - m - cost of loop
}
// repeat for 2000000, 3000000.
I am trying to measure the time it takes my computer to perform a simple instruction
If that's the case, the key isn't even the most accurate time function you can find. I bet none have the resolution necessary to provide a meaningful result.
The key is to increase your sample count.
So instead of doing something like:
start = tik();
instruction();
end = tok();
time = end - start;
do
start = tik();
for ( 1..10000 )
instruction();
end = tok();
time = (end - start) / 10000;
This will provide more accurate results, and the error caused by the measuring mechanism will be negligible.
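Here is a concrete version of the pseudocode above, a sketch assuming a hypothetical instruction() stand-in for the operation being measured, timed with std::chrono::steady_clock and averaged over the iteration count:
#include <chrono>
#include <iostream>

volatile int sink = 0;
void instruction() { ++sink; }   // hypothetical stand-in for the operation under test

int main()
{
    const int N = 1000000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        instruction();
    auto end = std::chrono::steady_clock::now();
    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cout << "average: " << total.count() / double(N) << " ns per call\n";
}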

Best way to test code speed in C++ without profiler, or does it not make sense to try?

On SO, there are quite a few questions about performance profiling, but I don't seem to find the whole picture. There are quite a few issues involved, and most Q&As ignore all but a few at a time, or don't justify their proposals.
What I'm wondering about: if I have two functions that do the same thing and I'm curious about the difference in speed, does it make sense to test this without external tools, with timers, or will this compiled-in testing affect the results too much?
I ask this because if it is sensible, as a C++ programmer I want to know how it should best be done, since timers are much simpler than using external tools. If it makes sense, let's proceed with all the possible pitfalls:
Consider this example. The following code shows 2 ways of doing the same thing:
#include <algorithm>
#include <ctime>
#include <iostream>
typedef unsigned char byte;
inline
void
swapBytes( byte* in, size_t n )
{
for( size_t lo=0, hi=n-1; hi>lo; ++lo, --hi )
in[lo] ^= in[hi]
, in[hi] ^= in[lo]
, in[lo] ^= in[hi] ;
}
int
main()
{
byte arr[9] = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
const int iterations = 100000000;
clock_t begin = clock();
for( int i=iterations; i!=0; --i )
swapBytes( arr, 8 );
clock_t middle = clock();
for( int i=iterations; i!=0; --i )
std::reverse( arr, arr+8 );
clock_t end = clock();
double secSwap = (double) ( middle-begin ) / CLOCKS_PER_SEC;
double secReve = (double) ( end-middle ) / CLOCKS_PER_SEC;
std::cout << "swapBytes, for: " << iterations << " times takes: " << middle-begin
<< " clock ticks, which is: " << secSwap << "sec." << std::endl;
std::cout << "std::reverse, for: " << iterations << " times takes: " << end-middle
<< " clock ticks, which is: " << secReve << "sec." << std::endl;
std::cin.get();
return 0;
}
// Output:
// Release:
// swapBytes, for: 100000000 times takes: 3000 clock ticks, which is: 3sec.
// std::reverse, for: 100000000 times takes: 1437 clock ticks, which is: 1.437sec.
// Debug:
// swapBytes, for: 10000000 times takes: 1781 clock ticks, which is: 1.781sec.
// std::reverse, for: 10000000 times takes: 12781 clock ticks, which is: 12.781sec.
The issues:
Which timers to use and how get the cpu time actually consumed by the code under question?
What are the effects of compiler optimization (since these functions just swap bytes back and forth, the most efficient thing is obviously to do nothing at all)?
Considering the results presented here, do you think they are accurate (I can assure you that multiple runs give very similar results)? If yes, can you explain how std::reverse gets to be so fast, considering the simplicity of the custom function? I don't have the source code from the VC++ version that I used for this test, but here is the implementation from GNU. It boils down to the function iter_swap, which is completely incomprehensible to me. Would that also be expected to run twice as fast as the custom function, and if so, why?
Contemplations:
It seems two high-precision timers are being proposed: clock() and QueryPerformanceCounter (on Windows). Obviously we would like to measure the CPU time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with the measurements. This page on the GNU C library seems to contradict that, but when I put a breakpoint in VC++, the debugged process gets a lot of clock ticks even though it was suspended (I have not tested under GNU). Am I missing alternative counters, or do we need special libraries or classes for this? If not, is clock good enough in this example, or would there be a reason to use QueryPerformanceCounter?
What can we know for certain without debugging, disassembling and profiling tools? Is anything actually happening? Is the function call being inlined or not? When checking in the debugger, the bytes do actually get swapped, but I'd rather know from theory why than from testing.
Thanks for any directions.
update
Thanks to a hint from tojas, the swapBytes function now runs as fast as std::reverse. I had failed to realize that the temporary copy in the case of a byte need only be a register, and is therefore very fast. Elegance can blind you.
inline
void
swapBytes( byte* in, size_t n )
{
byte t;
for( int i=0; i<7-i; ++i )
{
t = in[i];
in[i] = in[7-i];
in[7-i] = t;
}
}
Thanks to a tip from ChrisW, I have found that on Windows you can get the actual CPU time consumed by a (read: your) process through Windows Management Instrumentation. This definitely looks more interesting than the high-precision counter.
Obviously we would like to measure the cpu time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements.
I do two things, to ensure that wall-clock time and CPU time are approximately the same thing:
Test for a significant length of time, i.e. several seconds (e.g. by testing a loop of however many thousands of iterations)
Test when the machine is more or less relatively idle except for whatever I'm testing.
Alternatively if you want to measure only/more exactly the CPU time per thread, that's available as a performance counter (see e.g. perfmon.exe).
What can we know for certain without debugging, dissassembling and profiling tools?
Nearly nothing (except that I/O tends to be relatively slow).
To answer your main question: the "reverse" algorithm just swaps elements of the array and does not operate on the elements themselves.
Use QueryPerformanceCounter on Windows if you need high-resolution timing. The counter accuracy depends on the CPU, but it can go up to one tick per clock pulse. However, profiling real-world operations is always a better idea.
Is it safe to say you're asking two questions?
Which one is faster, and by how much?
And why is it faster?
For the first, you don't need high precision timers. All you need to do is run them "long enough" and measure with low precision timers. (I'm old-fashioned, my wristwatch has a stop-watch function, and it is entirely good enough.)
For the second, surely you can run the code under a debugger and single-step it at the instruction level. Since the basic operations are so simple, you will be able to easily see roughly how many instructions are required for the basic cycle.
Think simple. Performance is not a hard subject. Usually, people are trying to find problems, for which this is a simple approach.
(This answer is specific to Windows XP and the 32-bit VC++ compiler.)
The easiest thing for timing little bits of code is the time-stamp counter of the CPU. This is a 64-bit value, a count of the number of CPU cycles run so far, which is about as fine a resolution as you're going to get. The actual numbers you get aren't especially useful as they stand, but if you average out several runs of various competing approaches then you can compare them that way. The results are a bit noisy, but still valid for comparison purposes.
To read the time-stamp counter, use code like the following:
LARGE_INTEGER tsc;
__asm {
cpuid
rdtsc
mov tsc.LowPart,eax
mov tsc.HighPart,edx
}
(The cpuid instruction is there to ensure that there aren't any incomplete instructions waiting to complete.)
There are four things worth noting about this approach.
Firstly, because of the inline assembly language, it won't work as-is on MS's x64 compiler. (You'll have to create a .ASM file with a function in it, an exercise for the reader; I don't know the details. Alternatively, compiler intrinsics can be used, as sketched after these notes.)
Secondly, to avoid problems with cycle counters not being in sync across different cores/threads/what have you, you may find it necessary to set your process's affinity so that it only runs on one specific execution unit. (Then again... you may not.)
Thirdly, you'll definitely want to check the generated assembly language to ensure that the compiler is generating roughly the code you expect. Watch out for bits of code being removed, functions being inlined, that sort of thing.
Finally, the results are rather noisy. The cycle counters count cycles spent on everything, including waiting for caches, time spent on running other processes, time spent in the OS itself, etc. Unfortunately, it's not possible (under Windows, at least) to time just your process. So, I suggest running the code under test a lot of times (several tens of thousands) and working out the average. This isn't very cunning, but it seems to have produced useful results for me at any rate.
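Regarding the first note above: on the x64 compiler, where inline assembly is not available, the same reading can be done with the compiler intrinsics from <intrin.h>; this is a sketch of that route, with __cpuid playing the serializing role:
#include <intrin.h>

unsigned __int64 read_tsc()
{
    int regs[4];
    __cpuid(regs, 0);    // serializing instruction: let pending instructions complete first
    return __rdtsc();    // read the 64-bit time-stamp counter
}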
I would suppose that anyone competent enough to answer all your questions is going to be far too busy to answer all your questions. In practice it is probably more effective to ask a single, well-defined question. That way you may hope to get well-defined answers which you can collect and be on your way to wisdom.
So, anyway, perhaps I can answer your question about which clock to use on Windows.
clock() is not considered a high-precision clock. If you look at the value of CLOCKS_PER_SEC you will see it has a resolution of 1 millisecond. This is only adequate if you are timing very long routines, or a loop with tens of thousands of iterations. As you point out, if you try to repeat a simple method tens of thousands of times in order to get a time that can be measured with clock(), the compiler is liable to step in and optimize the whole thing away.
So, really, the only clock to use is QueryPerformanceCounter()
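A minimal sketch of that QueryPerformanceCounter approach (counts divided by the counter frequency give elapsed seconds); the code under test is left as a placeholder:
#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    // ... code under test ...
    QueryPerformanceCounter(&t1);
    double seconds = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    std::cout << "elapsed: " << seconds * 1e6 << " microseconds\n";
}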
Is there something you have against profilers? They help a ton. Since you are on WinXP, you should really give a trial version of VTune a try. Try a call-graph sampling test and look at self time and total time of the functions being called. There's no better way to tune your program so that it's the fastest possible without being an assembly genius (and a truly exceptional one).
Some people just seem to be allergic to profilers. I used to be one of those and thought I knew best about where my hotspots were. I was often correct about obvious algorithmic inefficiencies, but practically always incorrect about more micro-optimization cases. Just rewriting a function without changing any of the logic (ex: reordering things, putting exceptional case code in a separate, non-inlined function, etc) can make functions a dozen times faster and even the best disassembly experts usually can't predict that without the profiler.
As for relying on simplistic timing tests alone, they are extremely problematic. That current test is not so bad but it's a very common mistake to write timing tests in ways in which the optimizer will optimize out dead code and end up testing the time it takes to do essentially a nop or even nothing at all. You should have some knowledge to interpret the disassembly to make sure the compiler isn't doing this.
Also, timing tests like this have a tendency to bias the results significantly, since a lot of them just involve running your code over and over in the same loop, which tends to simply test the effect of your code when all the memory is in the cache and all the branch prediction is working perfectly for it. It's often just showing you best-case scenarios without showing you the average, real-world case.
Depending on real world timing tests is a little bit better; something closer to what your application will be doing at a high level. It won't give you specifics about what is taking what amount of time, but that's precisely what the profiler is meant to do.
Wha? How to measure speed without a profiler? The very act of measuring speed is profiling! The question amounts to, "how can I write my own profiler?" And the answer is clearly, "don't".
Besides, you should be using std::swap in the first place, which completely invalidates this whole pointless pursuit.
-1 for pointlessness.