Measuring time to perform simple instruction

Measuring time to perform simple instruction - c++

I am trying to measure the number of cycles it takes my CPU to perform a specific instruction (one that should take one CPU cycle), and the output must be in cycle-lengths (The time it takes the CPU to complete one cycle).
So first of all, my CPU is 2.1GHz, so that means that one cycle-length unit on my computer is 1/2100, right?
Also - I am using getTimeOfDay to measure time in microseconds and I calculate the average of 1,000,000 iterations.
So if I'm not mistaken my desired output must be result*2100 (in order to get it in cycle lengths). Am I right?
Thanks!
P.S Don't know if it matters, but I'm writing in cpp

I believe you have some been misinformed about a few things.
In modern terms clock speed is an indication of speed, not an actual measure of speed - so there is no reasonable way to estimate how long a single instruction may take.
Your question is based on the assumption that all instructions are equal - they most certainly aren't, some CPU instructions get interpreted as sequences of micro-instructions on some architectures, and on others the timings may change.
Additionally, you can't safely assume that on modern architectures that a repeated instruction will perform the same way, this depends on data and instruction caches, pipelines and branch prediction.
The resolution of getTimeOfDay isn't anything like accurate enough to estimate the length of time required to measure single instructions, even CPU clock cycle counters (TSC on x86) are not sufficient.
Furthermore, your operating system is a major source of error in estimation of such timings, context switches, power management, machine load and interrupts all have a huge impact. But even on a true hard real-time operating system (QNX or VxWorks), such measurements are still difficult, and require time and tools, as well as the expertise to interpret the results. On a General Purpose Operating System (windows or a basic Linux), you've got little to no hope of getting accurate measurements)
The computational overhead and error of reading and storing the CPU cycle counts will also tend to dwarf the time required for one instruction. At minimum, I suggest you consider grouping several hundred or thousand instructions together.
On deterministic architectures (1 cycle = 1 instruction) with no caches, like a PIC chip, you can do exactly what you suggest by using the clock multiplier, but even then to validate your measurements you would probably need a logic analyser (ie. you need to do this in hardware).
In short, this is an extremely hard problem.

The CPU contains a cycle counter which you can read with a bit of inline assembly:
static inline uint64_t get_cycles()
{
uint64_t n;
__asm__ __volatile__ ("rdtsc" : "=A"(n));
return n;
}
If you measure the cycle count for 1, 2 and 3 million iterations of your operation, you ought to be able to interpolate the cost of one, but be sure to also measure "empty" loops to remove the cost of the looping:
{
unsigned int n, m = get_cycles();
for (unsigned int n = 0; n != 1000000; ++n)
{
// (compiler barrier)
}
n = get_cycles();
// cost of loop: n - m
}
{
unsigned int n, m = get_cycles();
for (unsigned int n = 0; n != 1000000; ++n)
{
my_operation();
}
n = get_cycles();
// cost of 1000000 operations: n - m - cost of loop
}
// repeat for 2000000, 3000000.

I am trying to measure the time it takes my computer to perform a simple instruction
If that's the case, the key isn't even the most accurate time function you can find. I bet none have the resolution necessary to provide a meaningful result.
The key is to increase your sample count.
So instead of doing something like:
start = tik();
instruction();
end = tok();
time = end - start;
do
start = tik();
for ( 1..10000 )
instruction();
end = tok();
time = (end - start) / 10000;
This will provide more accurate results, and the error caused by the measuring mechanism will be negligeable.

Related

What is faster in C++: mod (%) or another counter?

At the risk of this being a duplicate, maybe I just can't find a similar post right now:
I am writing in C++ (C++20 to be specific). I have a loop with a counter that counts up every turn. Let's call it counter. And if this counter reaches a page-limit (let's call it page_limit), the program should continue on the next page. So it looks something like this:
const size_t page_limit = 4942;
size_t counter = 0;
while (counter < foo) {
if (counter % page_limit == 0) {
// start new page
}
// some other code
counter += 1;
}
Now I am wondering since the counter goes pretty high: would the program run faster, if I wouldn't have the program calculate the modulo counter % page_limit every time, but instead make another counter? It could look something like this:
const size_t page_limit = 4942;
size_t counter = 0;
size_t page_counter = 4942;
while (counter < foo) {
if (page_counter == page_limit) {
// start new page
page_counter = 0;
}
// some other code
counter += 1;
page_counter += 1;
}

Most optimizing compilers will convert divide or modulo operations into multiply by pre-generated inverse constant and shift instructions if the divisor is a constant. Possibly also if the same divisor value is used repeatedly in a loop.
Modulo multiplies by inverse to get a quotient, then multiplies quotient by divisor to get a product, and then original number minus product will be the modulo.
Multiply and shift are fast instructions on reasonably recent X86 processors, but branch prediction can also reduce the time it takes for a conditional branch, so as suggested a benchmark may be needed to determine which is best.

(I assume you meant to write if(x%y==0) not if(x%y), to be equivalent to the counter.)
I don't think compilers will do this optimization for you, so it could be worth it. It's going to be smaller code-size, even if you can't measure a speed difference. The x % y == 0 way still branches (so is still subject to a branch misprediction those rare times when it's true). Its only advantage is that it doesn't need a separate counter variable, just some temporary registers at one point in the loop. But it does need the divisor every iteration.
Overall this should be better for code size, and isn't less readable if you're used to the idiom. (Especially if you use if(--page_count == 0) { page_count=page_limit; ... so all pieces of the logic are in two adjacent lines.)
If your page_limit were not a compile-time constant, this is even more likely to help. dec/jz that's only taken once per many decrements is a lot cheaper than div/test edx,edx/jz, including for front-end throughput. (div is micro-coded on Intel CPUs as about 10 uops, so even though it's one instruction it still takes up the front-end for multiple cycles, taking away throughput resources from getting surrounding code into the out-of-order back-end).
(With a constant divisor, it's still multiply, right shift, sub to get the quotient, then multiply and subtract to get the remainder from that. So still several single-uop instructions. Although there are some tricks for divisibility testing by small constants see #Cassio Neri's answer on Fast divisibility tests (by 2,3,4,5,.., 16)? which cites his journal articles; recent GCC may have started using these.)
But if your loop body doesn't bottleneck on front-end instruction/uop throughput (on x86), or the divider execution unit, then out-of-order exec can probably hide most of the cost of even a div instruction. It's not on the critical path so it could be mostly free if its latency happens in parallel with other computation, and there are spare throughput resources. (Branch prediction + speculative execution allow execution to continue without waiting for the branch condition to be known, and since this work is independent of other work it can "run ahead" as the compiler can see into future iterations.)
Still, making that work even cheaper can help the compiler see and handle a branch mispredict sooner. But modern CPUs with fast recovery can keep working on old instructions from before the branch while recovering. ( What exactly happens when a skylake CPU mispredicts a branch? / Avoid stalling pipeline by calculating conditional early )
And of course a few loops do fully keep the CPU's throughput resources busy, not bottlenecking on cache misses or a latency chain. And fewer uops executed per iteration is more friendly to the other hyperthread (or SMT in general).
Or if you care about your code running on in-order CPUs (common for ARM and other non-x86 ISAs that target low-power implementations), the real work has to wait for the branch-condition logic. (Only hardware prefetch or cache-miss loads and things like that can be doing useful work while running extra code to test the branch condition.)
Use a down-counter
Instead of counting up, you'd actually want to hand-hold the compiler into using a down-counter that can compile to dec reg / jz .new_page or similar; all normal ISAs can do that quite cheaply because it's the same kind of thing you'd find at the bottom of normal loops. (dec/jnz to keep looping while non-zero)
if(--page_counter == 0) {
/*new page*/;
page_counter = page_limit;
}
A down-counter is more efficient in asm and equally readable in C (compared to an up-counter), so if you're micro-optimizing you should write it that way. Related: using that technique in hand-written asm FizzBuzz. Maybe also a code review of asm sum of multiples of 3 and 5, but it does nothing for no-match so optimizing it is different.
Notice that page_limit is only accessed inside the if body, so if the compiler is low on registers it can easily spill that and only read it as needed, not tying up a register with it or with multiplier constants.
Or if it's a known constant, just a move-immediate instruction. (Most ISAs also have compare-immediate, but not all. e.g. MIPS and RISC-V only have compare-and-branch instructions that use the space in the instruction word for a target address, not for an immediate.) Many RISC ISAs have special support for efficiently setting a register to a wider constant than most instructions that take an immediate (like ARM movw with a 16-bit immediate, so 4092 can be done in one instruction more mov but not cmp: it doesn't fit in 12 bits).
Compared to dividing (or multiplicative inverse), most RISC ISAs don't have multiply-immediate, and a multiplicative inverse is usually wider than one immediate can hold. (x86 does have multiply-immediate, but not for the form that gives you a high-half.) Divide-immediate is even rarer, not even x86 has that at all, but no compiler would use that unless optimizing for space instead of speed if it did exist.
CISC ISAs like x86 can typically multiply or divide with a memory source operand, so if low on registers the compiler could keep the divisor in memory (especially if it's a runtime variable). Loading once per iteration (hitting in cache) is not expensive. But spilling and reloading an actual variable that changes inside the loop (like page_count) could introduce a store/reload latency bottleneck if the loop is short enough and there aren't enough registers. (Although that might not be plausible: if your loop body is big enough to need all the registers, it probably has enough latency to hide a store/reload.)

If somebody put it in front of me, I would rather it was:
const size_t page_limit = 4942;
size_t npages = 0, nitems = 0;
size_t pagelim = foo / page_limit;
size_t resid = foo % page_limit;
while (npages < pagelim || nitems < resid) {
if (++nitems == page_limit) {
/* start new page */
nitems = 0;
npages++;
}
}
Because the program is now expressing the intent of the processing -- a bunch of things in page_limit sized chunks; rather than an attempt to optimize away an operation.
That the compiler might generate nicer code is just a blessing.

Measuing CPU clock speed

I am trying to measure the speed of the CPU.I am not sure how much my method is accurate. Basicly, I tried an empty for loop with values like UINT_MAX but the program terminated quickly so I tried UINT_MAX * 3 and so on...
Then I realized that the compiler is optimizing away the loop, so I added a volatile variable to prevent optimization. The following program takes 1.5 seconds approximately to finish. I want to know how accurate is this algorithm for measuring the clock speed. Also,how do I know how many core's are being involved in the process?
#include <iostream>
#include <limits.h>
#include <time.h>
using namespace std;
int main(void)
{
volatile int v_obj = 0;
unsigned long A, B = 0, C = UINT32_MAX;
clock_t t1, t2;
t1 = clock();
for (A = 0; A < C; A++) {
(void)v_obj;
}
t2 = clock();
std::cout << (double)(t2 - t1) / CLOCKS_PER_SEC << std::endl;
double t = (double)(t2 - t1) / CLOCKS_PER_SEC;
unsigned long clock_speed = (unsigned long)(C / t);
std::cout << "Clock speed : " << clock_speed << std::endl;
return 0;
}

This doesn't measure clock speed at all, it measures how many loop iterations can be done per second. There's no rule that says one iteration will run per clock cycle. It may be the case, and you may have actually found it to be the case - certainly with optimized code and a reasonable CPU, a useless loop shouldn't run much slower than that. It could run at half speed though, some processors are not able to retire more than 1 taken branch every 2 cycles. And on esoteric targets, all bets are off.
So no, this doesn't measure clock cycles, except accidentally. In general it's extremely hard to get an empirical clock speed (you can ask your OS what it thinks the maximum clock speed and current clock speed are, see below), because
If you measure how much wall clock time a loop takes, you must know (at least approximately) the number of cycles per iteration. That's a bad enough problem in assembly, requiring fairly detailed knowledge of the expected microarchitectures (maybe a long chain of dependent instructions that each could only reasonably take 1 cycle, like add eax, 1? a long enough chain that differences in the test/branch throughput become small enough to ignore), so obviously anything you do there is not portable and will have assumptions built into it may become false (actually there is an other answer on SO that does this and assumes that addps has a latency of 3, which it doesn't anymore on Skylake, and didn't have on old AMDs). In C? Give up now. The compiler might be rolling some random code generator, and relying on it to be reasonable is like doing the same with a bear. Guessing the number of cycles per iteration of code you neither control nor even know is just folly. If it's just on your own machine you can check the code, but then you could just check the clock speed manually too so..
If you measure the number of clock cycles elapsed in a given amount of wall clock time.. but this is tricky. Because rdtsc doesn't measure clock cycles (not anymore), and nothing else gets any closer. You can measure something, but with frequency scaling and turbo, it generally won't be actual clock cycles. You can get actual clock cycles from a performance counter, but you can't do that from user mode. Obviously any way you try to do this is not portable, because you can't portably ask for the number of elapsed clock cycles.
So if you're doing this for actual information and not just to mess around, you should probably just ask the OS. For Windows, query WMI for CurrentClockSpeed or MaxClockSpeed, whichever one you want. On Linux there's stuff in /proc/cpuinfo. Still not portable, but then, no solution is.
As for
how do I know how many core's are being involved in the process?
1. Of course your thread may migrate between cores, but since you only have one thread, it's on only one core at any time.

A good optimizer may remove the loop, since
for (A = 0; A < C; A++) {
(void)v_obj;
}
has the same effect on the program state as;
A = C;
So the optimizer is entirely free to unwind your loop.
So you cannot measure CPU speed this way as it depends on the compiler as much as it does on the computer (not to mention the variable clock speed and multicore architecture already mentioned)

Cost of OpenMPI in C++

I have the following program C++ program which uses no communication, and the same identical work is done on all cores, I know that this doesn't use parallel processing at all:
unsigned n = 130000000;
std::vector<double>vec1(n,1.0);
std::vector<double>vec2(n,1.0);
double precision :: t1,t2,dt;
t1 = MPI_Wtime();
for (unsigned i = 0; i < n; i++)
{
// Do something so it's not a trivial loop
vec1[i] = vec2[i]+i;
}
t2 = MPI_Wtime();
dt = t2-t1;
I'm running this program in a single node with two
Intel® Xeon® Processor E5-2690 v3, so I have 24 cores all together. This is a dedicated node, no one else is using it.
Since there is no communication, and each processor is doing the same amount of (identical) work, running it on multiple processors should give the same time. However, I get the following times (averaged time over all cores):
1 core: 0.237
2 cores: 0.240
4 cores: 0.241
8 cores: 0.261
16 cores: 0.454
What could cause the increase in time? Particularly for 16 cores.
I have ran callgrind and I get the roughly same amount of data/instruction misses on all cores (the percentage of misses are the same).
I have repeated the same test on a node with two Intel® Xeon® Processor E5-2628L v2, (16 cores all together), I observe the same increase in execution times. Is this something to do with the MPI implementation?

Considering you are using ~2 GiB of memory per rank, your code is memory-bound. Except for prefetchers you are not operating within the cache but in main memory. You are simply hitting the memory bandwidth at a certain number of active cores.
Another aspect can be turbo mode, if enabled. Turbo mode can increase the core frequency to higher levels if less cores are utilized. As long as the memory bandwidth is not saturated, the higher frequency from turbo core will increase the bandwidth each core gets. This paper discusses the available aggregate memory bandwidth on Haswell processors depending on number of active cores and frequency (Fig 7./8.)
Please note that this has nothing to do with MPI / OpenMPI. You might as well launch the same program X times via any other mean.

I suspect that there are common resources that should be used by your program, so when the number of them increases, there are delays, so that a resource is free'ed so that it can be used by the other process.
You see, you may have 24 cores, but that doesn't mean that all your system allows every core to do everything concurrent. As mentioned in the comments, the memory access is one thing that might cause delays (due to traffic), same thing for disk.
Also consider the interconnection network, which can also suffer from many accesses. In conclusion, notice that these hardware delays are enough to overwhelm the processing time.
General note: Remember how Efficiency of a program is defined:
E = S/p, where S is the speedup and p the number of nodes/processes/threads
Now take Scalability into account. Usually programs are weakly scalable, i.e. that you have to increase with the same rate the size of the problem and p. By increasing only the number of p, while keeping the size of your problem (n in your case) constant, while keeping Efficiency constant, yields a strongly Scalable program.

Your program is not using parallel processing at all. Just because you have compiled it with OpenMP does not make it parallel.
To parallelize the for loop, for example, you need to use the different #pragma's OpenMP offer.
unsigned n = 130000000;
std::vector<double>vec1(n,1.0);
std::vector<double>vec2(n,1.0);
double precision :: t1,t2,dt;
t1 = MPI_Wtime();
#pragma omp parallel for
for (unsigned i = 0; i < n; i++)
{
// Do something so it's not a trivial loop
vec1[i] = vec2[i]+i;
}
t2 = MPI_Wtime();
dt = t2-t1;
However, take into account that for large values of n, the impact of cache misses may hide the perfomance gained with multiple cores.

How to reach AVX computation throughput for a simple loop

Recently I am working on a numerical solver on computational Electrodynamics by Finite difference method.
The solver was very simple to implement, but it is very difficult to reach the theoretical throughput of modern processors, because there is only 1 math operation on the loaded data, for example:
#pragma ivdep
for(int ii=0;ii<Large_Number;ii++)
{ Z[ii] = C1*Z[ii] + C2*D[ii];}
Large_Number is about 1,000,000, but not bigger than 10,000,000
I have tried to manually unroll the loop and write AVX code but failed to make it faster:
int Vec_Size = 8;
int Unroll_Num = 6;
int remainder = Large_Number%(Vec_Size*Unroll_Num);
int iter = Large_Number/(Vec_Size*Unroll_Num);
int addr_incr = Vec_Size*Unroll_Num;
__m256 AVX_Div1, AVX_Div2, AVX_Div3, AVX_Div4, AVX_Div5, AVX_Div6;
__m256 AVX_Z1, AVX_Z2, AVX_Z3, AVX_Z4, AVX_Z5, AVX_Z6;
__m256 AVX_Zb = _mm256_set1_ps(Zb);
__m256 AVX_Za = _mm256_set1_ps(Za);
for(int it=0;it<iter;it++)
{
int addr = addr + addr_incr;
AVX_Div1 = _mm256_loadu_ps(&Div1[addr]);
AVX_Z1 = _mm256_loadu_ps(&Z[addr]);
AVX_Z1 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div1),_mm256_mul_ps(AVX_Za,AVX_Z1));
_mm256_storeu_ps(&Z[addr],AVX_Z1);
AVX_Div2 = _mm256_loadu_ps(&Div1[addr+8]);
AVX_Z2 = _mm256_loadu_ps(&Z[addr+8]);
AVX_Z2 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div2),_mm256_mul_ps(AVX_Za,AVX_Z2));
_mm256_storeu_ps(&Z[addr+8],AVX_Z2);
AVX_Div3 = _mm256_loadu_ps(&Div1[addr+16]);
AVX_Z3 = _mm256_loadu_ps(&Z[addr+16]);
AVX_Z3 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div3),_mm256_mul_ps(AVX_Za,AVX_Z3));
_mm256_storeu_ps(&Z[addr+16],AVX_Z3);
AVX_Div4 = _mm256_loadu_ps(&Div1[addr+24]);
AVX_Z4 = _mm256_loadu_ps(&Z[addr+24]);
AVX_Z4 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div4),_mm256_mul_ps(AVX_Za,AVX_Z4));
_mm256_storeu_ps(&Z[addr+24],AVX_Z4);
AVX_Div5 = _mm256_loadu_ps(&Div1[addr+32]);
AVX_Z5 = _mm256_loadu_ps(&Z[addr+32]);
AVX_Z5 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div5),_mm256_mul_ps(AVX_Za,AVX_Z5));
_mm256_storeu_ps(&Z[addr+32],AVX_Z5);
AVX_Div6 = _mm256_loadu_ps(&Div1[addr+40]);
AVX_Z6 = _mm256_loadu_ps(&Z[addr+40]);
AVX_Z6 = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div6),_mm256_mul_ps(AVX_Za,AVX_Z6));
_mm256_storeu_ps(&Z[addr+40],AVX_Z6);
}
The above AVX loop is actually a bit slower than the Inter compiler generated code.
The compiler generated code can reach about 8G flops/s, about 25% of the single thread theoretical throughput of a 3GHz Ivybridge processor. I wonder if it is even possible to reach the throughput for the simple loop like this.
Thank you!

Improving performance for the codes like yours is "well explored" and still popular area. Take a look at dot-product (perfect link provided by Z Boson already) or at some (D)AXPY optimization discussions (https://scicomp.stackexchange.com/questions/1932/are-daxpy-dcopy-dscal-overkills)
In general , key topics to explore and consider applying are:
AVX2 advantage over AVX due to FMA and better load/store ports u-architecture on Haswell
Pre-Fetching. "Streaming stores", "non-temporal stores" for some platforms.
Threading parallelism to reach max sustained bandwidth
Unrolling (already done by you; Intel Compiler is also capable to do that with #pragma unroll (X) ). Not a big difference for "streaming" codes.
Finally deciding what is a set of hardware platforms you want to optimize your code for
Last bullet is particularly important, because for "streaming" and overall memory-bound codes - it's important to know more about target memory-sybsystems; for example, with existing and especially future high-end HPC servers (2nd gen Xeon Phi codenamed Knights Landing as an example) you may have very different "roofline model" balance between bandwidth and compute, and even different techniques than in case of optimizing for average desktop machine.

Are you sure that 8 GFLOPS/s is about 25% of the maximum throughput of a 3 GHz Ivybridge processor? Let's do the calculations.
Every 8 elements require two single-precision AVX multiplications and one AVX addition. An Ivybridge processor can only execute one 8-wide AVX addition and one 8-wide AVX multiplication per cycle. Also since the addition is dependent on the two multiplications, then 3 cycles are required to process 8 elements. Since the addition can be overlapped with the next multiplication, we can reduce this to 2 cycles per 8 elements. For one billion elements, 2*10^9/8 = 10^9/4 cycles are required. Considering 3 GHz clock, we get 10^9/4 * 10^-9/3 = 1/12 = 0.08 seconds. So the maximum theoretical throughput is 12 GLOPS/s and the compiler-generated code is reaching 66%, which is fine.
One more thing, by unrolling the loop 8 times, it can be vectorized efficiently. I doubt that you'll gain any significant speed up if you unroll this particular loop more than that, especially more than 16 times.

I think the real bottleneck is that there are 2 load and 1 store instructions for every 2 multiplication and 1 addition. Maybe the calculation is memory bandwidth limited. Every element requires transfer 12 bytes of data, and if 2G elements are processed every second (which is 6G flops) that is 24GB/s data transfer, reaching the theoretical bandwidth of ivy bridge. I wonder if this argument holds and there is indeed no solution to this problem.
The reason why I am answering to my own question is to hope someone can correct me before I easily give up the optimization. This simple loop is EXTREMELY important for many scientific solvers, it is the backbone of finite element and finite difference method. If one cannot even feed one processor because the computation is memory bandwith limited, why bother with multicore? A high bandwith GPU or Xeon Phi should be better solutions.

Correct QueryPerformanceCounter Function implementation / Time changes everytime

I have to create a sorting algorithm function that returns number of comparisons, number of copies and number of MICROSECONDS it uses to finish its sorting.
I have seen that to use microseconds i have to use the function QueryPerformance counter as it's accurate (Ps i know it isn't portable between OS)
So i've done that :
void Exchange_sort(int vect[], int dim, int &countconf, int &countcopy, double &time)
{
LARGE_INTEGER a, b, oh, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&a);
QueryPerformanceCounter(&b);
oh.QuadPart = b.QuadPart - a.QuadPart; //Saves in oh the overhead time (?) accuracy
QueryPerformanceCounter(&a);
int i=0,j=0; // The sorting algorithm starts
for (i=0 ; i<dim-1 ; i++)
{ for(j=i+1 ; j<dim; j++ )
{
countconf++; // +1 Comparisons
if (vect[i]>vect[j])
{
scambio ( vect[i],vect[j] ); // It is a function that swaps 2 integers
countcopy=countcopy+3; // +3 copies
}
}
}
QueryPerformanceCounter(&b); // Ends timer
time = ( ( (double)(b.QuadPart - a.QuadPart - oh.QuadPart) /freq.QuadPart )
*1000000 ) ;
}
The *1000000 is actually to give microseconds...
I think like this it should work but everytime i call the function giving it the same dimension of the array, it returns a different time... How can i solve that?
Thank you very much, and sorry for my bad coding

Firstly, the performance counter frequency might not be that great. It's usually several hundred thousand or more, which gives a microsecond or tens of microseconds resolution, but you should be aware that it can be even worse.
Secondly, if your array size is small, your sort might finish in nanoseconds or microseconds, and you would not be able to measure that accurately with QueryPerformanceCounter.
Thirdly, when your benchmark process is running, Windows might take the CPU away from it for a (relatively) long time, milliseconds or maybe even hundreds of milliseconds. This will lead to highly irregular and seemingly erratic timings.
I have two suggestions that you might pursue independently of each other:
I suggest you investigate using the RDTSC instruction (using inline assembly or compiler intrinsics or even an existing library.) Which will most likely give you better resolution with far less overhead. But I have to warn you that it has its own bag of problems.
For this type of benchmark, you have to run your sort routine with the exact same input many times (tens or hundreds) and then take the smallest time measurement. The reason that you should adopt this strategy is that there are a few phenomena that will interfere with your timing and make it longer, but there is nothing that can make your sort go faster than it would on paper. Therefore, you need to run the test many many times and hope to all your gods that the fastest time you've measured is the actual running time with no interference or noise.
UPDATE: Reading through the comments on the question, it seems that you are trying to time a very short-running piece of code with a timer that doesn't have enough resolution. Either increase your input size, or use RDTSC.

The short answer for your question is that it is not possible to measure exactly the same time for all calls of the same function.
The fact that you are receiving different times is expected because your operating system is not a perfect Real-Time System, but a general purpose OS with multiple processes running at the same time and competing to be scheduled by the kernel to get its own CPU cycles.
And also, consider that, each time you execute your program or function, some of its instructions might be located at the RAM, and some might be available at the CPU L1 or L2 cache memory, and it will probably change from one execution to another. So, there are lots of variables to consider when evaluating the elapsed time for function calls using high level of precision.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js