How can I keep the compiler from optimizing away some operation?
For example, I implemented my own sprintf2 and want to compare its performance against the standard library's sprintf, so I wrote this code:
#include<iostream>
#include<cstdio>
#include<ctime>
using namespace std;
int main()
{
    char c[50];
    double d=-2.532343e+23;
    int MAXN=1e8;
    clock_t t1,t2,t3;
    t1=clock();
    for(int i=0;i<MAXN;i++)
        sprintf2(c,"%16.2e",d); // my own implementation of sprintf
    t2=clock();
    for(int i=0;i<MAXN;i++)
        sprintf(c,"%16.2e",d);
    t3=clock();
    printf("sprintf2:%dms\nsprintf:%dms\n",t2-t1,t3-t2);
    return 0;
}
It turns out:
sprintf2:523538ms //something big, I forgot
sprintf:0ms
As we know, sprintf costs time, and MAXN is so big that t3-t2 shouldn't be 0.
Since we never use array c and d is the same every time, I guess the compiler optimized things so that sprintf effectively ran only once.
So here is the question: how can I measure the real time that 1e8 sprintf calls cost?
The compiler optimized away the calls to sprintf because you did not use the result, and because it always printed the same number. So also change the printed number (if you call sprintf with the same arguments in a loop, the compiler is allowed to hoist the call out of the loop).
So just use the result, e.g. by computing a (meaningless) sum of some of the characters.
int s=0;
memset(c, 0, sizeof(c));   // needs #include <cstring>
t1=clock();
for(int i=0;i<MAXN;i++) {
    sprintf2(c,"%16.2e",d+i*1.0e-9);
    s+=c[i%8];
}
t2=clock();
for(int i=0;i<MAXN;i++) {
    sprintf(c,"%16.2e",d+i*1.0e-9);
    s+=c[i%8];
}
t3=clock();
printf("sprintf2:%dms\nsprintf:%dms\ns=%d\n",t2-t1,t3-t2,s);
Then you should be able to compile and benchmark. You probably want to display the time cost of every call:
printf("sprintf2:%f ms\nsprintf:%f ms\n",
1.0e3*(t2-t1)/(double)maxn, 1.0e3*(t3-t2)/(double)maxn);
since POSIX requires CLOCKS_PER_SEC to equal 1000000, a clock tick is one microsecond.
BTW, MAXN (which should be spelled in lower case; all-uppercase names are conventionally reserved for macros!) could be taken from program input (otherwise a clever optimizing compiler could unroll the loop at compile time), e.g.
int main(int argc, char**argv) {
int maxn = argc>1 ? atoi(argv[1]) : 1000000;
Notice that when you are benchmarking, you really should ask the compiler to optimize with -O2. Measuring the speed of unoptimized code is meaningless.
And you can always look at the assembler code (e.g. gcc -O2 -fverbose-asm -S) and check that sprintf2 and sprintf are indeed called in a loop.
BTW on my Linux Debian/Sid/x86-64 i7 3770K desktop:
/// file b.c
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char**argv) {
    int s=0;
    char buf[50];
    memset(buf, 0, sizeof(buf));
    int maxn = (argc>1) ? atoi(argv[1]) : 1000000;
    clock_t t1 = clock();
    for (int i=0; i<maxn; i++) {
        snprintf(buf, sizeof(buf), "%12.3f",
                 123.45678+(i*0.01)*(i%117));
        s += buf[i%8];
    }
    clock_t t2 = clock();
    printf ("maxn=%d s=%d deltat=%.3f sec, each iter=%.3f µsec\n",
            maxn, s, (t2-t1)*1.0e-6, ((double)(t2-t1))/maxn);
    return 0;
}
Compiled with gcc -std=c99 -Wall -O3 b.c -o b (GCC 4.9.2, glibc 2.19), it gives the following consistent timings:
% time ./b 4000000
maxn=4000000 s=191871388 deltat=2.180 sec, each iter=0.545 µsec
./b 4000000 2.18s user 0.00s system 99% cpu 2.184 total
% time ./b 7000000
maxn=7000000 s=339696631 deltat=3.712 sec, each iter=0.530 µsec
./b 7000000 3.71s user 0.00s system 99% cpu 3.718 total
% time ./b 6000000
maxn=6000000 s=290285020 deltat=3.198 sec, each iter=0.533 µsec
./b 6000000 3.20s user 0.00s system 99% cpu 3.203 total
% time ./b 6000000
maxn=6000000 s=290285020 deltat=3.202 sec, each iter=0.534 µsec
./b 6000000 3.20s user 0.00s system 99% cpu 3.207 total
BTW, see this regarding the Windows clock implementation (which might be perceived as buggy). You might be as happy as I am installing and using Linux on your machine (I have never used Windows, but I have been using Unix or POSIX-like systems since 1987).
At least for GCC, the documentation states that this optimisation is not even turned on by default:
Most optimizations are only enabled if an -O level is set on the command line. Otherwise they are disabled, even if individual optimization flags are specified.
as you can read here:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
But that does not match my impression.
So if the expected behaviour does not occur even without an -O parameter (for MSVC you can set the optimisation level in the project properties; I remember there was a "no optimisation" setting), I would say there is no way to turn the optimisations off in the way you want.
But remember that the compiler does a lot of optimisation work that you cannot even express directly in the code, so there is not really a point in "turning off everything" if that is what you are after.
So, going by the documentation, the latter does not seem to be possible.
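If you cannot (or would rather not) rely on disabling optimisation, another common approach is to tell the compiler that a result is observed, so the work producing it cannot be deleted. Below is a minimal GCC/Clang-style sketch; the helper name do_not_optimize is invented for this example (it is the same empty-asm trick used by benchmarking libraries), and snprintf stands in for the sprintf2 from the question:
#include <cstdio>

// Pretend to "use" value in an opaque way, so the compiler must keep the
// computation that produced it (GCC/Clang inline-asm escape; not standard C++).
template <typename T>
inline void do_not_optimize(T const &value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

int main() {
    char buf[50];
    double d = -2.532343e+23;
    for (int i = 0; i < 1000000; ++i) {
        std::snprintf(buf, sizeof(buf), "%16.2e", d + i * 1.0e-9);
        do_not_optimize(buf[0]);   // the call above can no longer be dropped
    }
    return 0;
}
Timing code placed around such a loop (with clock() or std::chrono) then measures the real work even at -O2.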
Related
I have a simple piece of code:
#include <iostream>
#include <chrono>
int main(int argc, char ** argv)
{
    int I=0;
    double time=0.0;
    for(int i=0; i<10; ++i)
    {
        auto begin1=std::chrono::steady_clock::now();
        #pragma omp parallel for simd
        for(int j=0; j<1000000; ++j) I=j;
        auto end1=std::chrono::steady_clock::now();
        auto timei=std::chrono::duration_cast<std::chrono::milliseconds>(end1-begin1).count();
        std::cout<<"time 1:"<<time<<std::endl;
        time+=timei;
        std::cout<<"time 2:"<<time<<std::endl;
    }
    return 0;
}
I use g++ 5.3.1 and this compile line:
cmake . -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS="-O2 -fopenmp"
But the output is:
time 1:0
time 2:11
time 1:11
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
time 1:16
time 2:16
You see, I can't measure the execution time properly using std::chrono!
Why? What is going on? How do I measure the execution time?
This is for the "-O2" and "-O1" compiler optimization flags. With "-O0" everything works properly. Why?
The same thing happens when I use the Intel compiler icpc 19.0.1.144 and the compile line:
cmake . -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_CXX_FLAGS="-march=native -mtune=native -O2 -ipo16 -mcmodel=large"
With the "-O2" and "-O1" compiler flags the time is not measured properly, but if I replace them with "-O0", std::chrono works properly.
Speaking frankly, I am shocked.
But the question is the same: why does execution time measurement using std::chrono not work properly here with "-O1" and "-O2" but work with "-O0"? And how do I properly measure the execution time in this piece of code?
Let me update the code sample:
#include <iostream>
#include <chrono>
#include <ctime>
#include <cstdio>
#include <omp.h>
int array[10000000]{0};
int main(int argc, char ** argv)
{
    clock_t t;
    double time=0.0;
    for(int i=0; i<10; ++i)
    {
        auto begin1=std::chrono::steady_clock::now();
        t=clock();
        #pragma omp parallel for simd
        for(int j=0; j<1000000; ++j) array[j]=j;
        auto end1=std::chrono::steady_clock::now();
        auto timei=std::chrono::duration_cast<std::chrono::milliseconds>(end1-begin1).count();
        std::cout<<"time 1:"<<time<<std::endl;
        time+=timei;
        std::cout<<"time 2:"<<time<<std::endl;
        t=clock()-t;
        printf("\nt%i=%f\n", i, (double)t/CLOCKS_PER_SEC);
    }
    return 0;
}
Now the std::chrono timer updates properly. But sometimes the results of std::clock and std::chrono differ significantly. I suppose std::chrono is more accurate and its timing should be used.
So, as @Hamza answered below, the compiler simply threw away the blocks of code which did nothing. But neither the Intel nor the g++ compiler warned me about anything.
So, for the future: do not write for loops which do nothing. The compiler may simply throw away a piece of code which has no effect.
In my full code I tried to compare the relative performance of two functions returning the same result: one returned the value by interpolating a table, the other calculated it from a formula (the formula is an approximation of the table points). My error was that in the inner loop I wrote the results into temporary stack variables and then did nothing with them, so the compiler threw the work away. I should write the values from the inner loop into an array, or accumulate them in some other way; in short, do something useful with them that the compiler cannot throw away.
That is how I understood it.
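One likely reason std::clock and std::chrono sometimes differ a lot in the updated sample: clock() measures CPU time summed over all threads, while steady_clock measures wall-clock time, so with OpenMP the two can differ by roughly the number of threads. A small sketch of this (compile with -fopenmp; the buffer here exists only for this example and is not part of the code above):
#include <chrono>
#include <cstdio>
#include <ctime>
#include <omp.h>

int data[10000000];   // hypothetical buffer for this sketch only

int main() {
    clock_t c0 = clock();
    auto w0 = std::chrono::steady_clock::now();
    #pragma omp parallel for
    for (int j = 0; j < 10000000; ++j) data[j] = j;
    auto w1 = std::chrono::steady_clock::now();
    clock_t c1 = clock();
    double cpu_ms  = 1000.0 * (c1 - c0) / CLOCKS_PER_SEC;
    double wall_ms = std::chrono::duration<double, std::milli>(w1 - w0).count();
    // With N OpenMP threads, cpu_ms is typically close to N * wall_ms.
    std::printf("cpu: %.3f ms, wall: %.3f ms, threads: %d\n",
                cpu_ms, wall_ms, omp_get_max_threads());
    return 0;
}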
My guess is that the compiler simply optimizes away what you're doing in your loop since it decides it is useless.
With the following code you get some actual ms:
#include <iostream>
#include <chrono>
#include <omp.h>
int main(int argc, char ** argv)
{
    int I=0;
    double time=0.0;
    for(int i=0; i<10; ++i)
    {
        auto begin1=std::chrono::steady_clock::now();
        #pragma omp parallel for simd
        for(int j=0; j<100000000; ++j) I+=j;
        auto end1=std::chrono::steady_clock::now();
        auto timei=std::chrono::duration_cast<std::chrono::milliseconds>(end1-begin1).count();
        std::cout << I << std::endl;
        std::cout<<"time 1:"<<time<<std::endl;
        time+=timei;
        std::cout<<"time 2:"<<time<<std::endl;
    }
    return 0;
}
I get the following output:
887459712
time 1:0
time 2:71
1774919424
time 1:71
time 2:142
-1632588160
time 1:142
time 2:213
-745128448
time 1:213
time 2:283
142331264
time 1:283
time 2:351
1029790976
time 1:351
time 2:419
1917250688
time 1:419
time 2:487
-1490256896
time 1:487
time 2:555
-602797184
time 1:555
time 2:623
284662528
time 1:623
time 2:692
In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking, there is one gigantic array which is divided up into 16-word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

#define ARRAY_LEN 16

// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
    uint64_t c0 = 0;
    for (int i = 0; i < ARRAY_LEN; ++i) {
        uint64_t t0;
        __asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
        c0 += t0;
    }
    return c0;
}

// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
    clock_t t = clock();
    for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
        counts[i] = clz_array(arrays);
    t = clock() - t;
    // Convert clock time to milliseconds
    return t * 1e3 / (double)CLOCKS_PER_SEC;
}

void print_stats(double t_ms, long n, double total_MiB) {
    double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
    printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}

int main(int argc, char *argv[]) {
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
    uint64_t *buf = malloc(total_bytes);
    uint32_t *counts = malloc(sizeof(uint32_t) * n);
    double t_ms, total_MiB = total_bytes / (double)(1 << 20);
    printf("Total size: %.1f MiB\n", total_MiB);
    // Warm up
    t_ms = clz_arrays(counts, buf, n);
    //print_stats(t_ms, n, total_MiB); // (1)
    // Run it
    t_ms = clz_arrays(counts, buf, n); // (2)
    print_stats(t_ms, n, total_MiB);
    // Write something into buf
    for (long i = 0; i < n*ARRAY_LEN; ++i)
        buf[i] = i;
    // And again...
    (void) clz_arrays(counts, buf, n); // (3)
    t_ms = clz_arrays(counts, buf, n); // (4)
    print_stats(t_ms, n, total_MiB);
    free(counts);
    free(buf);
    return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory.
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz" which has a turbo boost of 3.5GHz. The latency of the lzcnt instruction is 3 cycles but it has a throughput of 1 instruction per cycle (see Agner Fog's Skylake instruction tables) so, with 8-byte words (using uint64_t) at 3.5GHz the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GB/s (about 26.1 GiB/s), which is pretty close to what we see in the first number. Even at 2.6GHz we should get close to 20.8 GB/s (about 19.4 GiB/s).
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case. My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz").
Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
Any help would be most appreciated!
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory
Uninitialized memory that malloc gets from the kernel with mmap is all initially copy-on-write mapped to the same physical page of all zeros.
So you get TLB misses but not cache misses. If it used a 4k page, then you get L1D hits. If it used a 2M hugepage, then you only get L3 (LLC) hits, but that's still significantly better bandwidth than DRAM.
Single-core memory bandwidth is often limited by max_concurrency / latency, and often can't saturate DRAM bandwidth. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and the "latency-bound platforms" section of this answer for more about this; it's much worse on many-core Xeon chips than on quad-core desktop/laptops.)
Your first warm-up run will suffer from page faults as well as TLB misses. Also, on a kernel with Meltdown mitigation enabled, any system call will flush the whole TLB. If you added an extra print_stats call to show the warm-up run's performance, that would have made the run after it slower.
You might want to loop multiple times over the same memory inside a timing run, so you don't need so many page-walks from touching so much virtual address space.
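For example, a sketch of that idea based on the question's own clz_arrays (the repeated variant and its name are mine): time several passes over the same buffer and report the per-pass cost, so page faults and TLB warm-up are amortised across passes.
// Sketch: repeat passes over the same data inside one timed region.
double clz_arrays_repeated(uint32_t *counts, const uint64_t *arrays,
                           int narrays, int repeats) {
    clock_t t = clock();
    for (int r = 0; r < repeats; ++r)
        for (int i = 0; i < narrays; ++i)
            counts[i] = clz_array(arrays + (long)i * ARRAY_LEN);
    t = clock() - t;
    return t * 1e3 / (double)CLOCKS_PER_SEC / repeats;   // ms per pass
}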
clock() is not a great way to measure performance. It records time in seconds, not CPU core clock cycles. If you run your benchmark long enough, you don't need really high precision, but you would need to control for CPU frequency to get accurate results. Calling clock() probably results in a system call, which (with Meltdown and Spectre mitigation enabled) flushes TLBs and branch-prediction. It may be slow enough for Skylake to clock back down from max turbo. You don't do any warm-up work after that, and of course you can't because anything after the first clock() is inside the timed interval.
Something based on wall-clock time which can use RDTSC as a timesource instead of switching to kernel mode (like gettimeofday()) would be lower overhead, although then you'd be measuring wall-clock time instead of CPU time. That's basically equivalent if the machine is otherwise idle so your process doesn't get descheduled.
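For instance, a minimal wall-clock variant of the timing in clz_arrays using clock_gettime(CLOCK_MONOTONIC); this sketch assumes clz_array and ARRAY_LEN from the question's code, needs the feature-test macro shown when compiling with -std=c99, and may need -lrt on old glibc:
#define _POSIX_C_SOURCE 200112L   // for clock_gettime/CLOCK_MONOTONIC with -std=c99
#include <stdint.h>
#include <time.h>

// Same loop as clz_arrays, but returning wall-clock milliseconds (sketch).
double clz_arrays_wall(uint32_t *counts, const uint64_t *arrays, int narrays) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
        counts[i] = clz_array(arrays);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) * 1e-6;
}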
For something that wasn't memory-bound, CPU performance counters to count core clock cycles can be very accurate, and without the inconvenience of having to control for CPU frequency. (Although these days you don't have to reboot to temporarily disable turbo and set the governor to performance.)
But with memory-bound stuff, changing core frequency changes the ratio of core to memory, making memory faster or slower relative to the CPU.
I am trying to calculate the number of ticks a function takes to run, using the clock() function like so:
unsigned long time = clock();
myfunction();
unsigned long time2 = clock() - time;
printf("time elapsed : %lu",time2);
But the problem is that the value it returns is always a multiple of 10000, which I think is related to CLOCKS_PER_SEC. Is there a way, or an equivalent function, that is more precise?
I am using Ubuntu 64-bit, but would prefer a solution that also works on other systems like Windows and Mac OS.
There are a number of more accurate timers in POSIX.
gettimeofday() - officially obsolescent, but very widely available; microsecond resolution.
clock_gettime() - the replacement for gettimeofday() (but not necessarily so widely available; on an old version of Solaris, requires -lposix4 to link), with nanosecond resolution.
There are other sub-second timers of greater or lesser antiquity, portability, and resolution, including:
ftime() - millisecond resolution (marked 'legacy' in POSIX 2004; not in POSIX 2008).
clock() - which you already know about. Note that it measures CPU time, not elapsed (wall clock) time.
times() - resolution of CLK_TCK (or HZ) ticks per second. Note that this measures CPU time for parent and child processes.
Do not use ftime() or times() unless there is nothing better. The ultimate fallback, but not meeting your immediate requirements, is
time() - one second resolution.
The clock() function reports in units of CLOCKS_PER_SEC, which is required to be 1,000,000 by POSIX, but the increment may happen less frequently (100 times per second was one common frequency). The return value must be divided by CLOCKS_PER_SEC to get time in seconds.
The most precise (but highly non-portable) way to measure time is to count CPU ticks.
For instance, on x86:
#include <fstream>
#include <sstream>
#include <string>
using namespace std;

unsigned long long int asmx86Time ()
{
    unsigned long long int realTimeClock = 0;
    asm volatile ( "rdtsc\n\t"
                   "salq $32, %%rdx\n\t"
                   "orq %%rdx, %%rax\n\t"
                   "movq %%rax, %0"
                   : "=r" ( realTimeClock )
                   : /* no inputs */
                   : "%rax", "%rdx" );
    return realTimeClock;
}

double cpuFreq ()
{
    ifstream file ( "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq" );
    string sFreq; if ( file ) file >> sFreq;
    stringstream ssFreq ( sFreq ); double freq = 0.;
    if ( ssFreq ) { ssFreq >> freq; freq *= 1000; } // kHz to Hz
    return freq;
}

// Timing
unsigned long long int asmStart = asmx86Time ();
doStuff ();
unsigned long long int asmStop = asmx86Time ();
float asmDuration = ( asmStop - asmStart ) / cpuFreq ();
If you don't have an x86, you'll have to rewrite the assembler code for your CPU. If you need maximum precision, that's unfortunately the only way to go... otherwise use clock_gettime().
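On GCC and Clang you can also avoid hand-written assembler by using the __rdtsc() intrinsic. A minimal sketch (doStuff() is a placeholder; note that on modern CPUs the TSC ticks at a fixed reference frequency, which is not necessarily the value reported by scaling_cur_freq):
#include <stdio.h>
#include <x86intrin.h>   // __rdtsc() on GCC/Clang

int main(void) {
    unsigned long long start = __rdtsc();
    /* ... doStuff() ... */
    unsigned long long stop = __rdtsc();
    printf("elapsed: %llu TSC ticks\n", stop - start);
    return 0;
}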
Per the clock() manpage, on POSIX platforms the value of the CLOCKS_PER_SEC macro must be 1000000. As you say that the return value you're getting from clock() is a multiple of 10000, that would imply that the resolution is 10 ms.
Also note that clock() on Linux returns an approximation of the processor time used by the program. On Linux, again, scheduler statistics are updated when the scheduler runs, at CONFIG_HZ frequency. So if the periodic timer tick is 100 Hz, you get process CPU time consumption statistics with 10 ms resolution.
Walltime measurements are not bound by this, and can be much more accurate. clock_gettime(CLOCK_MONOTONIC, ...) on a modern Linux system provides nanosecond resolution.
I agree with Jonathan's solution. Here is an example using clock_gettime() with nanosecond precision.
#define _XOPEN_SOURCE 600   // for clock_nanosleep/TIMER_ABSTIME
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
    struct timespec ts;
    int ret;
    while(1)
    {
        ret = clock_gettime (CLOCK_MONOTONIC, &ts);
        if (ret)
        {
            perror ("clock_gettime");
            return 1;
        }
        ts.tv_nsec += 20000;   // go to sleep for 20000 ns
        if (ts.tv_nsec >= 1000000000L) {   // carry into tv_sec if needed
            ts.tv_nsec -= 1000000000L;
            ts.tv_sec += 1;
        }
        printf("Print before sleep tid %ld %ld\n", (long)ts.tv_sec, ts.tv_nsec);
        ret = clock_nanosleep (CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
    }
}
Although it's difficult to achieve ns precision, this can be used to get precision below a microsecond (700-900 ns). The printf above is just there to print the thread id (it will easily take 2-3 microseconds just to print a statement).
I was given the following homework assignment:
Write a program to test on your computer how long it takes to do
n log n, n^2, n^5, 2^n, and n! additions for n = 5, 10, 15, 20.
I have written a piece of code, but I always get an execution time of 0. Can anyone help me out with it? Thanks.
#include <iostream>
#include <cmath>
#include <ctime>
using namespace std;
int main()
{
    float n=20;
    time_t start, end, diff;
    start = time (NULL);
    cout<<(n*log(n))*(n*n)*(pow(n,5))*(pow(2,n))<<endl;
    end= time(NULL);
    diff = difftime (end,start);
    cout <<diff<<endl;
    return 0;
}
Better than time() with its one-second precision is to use millisecond precision. A portable way is, e.g.:
#include <ctime>

int main(){
    clock_t start, end;
    double msecs;
    start = clock();
    /* any stuff here ... */
    end = clock();
    msecs = ((double) (end - start)) * 1000 / CLOCKS_PER_SEC;
    return 0;
}
Execute each calculation thousands of times, in a loop, so that you can overcome the low resolution of time and obtain meaningful results. Remember to divide by the number of iterations when reporting results.
This is not particularly accurate but that probably does not matter for this assignment.
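A sketch combining both points (the names reps and sink are invented for this example): time many repetitions with clock() and report the average, keeping a result live so the compiler does not delete the loop.
#include <cmath>
#include <cstdio>
#include <ctime>

int main() {
    const long reps = 10000000;
    double n = 20.0, sink = 0.0;   // sink keeps the work observable
    clock_t start = clock();
    for (long i = 0; i < reps; ++i)
        sink += (n + i % 3) * std::log(n + i % 3);   // the n*log(n) case; vary for n^2, n^5, ...
    clock_t end = clock();
    double total_ms = (end - start) * 1000.0 / CLOCKS_PER_SEC;
    std::printf("total: %.1f ms, per iteration: %.6f ms (sink=%g)\n",
                total_ms, total_ms / reps, sink);
    return 0;
}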
At least on Unix-like systems, time() only gives you 1-second granularity, so it's not useful for timing things that take a very short amount of time (unless you execute them many times in a loop). Take a look at the gettimeofday() function, which gives you the current time with microsecond resolution. Or consider using clock(), which measures CPU time rather than wall-clock time.
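A minimal sketch of the gettimeofday() approach (the loop body is only a placeholder workload):
#include <cstdio>
#include <sys/time.h>

int main() {
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    volatile double x = 0;
    for (long i = 0; i < 10000000; ++i) x += i * 0.5;   // placeholder work to time
    gettimeofday(&t1, NULL);
    double usecs = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    std::printf("elapsed: %.0f microseconds\n", usecs);
    return 0;
}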
Your code executes too quickly to be detected by the time() function, which returns the number of seconds elapsed since 00:00, Jan 1, 1970 UTC.
Try to use this piece of code:
inline long getCurrentTime() {
    timeb timebstr;
    ftime( &timebstr );
    return (long)(timebstr.time)*1000 + timebstr.millitm;
}
To use it you have to include sys/timeb.h.
Actually, the better practice is to repeat your calculations in a loop to get more precise results.
You will probably have to find a more precise platform-specific timer such as the Windows High Performance Timer. You may also (very likely) find that your compiler optimizes or removes almost all of your code.