I'm using the following code snippet to measure my algorithm's running time in terms of clock ticks:
clock_t t;
t = clock();
//run algorithm
t = clock() - t;
printf("It took me %ld clicks (%f seconds).\n", (long)t, ((float)t)/CLOCKS_PER_SEC);
However, this returns 0 when the input size is small. How is this even possible?
The clock has some granularity, dependent on several factors such as your OS.
Therefore, it may happen that your algorithm runs so fast that the clock did not have time to update. Hence the measured duration of 0.
You can try to run your algorithm n times and divide the measured time by n to get a better idea of the time taken on small inputs.
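For example, a minimal sketch of that idea (the repeat count n and the placeholder run_algorithm() are assumptions, not part of the original question):

#include <cstdio>
#include <ctime>

// Placeholder for the algorithm being measured (hypothetical).
void run_algorithm() { /* ... */ }

int main() {
    const int n = 100000;                 // repeat often enough to exceed the clock granularity
    clock_t t = clock();
    for (int i = 0; i < n; ++i)
        run_algorithm();
    t = clock() - t;
    // Average CPU time per run, in seconds.
    printf("%g seconds per run\n", ((double)t / CLOCKS_PER_SEC) / n);
}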
The resolution of the standard C clock() function can vary heavily between systems and is likely too coarse to measure your algorithm. You have 2 options:
Use operating system specific functions
Repeat your algorithm several times until it takes long enough to be measured with clock()
For 1) you can use QueryPerformanceCounter() and QueryPerformanceFrequency() if your program runs under Windows, or clock_gettime() if it runs on Linux (a sketch follows the links below).
Refer to these pages for further details:
QueryPerformanceCounter()
clock_gettime()
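For example, a minimal sketch of option 1 on Linux using clock_gettime() with CLOCK_MONOTONIC (the algorithm call is a placeholder); the Windows variant with QueryPerformanceCounter()/QueryPerformanceFrequency() follows the same pattern:

#include <cstdio>
#include <ctime>

int main() {
    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);    // monotonic wall-clock time
    // ... run the algorithm here ...
    clock_gettime(CLOCK_MONOTONIC, &stop);
    double seconds = (stop.tv_sec - start.tv_sec)
                   + (stop.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.9f s\n", seconds);
}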
For 2) you have to execute your algorithm a given number of times sequentially, so that the time reported by clock() is several orders of magnitude above the minimum granularity of clock(). Let's say clock() only works in steps of 12 microseconds; then the total test run should take at least 1.2 milliseconds, so that your time measurement has at most 1% deviation. Otherwise, if you measure a time of 12 µs, you never know whether it ran for 12.0 µs or maybe 23.9 µs and the next bigger clock() tick just didn't happen yet. The more often your algorithm executes sequentially inside the time measurement, the more exact your time measurement will be. Also be sure to copy-paste the call to your algorithm for sequential executions; if you just use a loop counter in a for-loop, this may severely influence your measurement!
Related
It might be a silly question, but with OpenMP you can distribute the number of operations between all the cores your CPU has. Of course, it is going to be faster 99% of the time, because you went from a single core doing N operations to K cores doing the same amount of operations at the same time.
Despite this, the total amount of clock cycles should be the same, right? Because the number of operations is the same. Or am I wrong?
This question boils down more or less to the difference between CPU time and elapsed time. Indeed, we see here, more often than not, questions which start with "my code doesn't scale, why?", for which the first answer is "How did you measure the time?" (a quick search will turn up many examples).
But to illustrate how things work, let's imagine you have a fixed-size problem for which you have an algorithm that is perfectly parallelized. You have 120 actions to do, each taking 1 second. Then 1 CPU core would take 120 s, 2 cores would take 60 s, 3 cores 40 s, etc.
That is the elapsed time that is decreasing. However, 2 cores, running for 60 seconds in parallel, will consume 120s of CPU time. This means that the overall number of clock cycles won't have reduced compared to having only one CPU core running.
In summary, for a perfectly parallelized problem, you expect to see your elapsed time scaling down perfectly with the number of cores used, and the CPU time to remain constant.
In reality, what you often see is the elapsed time scaling down less than expected, due to parallelization overheads and/or imperfect parallelization. Meanwhile, the CPU time increases slightly with the number of cores used, for the same reasons.
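To make the distinction concrete, here is a minimal sketch, assuming a Linux/glibc system (where clock() reports process CPU time summed over all threads) and a compiler invoked with -fopenmp; the loop is just a stand-in workload:

#include <cstdio>
#include <ctime>
#include <omp.h>

int main() {
    const long N = 200000000;           // stand-in workload size
    double wall0 = omp_get_wtime();     // elapsed (wall-clock) time
    clock_t cpu0 = clock();             // CPU time, summed over all threads on POSIX

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; ++i)
        sum += 1.0 / (i + 1);

    double wall = omp_get_wtime() - wall0;
    double cpu  = (double)(clock() - cpu0) / CLOCKS_PER_SEC;
    printf("elapsed %.2f s, CPU %.2f s (sum %.3f)\n", wall, cpu, sum);
}

Running this with OMP_NUM_THREADS=1, 2, 4, ... should show the elapsed figure shrinking while the CPU figure stays roughly constant, or grows slightly with overhead.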
I think the answer depends on how you define the total amount of clock cycles. If you define it as the sum of all the clock cycles from the different cores then you are correct and there will not be fewer clock cycles. But if you define it as the amount of clock cycles for the "main" core between initiating and completing the distributed operations then it is hopefully fewer.
std::chrono advertises that it can report results down to the nanosecond level. On a typical x86_64 Linux or Windows machine, how accurate would one expect this to be? What would be the error bars for a measurement of 10 ns, 10 µs, 10 ms, and 10 s, for example?
It's most likely hardware and OS dependent. For example, when I ask Windows what the clock frequency is using QueryPerformanceFrequency(), I get 3903987; if you take the inverse of that, you get a clock period, or resolution, of about 256 nanoseconds. This is the value that my operating system reports.
With std::chrono, according to the docs, the minimum representable duration is high_resolution_clock::period::num / high_resolution_clock::period::den.
The num and den are numerator and denominator. std::chrono::high_resolution_clock tells me the numerator is 1, and the denominator is 1 billion, supposedly corresponding to 1 nanosecond:
std::cout << (double)std::chrono::high_resolution_clock::period::num /
             std::chrono::high_resolution_clock::period::den; // prints 1e-09, i.e. one nanosecond
So according to std::chrono I have one-nanosecond resolution, but I don't believe it, because the native OS system call is more likely to be reporting the true frequency/period.
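One way to sanity-check the advertised period is to look at the smallest non-zero step you can actually observe between consecutive now() calls; a rough sketch (the observed step reflects both the clock's real resolution and the overhead of calling now()):

#include <chrono>
#include <cstdio>

int main() {
    using clk = std::chrono::high_resolution_clock;
    auto min_step = clk::duration::max();
    for (int i = 0; i < 1000000; ++i) {
        auto a = clk::now();
        auto b = clk::now();
        if (b > a && b - a < min_step) min_step = b - a;   // keep the smallest non-zero gap
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(min_step);
    printf("smallest observed step: %lld ns\n", (long long)ns.count());
}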
The accuracy will depend upon the application and how this application interacts with the operating system. I am not familiar with chrono specifically, but there are limitations at a lower level you must account for.
For example, if you timestamp network packets using the CPU, the measurement accuracy is very noisy. Even though the precision of the time measurement may be 1 nanosecond, the context switch time for the interrupt corresponding to the packet arrival may be ~1 microsecond. You can accurately measure when your application processes the packet, but not what time the packet arrived.
Short answer: Not accurate in microseconds and below.
Long answer:
I was interested to know how much time my two DP programs take to execute. So I used the chrono library, but when I ran them it reported 0 microseconds, so technically I was unable to compare. I can't increase the array size, because it will not be possible to extend it to 1e8.
So I wrote a sort program to test it and ran it 100 times; the following is the result:
[Screenshot of the measured timings omitted.]
It is clearly visible that the measurements are not consistent for the same input, so I would recommend not relying on it for higher precision.
In my C++ program, I measure CPU time with the clock() command. As the code is executed on a cluster of different computers (all running the same OS, but with different hardware configurations, i.e. different CPUs), I am wondering how to measure the actual execution time. Here is my scenario:
As far as I read, clock() gives the amount of CPU clock ticks that passed since a fixed date. I measure the relative duration by calling clock() a second time and taking the difference.
Now, what defines the internal clock() in C++? If I have CPU A with 1.0 GHz and CPU B with 2.0 GHz and run the same code on them, how many clocks will CPU A and B take to finish? Does clock() correspond to "work done"? Or is it really a "time"?
Edit: As CLOCKS_PER_SEC is not set, I cannot use it for conversion of clocks to runtime in seconds. As the manual says, CLOCKS_PER_SEC depends on the hardware/architecture. That means there is a dependency of the clocks on the hardware. So I really need to know what clock() gives me, without any additional calculation.
The clock() function should return the closest possible representation of the CPU time used, regardless of the clock speed of the CPU. Where the clock speed of the CPU might intervene (but not necessarily) is in the granularity; more often, however, the clock's granularity depends on some external time source. (In the distant past, it was often based on the power line frequency, with a granularity of 1/50 or 1/60 of a second, depending on where you were.)
To get the time in seconds, you divide by CLOCKS_PER_SEC. Be aware, however, that both clock() and CLOCKS_PER_SEC are integral values, so the division is integral. You might want to convert one to double before doing the division. In the past, CLOCKS_PER_SEC also corresponded to the granularity, but modern systems seem to just choose some large value (POSIX requires 1000000, regardless of the granularity); this means that successive return values from clock() will "jump".
Finally, it's probably worth noting that in VC++, clock() is broken, and returns wall clock time rather than CPU time. (This is probably historically conditioned; in the early days, wall clock time was all that was available, and the people at Microsoft probably think that there is code which depends on it.)
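A minimal sketch of the conversion described above (the busy loop is just a stand-in workload):

#include <cstdio>
#include <ctime>

int main() {
    clock_t start = clock();
    volatile double x = 0.0;
    for (long i = 0; i < 100000000; ++i)      // stand-in workload
        x += 1.0;
    clock_t end = clock();
    // Convert to double before dividing, so the result is not truncated to whole seconds.
    double cpu_seconds = (double)(end - start) / CLOCKS_PER_SEC;
    printf("CPU time: %.3f s\n", cpu_seconds);
}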
You can convert clock ticks to seconds by dividing by CLOCKS_PER_SEC.
Note that since C++11 a more appropriate way of measuring elapsed time is by using std::steady_clock.
From man clock:
The value returned is the CPU time used so far as a clock_t; to get the number of seconds used, divide by CLOCKS_PER_SEC. If the processor time used is not available or its value cannot be represented, the function returns the value (clock_t) -1.
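As noted above, since C++11 std::steady_clock is the more appropriate tool for measuring elapsed time; a minimal sketch (the sleep is just a stand-in workload):

#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(250));   // stand-in workload
    auto end = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    printf("elapsed: %.3f ms\n", ms);   // wall-clock time, not CPU time
}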
I'd like to characterize the accuracy of a software timer. I'm not concerned so much about HOW accurate it is, but do need to know WHAT the accuracy is.
I've investigated the C function clock(), and the WinAPI functions QPC and timeGetTime, and I know that they're all hardware dependent.
I'm measuring a process that could take around 5-10 seconds, and my requirements are simple: I only need 0.1 second precision (resolution). But I do need to know what the accuracy is, worst-case.
While more accuracy would be preferred, I would rather know that the accuracy was poor (500 ms) and account for it than believe that the accuracy was better (1 ms) but not be able to document it.
Does anyone have suggestions on how to characterize software clock accuracy?
Thanks
You'll need to distinguish accuracy, resolution and latency.
clock(), GetTickCount() and timeGetTime() are derived from a calibrated hardware clock. Resolution is not great: they are driven by the clock tick interrupt, which by default ticks 64 times per second, or once every 15.625 msec. You can use timeBeginPeriod() to drive that down to 1.0 msec. Accuracy is very good: the clock is calibrated from an NTP server, so you can usually count on it not being off by more than a second over a month.
QPC has a much higher resolution, always better than one microsecond and as little as half a nanosecond on some machines. It however has poor accuracy: the clock source is a frequency picked up from the chipset somewhere. It is not calibrated and has typical electronic tolerances. Use it only to time short intervals.
Latency is the most important factor when you deal with timing. You have no use for a highly accurate timing source if you can't read it fast enough. And that's always an issue when you run code in user mode on a protected mode operating system. Which always has code that runs with higher priority than your code. Particularly device drivers are trouble-makers, video and audio drivers in particular. Your code is also subjected to being swapped out of RAM, requiring a page-fault to get loaded back. On a heavily loaded machine, not being able to run your code for hundreds of milliseconds is not unusual. You'll need to factor this failure mode into your design. If you need guaranteed sub-millisecond accuracy then only a kernel thread with real-time priority can give you that.
A pretty decent timer is the multimedia timer you get from timeSetEvent(). It was designed to provide good service for the kind of programs that require a reliable timer. You can make it tick at 1 msec, and it will catch up with delays when possible. Do note that it is an asynchronous timer; the callback is made on a separate worker thread, so you have to take care of proper thread synchronization.
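A rough sketch of what using that multimedia timer looks like (Windows-only; link against winmm.lib; the 1 ms period and the tick counter are just for illustration):

#include <windows.h>
#include <mmsystem.h>
#include <cstdio>

static volatile LONG g_ticks = 0;

// Called on a worker thread owned by the multimedia timer.
void CALLBACK TimerCallback(UINT, UINT, DWORD_PTR, DWORD_PTR, DWORD_PTR)
{
    InterlockedIncrement(&g_ticks);
}

int main()
{
    timeBeginPeriod(1);                                  // request 1 ms timer resolution
    MMRESULT id = timeSetEvent(1, 1, TimerCallback, 0,   // 1 ms periodic callback
                               TIME_PERIODIC | TIME_CALLBACK_FUNCTION);
    Sleep(1000);                                         // let it run for about a second
    timeKillEvent(id);
    timeEndPeriod(1);
    printf("callbacks in ~1 s: %ld\n", g_ticks);         // ideally close to 1000
}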
Since you've asked for hard facts, here they are:
A typical frequency device controlling HPETs is the CB3LV-3I-14M31818, which specifies a frequency stability of +/- 50 ppm between -40 °C and +85 °C.
A cheaper chip is the CB3LV-3I-66M6660. This device has a frequency stability of +/- 100 ppm between -20°C and 70°C.
As you see, 50 to 100 ppm will result in a drift of 50 to 100 µs/s, 180 to 360 ms/hour, or 4.32 to 8.64 s/day!
Devices controlling the RTC are typically somewhat better: the RV-8564-C2 RTC module provides tolerances of +/- 10 to 20 ppm. Tighter tolerances are typically available in military versions or on request. The deviation of this source is a factor of 5 less than that of the HPET. However, it is still 0.86 s/day.
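The conversion used above is simply drift = ppm * 1e-6 * elapsed time; a tiny sketch of that arithmetic for a few illustrative tolerance values:

#include <cstdio>

int main() {
    // drift per day in seconds = ppm * 1e-6 * 86400 s
    for (double ppm : {10.0, 20.0, 50.0, 100.0})
        printf("%6.1f ppm -> %.2f s/day\n", ppm, ppm * 1e-6 * 86400.0);
}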
All of the above values are maximum values as specified in the data sheet. Typical values may be considerably less, as mentioned in my comment, they are in the few ppm range.
The frequency values are also subject to thermal drift. The result of QueryPerformanceCounter() may be heavily influenced by thermal drift on systems operating with the ACPI Power Management Timer chip (example).
More information about timers: Clock and Timer Circuits.
For QPC, you can call QueryPerformanceFrequency to get the rate it updates at. Unless you are using time(), you will get better than 0.5 s timing accuracy anyway, but clock() isn't all that accurate - quite often 10 ms steps [although apparently CLOCKS_PER_SEC is standardized at 1 million, making the numbers APPEAR more accurate].
If you do something along these lines, you can figure out how small a gap you can measure [although at REALLY high frequency you may not be able to notice how small, e.g. a timestamp counter that updates every clock cycle but takes 20-40 clock cycles to read]:
#include <ctime>
#include <iostream>
using namespace std;

int main()
{
    time_t t, t1;
    t = time(NULL);
    // wait for the next "second" to tick on.
    while (t == (t1 = time(NULL))) /* do nothing */ ;

    clock_t old = 0;
    clock_t min_diff = 1000000000;
    clock_t start, end;
    start = clock();
    int count = 0;
    while (t1 == time(NULL))
    {
        clock_t c = clock();
        if (old != 0 && c != old)
        {
            count++;
            clock_t diff;
            diff = c - old;
            if (min_diff > diff) min_diff = diff;
        }
        old = c;
    }
    end = clock();
    cout << "Clock changed " << count << " times" << endl;
    cout << "Smallest difference " << min_diff << " ticks" << endl;
    cout << "One second ~= " << end - start << " ticks" << endl;
}
Obviously, you can apply the same principle to other time sources.
(Not compile-tested, but hopefully not too full of typos and mistakes)
Edit:
So, if you are measuring times in the range of 10 seconds, a timer that runs at 100 Hz would give you 1000 "ticks". But it could be 999 or 1001, depending on your luck and whether you catch it just right/wrong, so that's 2000 ppm there - and the clock input may vary too, but that's a much smaller variation, ~100 ppm at most. On Linux, clock() is updated at 100 Hz (the actual timer that runs the OS may run at a higher frequency, but clock() in Linux will update at 100 Hz, or 10 ms intervals), and it only counts when the CPU is being used, so sitting 5 seconds waiting for user input is 0 time.
On Windows, clock() measures the actual time, same as your wrist watch does, not just the CPU time being used, so 5 seconds spent waiting for user input is counted as 5 seconds of time. I'm not sure how accurate it is.
The other problem you will find is that modern systems are not very good at repeatable timing in general - no matter what you do, the OS, the CPU and the memory all conspire together to make life a misery for getting the same amount of time for two runs. CPUs these days often run with an intentionally variable clock (it's allowed to drift by about 0.1-0.5%) to reduce the electromagnetic radiation spikes that can "sneak out" of that nicely sealed computer box during EMC (electromagnetic compatibility) testing.
In other words, even if you can get a very standardized clock, your test results will vary up and down a bit, depending on OTHER factors that you can't do anything about...
In summary, unless you are looking for a number to fill into a form that requires you to have a ppm figure for your clock accuracy, and it's a government form that you can't NOT fill that information into, I'm not entirely convinced it's very useful to know the accuracy of the timer used to measure the time itself, because other factors will play AT LEAST as big a role.
I want to calculate performance of a function in micro second precision on Windows platform.
Now, Windows itself has millisecond granularity, so how can I achieve this?
I tried the following sample, but I am not getting correct results.
#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER ticksPerSecond = {0};
    LARGE_INTEGER tick_1 = {0};
    LARGE_INTEGER tick_2 = {0};
    double uSec = 1000000;

    // Get the frequency
    QueryPerformanceFrequency(&ticksPerSecond);

    // Calculate ticks per microsecond
    double uFreq = ticksPerSecond.QuadPart / uSec;

    // Get counter before start of op
    QueryPerformanceCounter(&tick_1);

    // The op itself
    Sleep(10);

    // Get counter after op finished
    QueryPerformanceCounter(&tick_2);

    // And now the op time in uSec
    double diff = (tick_2.QuadPart / uFreq) - (tick_1.QuadPart / uFreq);
    printf("%f microseconds\n", diff);
}
Run the operation in a loop a million times or so and divide the result by that number. That way you'll get the average execution time over that many executions. Timing one (or even a hundred) executions of a very fast operation is very unreliable, due to multitasking and whatnot.
compile it
look at the assembler output
count the number of each instruction in your function
apply the cycles per instruction on your target processor
end up with a cycle count
multiply by the clock speed you are running at
apply arbitrary scaling factors to account for cache misses and branch mis-predictions lol
(man I am so going to get down-voted for this answer)
No, you are probably getting an accurate result; QueryPerformanceCounter() works well for timing short intervals. What's wrong is your expectation of the accuracy of Sleep(). It has a resolution of 1 millisecond; its accuracy is far worse - no better than about 15.625 milliseconds on most Windows machines.
To get it anywhere close to 1 millisecond, you'll have to call timeBeginPeriod(1) first. That will probably improve the match, ignoring the jitter you'll get from Windows being a multi-tasking operating system.
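A rough sketch of what that looks like (Windows-only; link against winmm.lib for timeBeginPeriod; the measured value will still jitter from run to run):

#include <windows.h>
#include <cstdio>

int main()
{
    timeBeginPeriod(1);                       // request 1 ms scheduler/timer resolution

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    Sleep(1);                                 // nominally 1 ms
    QueryPerformanceCounter(&t1);

    double ms = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
    printf("Sleep(1) actually took %.3f ms\n", ms);

    timeEndPeriod(1);
}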
If you're doing this for offline profiling, a very simple way is to run the function 1000 times, measure to the closest millisecond and divide by 1000.
To get finer resolution than 1 ms, you will have to consult your OS documentation. There may be APIs to get timer resolution in microsecond resolution. If so, run your application many times and take the averages.
I like Matti Virkkunen's answer: check the time, call the function a large number of times, check the time when you finish, and divide by the number of times you called the function. He did mention you might be off due to OS interrupts. You might vary the number of times you make the call and see a difference. Can you raise the priority of the process? Can you get it so all the calls happen within a single OS time slice?
Since you don't know when the OS might swap you out, you can put this all inside a larger loop that repeats the whole measurement a large number of times, and save the smallest number, as that is the one that had the fewest OS interrupts. It may still be greater than the actual time for the function to execute, because it may still contain some OS interrupts.
Sanjeet,
It looks (to me) like you're doing this exactly right. QueryPerformanceCounter is a perfectly good way to measure short periods of time with a high degree of precision. If you're not seeing the result you expected, it's most likely because the sleep isn't sleeping for the amount of time you expected it to! However, it is likely being measured correctly.
I want to go back to your original question about how to measure time on Windows with microsecond precision. As you already know, the high performance counter (i.e. QueryPerformanceCounter) "ticks" at the frequency reported by QueryPerformanceFrequency. That means that you can measure time with precision equal to:
1/frequency seconds
On my machine, QueryPerformanceFrequency reports 2337910 (counts/sec). That means that my computer's QPC can measure with precision 4.277e-7 seconds, or 0.427732 microseconds. That means that the smallest bit of time I can measure is 0.427732 microseconds. This, of course, gives you the precision that you originally asked for :) Your machine's frequency should be similar, but you can always do the math and check it.
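A quick sketch of doing that math on your own machine:

#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);          // counts per second
    double period_us = 1e6 / (double)freq.QuadPart;
    printf("QPC frequency: %lld Hz -> smallest measurable step: %.6f microseconds\n",
           (long long)freq.QuadPart, period_us);
}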
Or you can use gettimeofday(), which gives you a timeval struct that is a timestamp (down to µs).
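A minimal sketch of that approach (POSIX only; note that gettimeofday() reports wall-clock time, and the function being measured is a placeholder):

#include <sys/time.h>
#include <cstdio>

int main()
{
    struct timeval start, stop;
    gettimeofday(&start, NULL);
    // ... run the function being measured here ...
    gettimeofday(&stop, NULL);
    long usec = (stop.tv_sec - start.tv_sec) * 1000000L
              + (stop.tv_usec - start.tv_usec);
    printf("elapsed: %ld microseconds\n", usec);
}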