What's the best timing resolution can i get on Linux

What's the best timing resolution can i get on Linux - c++

I'm trying to measure the time difference between 2 signals on the parallel port, but first i got to know how much accurate and precise is my measuring system (AMD Athlon(tm) 64 X2 Dual Core Processor 5200+ × 2) on SUSE 12.1 x64.
So after some reading i decide to use clock_gettime(), first i get the clock_getres() value using this code:
/*
* This program prints out the clock resolution.
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main( void )
{
struct timespec res;
if ( clock_getres( CLOCK_REALTIME, &res) == -1 ) {
perror( "clock get resolution" );
return EXIT_FAILURE;
}
printf( "Resolution is %ld nano seconds.\n",
res.tv_nsec);
return EXIT_SUCCESS;
}
and the out was: 1 nano second. And i was so happy!!
But here is my problem, when i tried to check that fact with this other code:
#include <iostream>
#include <time.h>
using namespace std;
timespec diff(timespec start, timespec end);
int main()
{
timespec time1, time2, time3,time4;
int temp;
time3.tv_sec=0;
time4.tv_nsec=000000001L;
clock_gettime(CLOCK_REALTIME, &time1);
NULL;
clock_gettime(CLOCK_REALTIME, &time2);
cout<<diff(time1,time2).tv_sec<<":"<<diff(time1,time2).tv_nsec<<endl;
return 0;
}
timespec diff(timespec start, timespec end)
{
timespec temp;
if ((end.tv_nsec-start.tv_nsec)<0) {
temp.tv_sec = end.tv_sec-start.tv_sec-1;
temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
} else {
temp.tv_sec = end.tv_sec-start.tv_sec;
temp.tv_nsec = end.tv_nsec-start.tv_nsec;
}
return temp;
}
this one calculate the time between the two calls of clock_gettime, the time3 and time4 are declared but not used in this example because i was doing tests with them.
The output in this example is fluctuating between 978 and 1467 ns. both numbers are multiples of 489, this make me think that 489 ns is my REAL resolution. far far from the 1 ns obtained above.
My question: is there ANY WAY of getting better results? am i missing something?
I really need at least 10ns resolution for my project. Come on! a GPS can get better resolution than a PC??

I realise this topic is long dead, but wanted to throw in my findings. This is a long answer so I have put the short answer here and those with the patience can wade through the rest. The not-quite-the-answer to the question is 700 ns or 1500 ns depending on which mode of clock_gettime() you used. The long answer is way more complicated.
For reference, the machine I did this work on is an old laptop that nobody wanted. It is an Acer Aspire 5720Z running Ubuntu 14.041 LTS.
The hardware:
RAM: 2.0 GiB // This is how Ubuntu reports it in 'System Settings' → 'Details'
Processor: Intel® Pentium(R) Dual CPU T2330 # 1.60GHz × 2
Graphics: Intel® 965GM x86/MMX/SSE2
I wanted to measure time accurately in an upcoming project and as a relative new comer to PC hardware regardless of operating system, I thought I would do some experimentation on the resolution of the timing hardware. I stumbled across this question.
Because of this question, I decided that clock_gettime() looks like it meets my needs. But my experience with PC hardware in the past has left me under-whelmed so I started fresh with some experiments to see what the actual resolution of the timer is.
The method: Collect successive samples of the result from clock_gettime() and look any patterns in the resolution. Code follows.
Results in a slightly longer Summary:
Not really a result. The stated resolution of the fields in the structure is in nanoseconds. The result of a call to clock_getres() is also tv_sec 0, tv_nsec 1. But previous experience has taught to not trust the resolution from a structure alone. It is an upper limit on precision and reality tends to be a whole lot more complex.
The actual resolution of the clock_gettime() result on my machine, with my program, with my operating system, on one particular day etc turns out to be 70 nanoseconds for mode 0 and 1. 70 ns is not too bad but unfortunately, this is not realistic as we will see in the next point. To complicate matters, the resolution appears to be 7 ns when using modes 2 and 3.
Duration of the clock_gettime() call is more like 1500 ns for modes 0 and 1. It doesn't make sense to me at all to claim 70 ns resolution on the time if it takes 20 times the resolution to get a value.
Some modes of clock_gettime() are faster than others. Modes 2 and 3 are clearly about half the wall-clock time of modes 0 and 1. Modes 0 and 1 are statistically indistinguishable from each other. Modes 2 and 3 are much faster than modes 0 and 1, with mode 3 being the fastest overall.
Before continuing, I better define the modes: Which mode is which?:
Mode 0 CLOCK_REALTIME // reference: http://linux.die.net/man/3/clock_gettime
Mode 1 CLOCK_MONOTONIC
Mode 2 CLOCK_PROCESS_CPUTIME_ID
Mode 3 CLOCK_THREAD_CPUTIME_ID
Conclusion: To me it doesn't make sense to talk about the resolution of the time intervals if the resolution is smaller than the length of time the function takes to get the time interval. For example, if we use mode 3, we know that the function completes within 700 nanoseconds 99% of the time. And we further know that the time interval we get back will be a multiple of 7 nanoseconds. So the 'resolution' of 7 nanoseconds, is 1/100th of the time to do the call to get the time. I don't see any value in the 7 nanosecond change interval. There are 3 different answers to the question of resolution: 1 ns, 7 or 70 ns, and finally 700 or 1500 ns. I favour the last figure.
After all is said and done, if you want to measure the performance of some operation, you need to keep in mind how long the clock_gettime() call takes – that is 700 or 1500 ns. There is no point trying to measure something that takes 7 nanoseconds for example. For the sake of argument, lets say you were willing to live with 1% error on your performance test conclusions. If using mode 3 (which I think I will be using in my project) you would have to say that the interval you need to be measuring needs to be 100 times 700 nanoseconds or 70 microseconds. Otherwise your conclusions will have more than 1% error. So go ahead and measure your code of interest, but if your elapsed time in the code of interest is less that 70 microseconds, then you better go and loop through the code of interest enough times so that the interval is more like 70 microseconds or more.
Justification for these claims and some details:
Claim 3 first. This is simple enough. Just run clock_gettime() a large number of times and record the results in an array, then process the results. Do the processing outside the loop so that the time between clock_gettime() calls is as short as possible.
What does all that mean? See the graph attached. For mode 0 for example, the call to clock_gettime() takes less than 1.5 microseconds most of the time. You can see that mode 0 and mode 1 are basically the same. However, modes 2 and 3 are very different to modes 0 and 1, and slightly different to each other. Modes 2 and 3 take about half the wall-clock time for clock_gettime() compared to modes 0 and 1. Also note that mode 0 and 1 are slightly different to each other – unlike modes 2 and 3. Note that mode 0 and 1 differ by 70 nanoseconds – which is a number which we will come back to in claim #2.
The attached graph is range-limited to 2 microseconds. Otherwise the outliers in the data prevents the graph from conveying the previous point. Something the graph doesn't make clear then is that the outliers for modes 0 and 1 are much worse than the outliers for modes 2 and 3. In other words, not only is the average and the statistical 'mode' (the value which occurs the most) and the median (i.e. the 50th percentile) for all these modes different so is there maximum values and their 99th percentiles.
The graph attached is for 100,001 samples for each of the four modes. Please note that the tests graphed were using a CPU mask of processor 0 only. Whether I used CPU affinity or not didn't seem to make any difference to the graph.
Claim 2: If you look closely at the samples collected when preparing the graph, you soon notice that the difference between the differences (i.e. the 2nd order differences) is relatively constant – at around 70 nanoseconds (fore Modes 0 and 1 at least). To repeat this experiment, collect 'n' samples of clock time as before. Then calculate the differences between each sample. Now sort the differences into order (e.g. sort -g) and then derive the individual unique differences (e.g. uniq -c).
For example:
$ ./Exp03 -l 1001 -m 0 -k | sort -g | awk -f mergeTime2.awk | awk -f percentages.awk | sort -g
1.118e-06 8 8 0.8 0.8 // time,count,cumulative count, count%, cumulative count%
1.188e-06 17 25 1.7 2.5
1.257e-06 9 34 0.9 3.4
1.327e-06 570 604 57 60.4
1.397e-06 301 905 30.1 90.5
1.467e-06 53 958 5.3 95.8
1.537e-06 26 984 2.6 98.4
<snip>
The difference between the durations in the first column is often 7e-8 or 70 nanoseconds. This can become more clear by processing the differences:
$ <as above> | awk -f differences.awk
7e-08
6.9e-08
7e-08
7e-08
7e-08
7e-08
6.9e-08
7e-08
2.1e-07 // 3 lots of 7e-08
<snip>
Notice how all the differences are integer multiples of 70 nanoseconds? Or at least within rounding error of 70 nanoseconds.
This result may well be hardware dependent but I don't actually know what limits this to 70 nanoseconds at this time. Perhaps there is 14.28 MHz oscillator somewhere?
Please note that in practise I use a much larger number of samples such as 100,000, not 1000 as above.
Relevant code (attached):
'Expo03' is the program which calls clock_gettime() as fast as possible. Note that typical usage would be something like:
./Expo03 -l 100001 -m 3
This would call clock_gettime() 100,001 times so that we can compute 100,000 differences. Each call to clock_gettime() in this example would be using mode 3.
MergeTime2.awk is a useful command which is a glorified 'uniq' command. The issue is that the 2nd order differences are often in pairs of 69 and 1 nanosecond, not 70 (for Mode 0 and 1 at least) as I have lead you to believe so far. Because there is no 68 nanosecond difference or a 2 nanosecond difference, I have merged these 69 and 1 nanosecond pairs into one number of 70 nanoseconds. Why the 69/1 behaviour occurs at all is interesting, but treating these as two separate numbers mostly added 'noise' to the analysis.
Before you ask, I have repeated this exercise avoiding floating point, and the same problem still occurs. The resulting tv_nsec as an integer has this 69/1 behaviour (or 1/7 and 1/6) so please don't assume that this is an artefact caused by floating point subtraction.
Please note that I am confident with this 'simplification' for 70 ns and for small integer multiples of 70 ns, but this approach looks less robust for the 7 ns case especially when you get 2nd order differences of 10 times the 7 ns resolution.
percentages.awk and differences.awk attached in case.
Stop press: I can't post the graph as I don't have a 'reputation of at least 10'. Sorry 'bout that.
Rob Watson
21 Nov 2014
Expo03.cpp
/* Like Exp02.cpp except that here I am experimenting with
modes other than CLOCK_REALTIME
RW 20 Nov 2014
*/
/* Added CPU affinity to see if that had any bearing on the results
RW 21 Nov 2014
*/
#include <iostream>
using namespace std;
#include <iomanip>
#include <stdlib.h> // getopts needs both of these
#include <unistd.h>
#include <errno.h> // errno
#include <string.h> // strerror()
#include <assert.h>
// #define MODE CLOCK_REALTIME
// #define MODE CLOCK_MONOTONIC
// #define MODE CLOCK_PROCESS_CPUTIME_ID
// #define MODE CLOCK_THREAD_CPUTIME_ID
int main(int argc, char ** argv)
{
int NumberOf = 1000;
int Mode = 0;
int Verbose = 0;
int c;
// l loops, m mode, h help, v verbose, k masK
int rc;
cpu_set_t mask;
int doMaskOperation = 0;
while ((c = getopt (argc, argv, "l:m:hkv")) != -1)
{
switch (c)
{
case 'l': // ell not one
NumberOf = atoi(optarg);
break;
case 'm':
Mode = atoi(optarg);
break;
case 'h':
cout << "Usage: <command> -l <int> -m <mode>" << endl
<< "where -l represents the number of loops and "
<< "-m represents the mode 0..3 inclusive" << endl
<< "0 is CLOCK_REALTIME" << endl
<< "1 CLOCK_MONOTONIC" << endl
<< "2 CLOCK_PROCESS_CPUTIME_ID" << endl
<< "3 CLOCK_THREAD_CPUTIME_ID" << endl;
break;
case 'v':
Verbose = 1;
break;
case 'k': // masK - sorry! Already using 'm'...
doMaskOperation = 1;
break;
case '?':
cerr << "XXX unimplemented! Sorry..." << endl;
break;
default:
abort();
}
}
if (doMaskOperation)
{
if (Verbose)
{
cout << "Setting CPU mask to CPU 0 only!" << endl;
}
CPU_ZERO(&mask);
CPU_SET(0,&mask);
assert((rc = sched_setaffinity(0,sizeof(mask),&mask))==0);
}
if (Verbose) {
cout << "Verbose: Mode in use: " << Mode << endl;
}
if (Verbose)
{
rc = sched_getaffinity(0,sizeof(mask),&mask);
// cout << "getaffinity rc is " << rc << endl;
// cout << "getaffinity mask is " << mask << endl;
int numOfCPUs = CPU_COUNT(&mask);
cout << "Number of CPU's is " << numOfCPUs << endl;
for (int i=0;i<sizeof(mask);++i) // sizeof(mask) is 128 RW 21 Nov 2014
{
if (CPU_ISSET(i,&mask))
{
cout << "CPU " << i << " is set" << endl;
}
//cout << "CPU " << i
// << " is " << (CPU_ISSET(i,&mask) ? "set " : "not set ") << endl;
}
}
clockid_t cpuClockID;
int err = clock_getcpuclockid(0,&cpuClockID);
if (Verbose)
{
cout << "Verbose: clock_getcpuclockid(0) returned err " << err << endl;
cout << "Verbose: clock_getcpuclockid(0) returned cpuClockID "
<< cpuClockID << endl;
}
timespec timeNumber[NumberOf];
for (int i=0;i<NumberOf;++i)
{
err = clock_gettime(Mode, &timeNumber[i]);
if (err != 0) {
int errSave = errno;
cerr << "errno is " << errSave
<< " NumberOf is " << NumberOf << endl;
cerr << strerror(errSave) << endl;
cerr << "Aborting due to this error" << endl;
abort();
}
}
for (int i=0;i<NumberOf-1;++i)
{
cout << timeNumber[i+1].tv_sec - timeNumber[i].tv_sec
+ (timeNumber[i+1].tv_nsec - timeNumber[i].tv_nsec) / 1000000000.
<< endl;
}
return 0;
}
MergeTime2.awk
BEGIN {
PROCINFO["sorted_in"] = "#ind_num_asc"
}
{array[$0]++}
END {
lastX = -1;
first = 1;
for (x in array)
{
if (first) {
first = 0
lastX = x; lastCount = array[x];
} else {
delta = x - lastX;
if (delta < 2e-9) { # this is nasty floating point stuff!!
lastCount += array[x];
lastX = x
} else {
Cumulative += lastCount;
print lastX "\t" lastCount "\t" Cumulative
lastX = x;
lastCount = array[x];
}
}
}
print lastX "\t" lastCount "\t" Cumulative+lastCount
}
percentages.awk
{ # input is $1 a time interval $2 an observed frequency (i.e. count)
# $3 is a cumulative frequency
b[$1]=$2;
c[$1]=$3;
sum=sum+$2
}
END {
for (i in b) print i,b[i],c[i],(b[i]/sum)*100, (c[i]*100/sum);
}
differences.awk
NR==1 {
old=$1;next
}
{
print $1-old;
old=$1
}

As far as I know, Linux running on a PC will generally not be able to give you timer accuracy in the nanoseconds range. This is mainly due to the type of task/process scheduler used in the kernel. This is as much a result of the kernel as it is of the hardware.
If you need timing with nanosecond resolution I'm afraid that you're out of luck. However you should be able to get micro-second resolution which should be good enough for most scenarios - including your parallel port application.
If you need timing in the nano-seconds range to be accurate to the nano-second you will need a dedicated hardware solution most likely; with a really accurate oscillator (for comparison, the base clock frequency of most x86 CPUs is in the range of mega-hertz before the multipliers)
Finally, if you're looking to replace the functionality of an oscilloscope with your computer that's just not going to work beyond relatively low frequency signals. You'd be much better off investing in a scope - even a simple, portable, hand-held that plugs into your computer for displaying the data.

RDTSCP on your AMD Athlon 64 X2 will give you the time stamp counter with resolution dependent upon your clock. However accuracy is different to resolution, you need to lock thread affinity and disable interrupts (see IRQ routing).
This entails dropping down to assembler or for Windows developers using MSVC 2008 instrinsics.
RedHat with RHEL5 introduced user-space shims that replace gettimeofday with high resolution RDTSCP calls:
http://developer.amd.com/Resources/documentation/articles/Pages/1214200692_5.aspx
https://web.archive.org/web/20160812215344/https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html
Also, check your hardware an AMD 5200 has a 2.6Ghz clock which has 0.4ns interval and the cost of gettimeofday with RDTSCP is 221 cycles that equals 88ns at best.

Related

Code performance strick measurement

I’m creating performance framework tool for measuring individual message processing time in CentOS 7. I reserved one CPU for this task with isolcpus kernel option and I run it using taskset.
Ok, now the problem. I trying to measure the max processing time among several messages. The processing time is <= 1000ns, but when I run many iterations I get very high results (> 10000ns).
Here I created some simple code which does nothing interesting but shows the problem. Depending on the number of iterations i can get results like:
max: 84 min: 23 -> for 1000 iterations
max: 68540 min: 11 -> for 100000000 iterations
I'm trying to understand from where this difference came from? I tried to run this with real-time scheduling with highest priority. Is there some way to prevent that?
#include <iostream>
#include <limits>
#include <time.h>
const unsigned long long SEC = 1000L*1000L*1000L;
inline int64_t time_difference( const timespec &start,
const timespec &stop ) {
return ( (stop.tv_sec * SEC - start.tv_sec * SEC) +
(stop.tv_nsec - start.tv_nsec));
}
int main()
{
timespec start, stop;
int64_t max = 0, min = std::numeric_limits<int64_t>::max();
for(int i = 0; i < 100000000; ++i){
clock_gettime(CLOCK_REALTIME, &start);
clock_gettime(CLOCK_REALTIME, &stop);
int64_t time = time_difference(start, stop);
max = std::max(max, time);
min = std::min(min, time);
}
std::cout << "max: " << max << " min: " << min << std::endl;
}

You can't really reduce jitter to zero even with isolcpus, since you still have at least the following:
1) Interrupts delivered to your CPU (you may be able to reduce this my messing with irq affinity - but probably not to zero).
2) Clock timer interrupts are still scheduled for your process and may do a variable amount of work on the kernel side.
3) The CPU itself may pause briefly for P-state or C-state transitions, or other reasons (e.g., to let voltage levels settle after turning on AVX circuitry, etc).

Let us check the documentation...
Isolation will be effected for userspace processes - kernel threads may still get scheduled on the isolcpus isolated CPUs.
So it seems that there is no guarantee of perfect isolation, at least not from the kernel.

Framelimiter | Why are there extra milliseconds?

After a lot of testing with this thing, I still can't figure out why there are extra milliseconds appended to the millisecond limit.
In this case, the whole running loop should last 4000ms, and then print 4000 followed by some other data, however it is always around 4013ms.
I currently know that the problem isn't the stress testing, since without it, it's still at around 4013ms. Besides, there's is a limit for how long the stress testing can take, and the time is justified by how much rendering is able to be done in the remaining time. I also know that it isn't "SDL_GetTicks" including the time I'm initialising variables, since it only starts timing when it is first called. It's not the time it takes to call the function either, because I tested this with a very lightweight nanosecond timer as well, and the result is the same.
Here's some of my results, that are printed at the end:
4013 100 993 40
4013 100 1000 40
4014 100 1000 40
4012 100 992 40
4015 100 985 40
4013 100 1000 40
4022 100 986 40
4014 100 1000 40
4017 100 993 40
Unlike the third column (the amount of frames rendered), the first column shouldn't vary by much more than the few nanoseconds it took to exit the loops and such. Meaning it shouldn't even show a difference since the scope of the timer is milliseconds in this case.
I recompiled between all of them, and the list is pretty much continues the same.
Here's the code:
#include <iostream>
#include <SDL/SDL.h>
void stress(int n) {
n = n + n - n * n + n * n;
}
int main(int argc, char **argv) {
int running = 100,
timestart = 0, timestep = 0,
rendering = 0, logic = 0,
SDL_Init(SDL_INIT_EVERYTHING);
while(running--) { // - Running loop
timestart = SDL_GetTicks();
std::cout << "logic " << logic++ << std::endl;
for(int i = 0; i < 9779998; i++) { // - Stress testing
if(SDL_GetTicks() - timestart >= 30) { // - Maximum of 30 milliseconds spent running logic
break;
}
stress(i);
}
while(SDL_GetTicks() - timestart < 1) { // - Minimum of one millisecond to run through logic
;
}
timestep = SDL_GetTicks() - timestart;
while(40 > timestep) {
timestart = SDL_GetTicks();
std::cout << "rendering " << rendering++ << std::endl;
while(SDL_GetTicks() - timestart < 1) { // - Maximum of one rendering frame per millisecond
;
}
timestep += SDL_GetTicks() - timestart;
}
}
std::cout << SDL_GetTicks() << " " << logic << " " << rendering << " " << timestep << std::endl;
SDL_Quit();
return 0;
}

Elaborating on crowders comment - if your OS decides to switch tasks you will end up with a random error between 0 and 1 ms(or -.5 and .5 if SDL_GetTicks would be rounding it's internal timer, but your results always greater than expected suggest it is actually truncating). These will 'even out' within your next busy wait, but not near the end of the loop - since there is none 'next busy wait' there. To counteract it you need a reference point from before you start your game loop and compare it with GetTicks to measure how much time have "leaked". Also your approach of X miliseconds per frame and busy waits/breaks in the middle of computation isn't the cleanest i've seen. You should probably google about game loops and read a bit.

C++ Trouble with ctime's clock() method in VS2010 under VMWare Fusion/Boot Camp

I'm having trouble getting anything useful from the clock() method in the ctime library in particular situations on my Mac. Specifically, if I'm trying to run VS2010 in Windows 7 under either VMWare Fusion or on Boot Camp, it always seems to return the same value. Some test code to test the issue:
#include <time.h>
#include "iostream"
using namespace std;
// Calculate the factorial of n recursively.
unsigned long long recursiveFactorial(int n) {
// Define the base case.
if (n == 1) {
return n;
}
// To handle other cases, call self recursively.
else {
return (n * recursiveFactorial(n - 1));
}
}
int main() {
int n = 60;
unsigned long long result;
clock_t start, stop;
// Mark the start time.
start = clock();
// Calculate the factorial of n;
result = recursiveFactorial(n);
// Mark the end time.
stop = clock();
// Output the result of the factorial and the elapsed time.
cout << "The factorial of " << n << " is " << result << endl;
cout << "The calculation took " << ((double) (stop - start) / CLOCKS_PER_SEC) << " seconds." << endl;
return 0;
}
Under Xcode 4.3.3, the function executes in about 2 μs.
Under Visual Studio 2010 in a Windows 7 virtual machine (under VMWare Fusion 4.1.3), the same code gives an execution time of 0; this machine is given 2 of the Mac’s 4 cores and 2GB RAM.
Under Boot Camp running Windows 7, again I get an execution time of 0.
Is this a question of being "too far from the metal"?

It could be that the resolution of the timer is not as high under the virtual machine. The compiler can easily convert the tail recursion into a loop; 60 multiplications don't tend to take a terribly long time. Try computing something significantly more costly, like Fibonacci numbers (recursively, of course) and you should see the timer go on.

From time.h included with MSVC,
#define CLOCKS_PER_SEC 1000
which means clock() only has a resolution of 1 millisecond when using the Visual C++ runtime libraries, so any set of operations that takes less than that will almost always be measured as having zero time elapsed.
For higher resolution timing on Windows that can help you, check out QueryPerformanceCounter
and this sample code.

Timing woes with clock_gettime() CUDA

I wanted to write an CUDA code where I could see firsthand the benefits that CUDA offered for speeding up applications.
Here is is a CUDA code I have written using Thrust ( http://code.google.com/p/thrust/ )
Briefly, all that the code does is create two 2^23 length integer vectors,one on the host and one on the device identical to each other, and sorts them. It also (attempts to) measure time for each.
On the host vector I use std::sort. On the device vector I use thrust::sort.
For compilation I used
nvcc sortcompare.cu -lrt
The output of the program at the terminal is
Desktop: ./a.out
Host Time taken is: 19 . 224622882 seconds
Device Time taken is: 19 . 321644143 seconds
Desktop:
The first std::cout statement is produced after 19.224 seconds as stated. Yet the second std::cout statement (even though it says 19.32 seconds) is produced immediately after the first
std::cout statement. Note that I have used different time_stamps for measurements in clock_gettime() viz ts_host and ts_device
I am using Cuda 4.0 and NVIDIA GTX 570 compute capability 2.0
#include<iostream>
#include<vector>
#include<algorithm>
#include<stdlib.h>
//For timings
#include<time.h>
//Necessary thrust headers
#include<thrust/sort.h>
#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include<thrust/copy.h>
int main(int argc, char *argv[])
{
int N=23;
thrust::host_vector<int>H(1<<N);//create a vector of 2^N elements on host
thrust::device_vector<int>D(1<<N);//The same on the device.
thrust::host_vector<int>dummy(1<<N);//Copy the D to dummy from GPU after sorting
//Set the host_vector elements.
for (int i = 0; i < H.size(); ++i) {
H[i]=rand();//Set the host vector element to pseudo-random number.
}
//Sort the host_vector. Measure time
// Reset the clock
timespec ts_host;
ts_host.tv_sec = 0;
ts_host.tv_nsec = 0;
clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);//Start clock
thrust::sort(H.begin(),H.end());
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);//Stop clock
std::cout << "\nHost Time taken is: " << ts_host.tv_sec<<" . "<< ts_host.tv_nsec <<" seconds" << std::endl;
D=H; //Set the device vector elements equal to the host_vector
//Sort the device vector. Measure time.
timespec ts_device;
ts_device.tv_sec = 0;
ts_device.tv_nsec = 0;
clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);//Start clock
thrust::sort(D.begin(),D.end());
thrust::copy(D.begin(),D.end(),dummy.begin());
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);//Stop clock
std::cout << "\nDevice Time taken is: " << ts_device.tv_sec<<" . "<< ts_device.tv_nsec <<" seconds" << std::endl;
return 0;
}

You are not checking the return value of clock_settime. I would guess it is failing, probably with errno set to EPERM or EINVAL. Read the documentation and always check your return values!
If I'm right, you are not resetting the clock as you think you are, hence the second timing is cumulative with the first, plus some extra stuff you don't intend to count at all.
The right way to do this is to call clock_gettime only, storing the result first, doing the computation, then subtracting the original time from the end time.

Performance issues with C++ (using VC++ 2010): at runtime, my program seems to randomly wait for a while

I'm currently trying to code a certain dynamic programming approach for a vehicle routing problem. At a certain point, I have a partial route that I want to add to a minmaxheap in order to keep the best 100 partial routes at a same stage. Most of the program runs smooth but when I actually want to insert a partial route into the heap, things tend to go a bit slow. That particural code is shown below:
clock_t insert_start, insert_finish, check1_finish, check2_finish;
insert_start = clock();
check2_finish = clock();
if(heap.get_vector_size() < 100) {
check1_finish= clock();
heap.insert(expansion);
cout << "node added" << endl;
}
else {
check1_finish = clock();
if(expansion.get_cost() < heap.find_max().get_cost() ) {
check2_finish = clock();
heap.delete_max();
heap.insert(expansion);
cout<< "worst node deleted and better one added" <<endl;
}
else {
check2_finish = clock();
cout << "cost too high check"<<endl;
}
}
number_expansions++;
cout << "check 1 takes " << check1_finish - insert_start << " ms" << endl;
cout << "check 2 takes " << check2_finish - check1_finish << "ms " << endl;
insert_finish = clock();
cout << "Inserting an expanded state into the heap takes " << insert_finish - insert_start << " clocks" << endl;
A typical output is this:
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 0 clocks
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 16 clocks
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 0 clocks
I know it's hard to say something about the code when this block uses functions that are implemented elsewhere but I'm flabbergasted as to why this sometimes takes less than a ms and sometimes takes up to 16 ms. The program should execute this block thousands of times so these small hiccups are really slowing things down enormously.
My only guess is that something happens with the vector in the heap class that stores all these states but I reserve place for a 100 items in the constructor using vector::reserve so I don't see how this could still be a problem.
Thanks!

Preempting. Your program may be preempted by the operating system, so some other program can run for a bit.
Also, it's not 16 ms. It's 16 clock ticks: http://www.cplusplus.com/reference/clibrary/ctime/clock/
If you want ms, you need to do:
cout << "Inserting an expanded state into the heap takes "
<< (insert_finish - insert_start) * 1000 / CLOCKS_PER_SEC
<< " ms " << endl;
Finally, you're setting insert_finish after printing out the other results. Try setting it immediately after your if/else block. The cout command is a good time to get preempted by another process.

My only guess is that something
happens with the vector in the heap
class that stores all these states but
I reserve place for a 100 items in the
constructor using vector::reserve so I
don't see how this could still be a
problem.
Are you using std::vector to implement it? Insert is taking linear time for std::vector. Also delete max is can take time if you are not using a sorted container.
I will suggest you to use a std::set or std::multiset. Insert, delete and find take always ln(n).

Try to measure time using QueryPerformanceCounter, because I think that clock function could not be very accurate. Probably clock has the same accuracy as windows scheduler - 10 ms for single cpu and 15 or 16 ms for multicore cpu. QueryPerformanceCounter together with QueryPerformanceFreq can give you nanosecond resolution.

It looks like you are measureing "wall time", not CPU time. Windows itself is not a realtime OS. Occasional large hiccups from high-priority things like device drivers is not at all uncommon.
On Windows if I'm manually trying to look for bottlenecks in code, I use RDTSC instead. Even better would be to not do it manually, but use a profiler.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

What's the best timing resolution can i get on Linux - c++

Related

Code performance strick measurement

Framelimiter | Why are there extra milliseconds?

C++ Trouble with ctime's clock() method in VS2010 under VMWare Fusion/Boot Camp

Timing woes with clock_gettime() CUDA

Performance issues with C++ (using VC++ 2010): at runtime, my program seems to randomly wait for a while

Categories

Resources