I wanted to write a CUDA code where I could see firsthand the benefits that CUDA offers for speeding up applications.
Here is a CUDA code I have written using Thrust ( http://code.google.com/p/thrust/ ).
Briefly, all the code does is create two integer vectors of length 2^23, one on the host and one on the device, identical to each other, and sort them. It also (attempts to) measure the time for each.
On the host vector I use std::sort. On the device vector I use thrust::sort.
For compilation I used
nvcc sortcompare.cu -lrt
The output of the program at the terminal is
Desktop: ./a.out
Host Time taken is: 19 . 224622882 seconds
Device Time taken is: 19 . 321644143 seconds
Desktop:
The first std::cout statement is produced after 19.224 seconds as stated. Yet the second std::cout statement (even though it says 19.32 seconds) is produced immediately after the first std::cout statement. Note that I have used different timestamps for the measurements in clock_gettime(), viz. ts_host and ts_device.
I am using CUDA 4.0 and an NVIDIA GTX 570 (compute capability 2.0).
#include<iostream>
#include<vector>
#include<algorithm>
#include<stdlib.h>
//For timings
#include<time.h>
//Necessary thrust headers
#include<thrust/sort.h>
#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include<thrust/copy.h>
int main(int argc, char *argv[])
{
int N=23;
thrust::host_vector<int>H(1<<N);//create a vector of 2^N elements on host
thrust::device_vector<int>D(1<<N);//The same on the device.
thrust::host_vector<int>dummy(1<<N);//Copy the D to dummy from GPU after sorting
//Set the host_vector elements.
for (int i = 0; i < H.size(); ++i) {
H[i]=rand();//Set the host vector element to pseudo-random number.
}
//Sort the host_vector. Measure time
// Reset the clock
timespec ts_host;
ts_host.tv_sec = 0;
ts_host.tv_nsec = 0;
clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);//Start clock
thrust::sort(H.begin(),H.end());
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);//Stop clock
std::cout << "\nHost Time taken is: " << ts_host.tv_sec<<" . "<< ts_host.tv_nsec <<" seconds" << std::endl;
D=H; //Set the device vector elements equal to the host_vector
//Sort the device vector. Measure time.
timespec ts_device;
ts_device.tv_sec = 0;
ts_device.tv_nsec = 0;
clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);//Start clock
thrust::sort(D.begin(),D.end());
thrust::copy(D.begin(),D.end(),dummy.begin());
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);//Stop clock
std::cout << "\nDevice Time taken is: " << ts_device.tv_sec<<" . "<< ts_device.tv_nsec <<" seconds" << std::endl;
return 0;
}
You are not checking the return value of clock_settime. I would guess it is failing, probably with errno set to EPERM or EINVAL. Read the documentation and always check your return values!
If I'm right, you are not resetting the clock as you think you are, hence the second timing is cumulative with the first, plus some extra stuff you don't intend to count at all.
The right way to do this is to call clock_gettime only, storing the result first, doing the computation, then subtracting the original time from the end time.
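A minimal sketch of that pattern, assuming a POSIX system. The helper names elapsed_seconds and time_sort are made up for illustration, and CLOCK_MONOTONIC is my own choice of clock (it measures wall-clock time, so it also covers time spent waiting on the device, which a CPU-time clock may not):

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <time.h>
#include <vector>

// Difference between two timespec readings, in seconds.
double elapsed_seconds(const timespec& start, const timespec& stop)
{
    return static_cast<double>(stop.tv_sec - start.tv_sec)
         + static_cast<double>(stop.tv_nsec - start.tv_nsec) / 1e9;
}

// Sort n pseudo-random ints and return the elapsed time in seconds,
// or -1.0 if the clock could not be read.
double time_sort(std::size_t n)
{
    std::vector<int> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = std::rand();

    timespec start, stop;
    // Read the clock before the work; never try to set it.
    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0) {
        std::perror("clock_gettime");
        return -1.0;
    }
    std::sort(v.begin(), v.end());
    if (clock_gettime(CLOCK_MONOTONIC, &stop) != 0) {
        std::perror("clock_gettime");
        return -1.0;
    }
    return elapsed_seconds(start, stop);
}
```

Calling time_sort(1 << 23) would time the host sort; the device sort can be timed the same way by taking the two readings around thrust::sort and the copy back.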
As silly as it seems, I would like to know whether there may be pitfalls when trying to reconcile the time costs for a for loop, as measured
either from time points just outside the for loop (global or external time cost)
or, from time points being inside the loop, and being cumulatively considered (local or internal time cost) ?
The example below illustrates my difficulties getting two equal measurements:
#include <iostream>
#include <vector> // std::vector
#include <ctime> // clock(), ..
int main(){
clock_t clockStartLoop;
double timeInternal(0)// the time cost of the loop, summing all time costs of commands within the "for" loop
, timeExternal // time cost of the loop, as measured outside the boundaries of "for" loop
;
std::vector<int> vecInt; // will be [0,1,..,10000] after the loop below
clock_t costExternal(clock());
for(int i=0;i<10000;i++){
clockStartLoop = clock();
vecInt.push_back(i);
timeInternal += clock() - clockStartLoop; // incrementing internal time cost
}
timeInternal /= CLOCKS_PER_SEC;
timeExternal = (clock() - costExternal)/(double)CLOCKS_PER_SEC;
std::cout << "timeExternal = "<< timeExternal << " s ";
std::cout << "vs timeInternal = " << timeInternal << std::endl;
std::cout << "We have a ratio of " << timeExternal/timeInternal << " between the two.." << std::endl;
}
I usually get a ratio around 2 as output e.g.
timeExternal = 0.008407 s vs timeInternal = 0.004287
We have a ratio of 1.96105 between the two..
, whereas I was hoping a ratio closer to 1.
Is it just because there are operations internal to the loop which are not measured by the clock() difference (such as incrementing timeInternal) ?
Could the i++ operation in the for(..) be non-negligible in the external measurement and also explain the difference with the internal one ?
I'm actually dealing with a more complex code and I would like to isolate time costs within a loop, being sure that all the time slices I consider do make up a complete pie (which I never achieved until now..). Thanks a lot
timeExternal = 0.008407 s vs timeInternal = 0.004287 We have a ratio of 1.96105 between the two..
A ratio of ~2 is to be expected - by far the heaviest call in your loop is clock() itself (on most systems clock() is a syscall to the kernel).
Imagine that clock() implementation looks like the following pseudocode:
clock_t clock() {
go_to_kernel(); // very long operation
clock_t rc = query_process_clock();
return_from_kernel(); // very long operation
return rc;
}
Now going back to the loop, we can annotate the places where time is spent:
for(int i=0;i<10000;i++){
// go_to_kernel - very long operation
clockStartLoop = clock();
// return_from_kernel - very long operation
vecInt.push_back(i);
// go_to_kernel - very long operation
timeInternal += clock() - clockStartLoop;
// return_from_kernel - very long operation
}
So between the two calls to clock() we have 2 long operations, with a total in the loop of 4. Hence the ratio of 2-to-1.
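To check this on your own system, here is a rough sketch that estimates the average cost of a single clock() call by timing many calls from outside the loop. The function name average_clock_call_cost is hypothetical, and the figure it returns will vary a lot between systems:

```cpp
#include <ctime>

// Average cost of one clock() call, in seconds, estimated over `calls`
// calls and timed from outside the loop.
double average_clock_call_cost(int calls)
{
    volatile std::clock_t sink = 0;   // gives every result a visible use
    const std::clock_t begin = std::clock();
    for (int i = 0; i < calls; ++i) {
        sink = std::clock();
    }
    const std::clock_t end = std::clock();
    (void)sink;
    return static_cast<double>(end - begin) / CLOCKS_PER_SEC / calls;
}
```

The loop in the question makes two clock() calls per iteration, so multiplying the result by 2 * 10000 gives a figure to compare against timeExternal.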
Is it just because there are operations internal to the loop which are not measured by the clock() difference (such as incrementing timeInternal) ?
No, incrementing timeInternal is negligible.
Could the i++ operation in the for(..) be non-negligible in the external measurement and also explain the difference with the internal one ?
No, i++ is also negligible. Remove the inner calls to clock() and you will see a much faster execution time. On my system it was 0.00003 s.
The next most expensive operation after clock() is vector::push_back(), because it occasionally needs to reallocate the vector's storage. This cost is amortized by geometric capacity growth and can be eliminated entirely by calling vector::reserve() before entering the loop.
Conclusion: when benchmarking, make sure to time entire loops, not individual iterations. Better yet, use frameworks like Google Benchmark, which will help to avoid many other pitfalls (like compiler optimizations). There's also quick-bench.com for simple cases (based on Google Benchmark).
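As a small illustration of timing the entire loop, here is a sketch that reserves the storage up front and takes only two clock readings, one on each side of the loop. The function name time_whole_loop is made up, and using std::chrono::steady_clock (C++11) instead of clock() is my own substitution:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Fill `out` with 0..n-1 and return the elapsed time of the whole loop
// in seconds. The clock is read only twice, outside the loop.
double time_whole_loop(int n, std::vector<int>& out)
{
    out.clear();
    out.reserve(static_cast<std::size_t>(n));   // no reallocation while timing
    const auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        out.push_back(i);
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}
```

A single run of 10000 push_back calls is still very short, so for stable numbers the measurement would need to be repeated, which is the kind of thing a benchmarking framework does for you.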
I have a problem printing a sparse matrix in a c++/mpi program that I hope you could help me solve.
Problem: I need to print a sparse matrix as a list of triples (x, y, v_xy) to a .txt file in a program that has been parallelized with MPI. Since I am new to MPI, I decided not to deal with the parallel I/O routines provided by the library and to let the master process (rank 0 in my case) print the output. However, the time for printing the matrix increases when I increase the number of processors:
1 processor: 11.7 s
2 processors: 26.4 s
4 processors: 25.4 s
I have already verified that the output is exactly the same in the three cases. Here is the relevant section of the code:
if (rank == 0)
{
sw.start();
std::ofstream ofs_output(output_file);
targets.print(ofs_output);
ofs_output.close();
sw.stop();
time_output = sw.get_duration();
std::cout << time_output << std::endl;
}
My stopwatch sw is measuring wall clock time using the gettimeofday function.
The print method for the targets matrix is the following:
void sparse_matrix::print(std::ofstream &ofs)
{
int temp_row;
for (const_iterator iter_row = _matrix.begin(); iter_row != _matrix.end(); ++iter_row)
{
temp_row = (*iter_row).get_key();
for (value_type::const_iterator iter_col = (*iter_row).get_value().begin();
iter_col != (*iter_row).get_value().end(); ++iter_col)
{
ofs << temp_row << "," << (*iter_col).get_key() << "," << (*iter_col).get_value() << std::endl;
}
}
}
I do not understand what is causing the slow-down since only processor 0 does the output and this is the very last operation of the program: all the other processors are done while processor 0 prints the output. Do you have any idea?
Well, I finally understood what was causing the problem. Running my program, parallelized on MPI, on a linux virtual machine drastically increased the time for printing a large amount of data in a .txt file when increasing the number of cores used. The problem is caused by the virtual machine, which does not behave correctly when using MPI. I tested the same program on a physical 8-core machine and the time for printing the output does not increase with the number of cores used.
I just started to play with Boost.Compute. To see how much speed it can bring us, I wrote a simple program:
#include <iostream>
#include <vector>
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <numeric>
#include <boost/foreach.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/platform.hpp>
#include <boost/compute/algorithm.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <boost/compute/types/builtin.hpp>
#include <boost/compute/function.hpp>
#include <boost/chrono/include.hpp>
namespace compute = boost::compute;
int main()
{
// generate random data on the host
std::vector<float> host_vector(16000);
std::generate(host_vector.begin(), host_vector.end(), rand);
BOOST_FOREACH (auto const& platform, compute::system::platforms())
{
std::cout << "====================" << platform.name() << "====================\n";
BOOST_FOREACH (auto const& device, platform.devices())
{
std::cout << "device: " << device.name() << std::endl;
compute::context context(device);
compute::command_queue queue(context, device);
compute::vector<float> device_vector(host_vector.size(), context);
// copy data from the host to the device
compute::copy(
host_vector.begin(), host_vector.end(), device_vector.begin(), queue
);
auto start = boost::chrono::high_resolution_clock::now();
compute::transform(device_vector.begin(),
device_vector.end(),
device_vector.begin(),
compute::sqrt<float>(), queue);
auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);
auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
std::cout << "ans: " << ans << std::endl;
std::cout << "time: " << duration.count() << " ms" << std::endl;
std::cout << "-------------------\n";
}
}
std::cout << "====================plain====================\n";
auto start = boost::chrono::high_resolution_clock::now();
std::transform(host_vector.begin(),
host_vector.end(),
host_vector.begin(),
[](float v){ return std::sqrt(v); });
auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
std::cout << "ans: " << ans << std::endl;
std::cout << "time: " << duration.count() << " ms" << std::endl;
return 0;
}
And here's the sample output on my machine (win7 64-bit):
====================Intel(R) OpenCL====================
device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ans: 1931421
time: 64 ms
-------------------
device: Intel(R) HD Graphics 4600
ans: 1931421
time: 64 ms
-------------------
====================NVIDIA CUDA====================
device: Quadro K600
ans: 1931421
time: 4 ms
-------------------
====================plain====================
ans: 1931421
time: 0 ms
My question is: why is the plain (non-opencl) version faster?
As others have said, there is most likely not enough computation in your kernel to make it worthwhile to run on the GPU for a single set of data (you're being limited by kernel compilation time and transfer time to the GPU).
To get better performance numbers, you should run the algorithm multiple times (and most likely throw out the first one as that will be far greater because it includes the time to compile and store the kernels).
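A generic sketch of that idea, not tied to Boost.Compute: run the work once untimed, then average over several timed runs. The helper name average_ms_after_warmup is hypothetical. For device work, I am assuming the callable blocks until the device has finished (for example by ending with queue.finish()), otherwise only the enqueue cost would be timed:

```cpp
#include <chrono>
#include <functional>

// Run `work` once as an untimed warm-up, then `runs` more times, and
// return the average time per timed run in milliseconds.
double average_ms_after_warmup(const std::function<void()>& work, int runs)
{
    work();   // warm-up: absorbs one-time costs such as kernel compilation
    const auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count() / runs;
}
```

In the program from the question, the lambda passed in would wrap the transform and accumulate calls for one device.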
Also, instead of running transform() and accumulate() as separate operations, you should use the fused transform_reduce() algorithm which performs both the transform and reduction with a single kernel. The code would look like this:
float ans = 0;
compute::transform_reduce(
device_vector.begin(),
device_vector.end(),
&ans,
compute::sqrt<float>(),
compute::plus<float>(),
queue
);
std::cout << "ans: " << ans << std::endl;
You can also compile code using Boost.Compute with the -DBOOST_COMPUTE_USE_OFFLINE_CACHE which will enable the offline kernel cache (this requires linking with boost_filesystem). Then the kernels you use will be stored in your file system and only be compiled the very first time you run your application (NVIDIA on Linux already does this by default).
I can see one possible reason for the big difference. Compare the CPU and the GPU data flow:-
CPU                   GPU
                      copy data to GPU
                      set up compute code
calculate sqrt        calculate sqrt
sum                   sum
                      copy data from GPU
Given this, it appears that the Intel chip is just a bit rubbish at general compute, while the NVidia is probably suffering from the extra data copying and from setting up the GPU to do the calculation.
You should try the same program but with a much more complex operation - sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try calculating Mandelbrot points, for instance.
In your example, moving the lambda into the accumulate would be faster (one pass over memory vs. two passes)
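A sketch of what that could look like on the host side; the function name sum_of_sqrt is made up for illustration. Note two differences from the question's code: the initial value is 0.0f, so the sum is accumulated in float (passing the integer 0 makes std::accumulate accumulate in int), and the input vector is left unmodified:

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Sum of square roots computed in a single pass over the data.
float sum_of_sqrt(const std::vector<float>& values)
{
    return std::accumulate(values.begin(), values.end(), 0.0f,
                           [](float acc, float v) { return acc + std::sqrt(v); });
}
```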
You're getting bad results because you're measuring time incorrectly.
An OpenCL device has its own time counters, which aren't related to the host counters. Every OpenCL task has 4 states, and the timer for each can be queried (from the Khronos web site):
CL_PROFILING_COMMAND_QUEUED, when the command identified by event is enqueued in a command-queue by the host
CL_PROFILING_COMMAND_SUBMIT, when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.
CL_PROFILING_COMMAND_START, when the command identified by event starts execution on the device.
CL_PROFILING_COMMAND_END, when the command identified by event has finished execution on the device.
Take into account that these timers are device-side. So, to measure kernel and command queue performance, you can query these timers. In your case, the last 2 timers are needed.
In your sample code, you're measuring Host time, which includes data transfer time (as Skizz said) plus all time wasted on Command Queue maintenance.
So, to learn actual kernel performance, you need either to pass cl_event to your kernel (no idea how to do it in boost::compute) & query that event for performance counters or make your kernel really huge & complicated to hide all overheads.
I'm trying to measure the time difference between 2 signals on the parallel port, but first I need to know how accurate and precise my measuring system is (AMD Athlon(tm) 64 X2 Dual Core Processor 5200+ × 2, on SUSE 12.1 x64).
So after some reading I decided to use clock_gettime(). First I got the clock_getres() value using this code:
/*
* This program prints out the clock resolution.
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main( void )
{
struct timespec res;
if ( clock_getres( CLOCK_REALTIME, &res) == -1 ) {
perror( "clock get resolution" );
return EXIT_FAILURE;
}
printf( "Resolution is %ld nano seconds.\n",
res.tv_nsec);
return EXIT_SUCCESS;
}
and the output was: 1 nanosecond. And I was so happy!!
But here is my problem: when I tried to check that fact with this other code:
#include <iostream>
#include <time.h>
using namespace std;
timespec diff(timespec start, timespec end);
int main()
{
timespec time1, time2, time3,time4;
int temp;
time3.tv_sec=0;
time4.tv_nsec=000000001L;
clock_gettime(CLOCK_REALTIME, &time1);
NULL;
clock_gettime(CLOCK_REALTIME, &time2);
cout<<diff(time1,time2).tv_sec<<":"<<diff(time1,time2).tv_nsec<<endl;
return 0;
}
timespec diff(timespec start, timespec end)
{
timespec temp;
if ((end.tv_nsec-start.tv_nsec)<0) {
temp.tv_sec = end.tv_sec-start.tv_sec-1;
temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
} else {
temp.tv_sec = end.tv_sec-start.tv_sec;
temp.tv_nsec = end.tv_nsec-start.tv_nsec;
}
return temp;
}
This one calculates the time between the two calls to clock_gettime; time3 and time4 are declared but not used in this example because I was doing tests with them.
The output in this example fluctuates between 978 and 1467 ns. Both numbers are multiples of 489, which makes me think that 489 ns is my REAL resolution, far from the 1 ns obtained above.
My question: is there ANY WAY of getting better results? Am I missing something?
I really need at least 10 ns resolution for my project. Come on! A GPS can get better resolution than a PC??
I realise this topic is long dead, but wanted to throw in my findings. This is a long answer so I have put the short answer here and those with the patience can wade through the rest. The not-quite-the-answer to the question is 700 ns or 1500 ns depending on which mode of clock_gettime() you used. The long answer is way more complicated.
For reference, the machine I did this work on is an old laptop that nobody wanted. It is an Acer Aspire 5720Z running Ubuntu 14.04.1 LTS.
The hardware:
RAM: 2.0 GiB // This is how Ubuntu reports it in 'System Settings' → 'Details'
Processor: Intel® Pentium(R) Dual CPU T2330 @ 1.60GHz × 2
Graphics: Intel® 965GM x86/MMX/SSE2
I wanted to measure time accurately in an upcoming project and, as a relative newcomer to PC hardware regardless of operating system, I thought I would do some experimentation on the resolution of the timing hardware. I stumbled across this question.
Because of this question, I decided that clock_gettime() looks like it meets my needs. But my experience with PC hardware in the past has left me under-whelmed so I started fresh with some experiments to see what the actual resolution of the timer is.
The method: Collect successive samples of the result from clock_gettime() and look for any patterns in the resolution. Code follows.
Results in a slightly longer Summary:
1. Not really a result. The stated resolution of the fields in the structure is in nanoseconds. The result of a call to clock_getres() is also tv_sec 0, tv_nsec 1. But previous experience has taught to not trust the resolution from a structure alone. It is an upper limit on precision and reality tends to be a whole lot more complex.
2. The actual resolution of the clock_gettime() result on my machine, with my program, with my operating system, on one particular day etc turns out to be 70 nanoseconds for mode 0 and 1. 70 ns is not too bad but unfortunately, this is not realistic as we will see in the next point. To complicate matters, the resolution appears to be 7 ns when using modes 2 and 3.
3. Duration of the clock_gettime() call is more like 1500 ns for modes 0 and 1. It doesn't make sense to me at all to claim 70 ns resolution on the time if it takes 20 times the resolution to get a value.
4. Some modes of clock_gettime() are faster than others. Modes 2 and 3 are clearly about half the wall-clock time of modes 0 and 1. Modes 0 and 1 are statistically indistinguishable from each other. Modes 2 and 3 are much faster than modes 0 and 1, with mode 3 being the fastest overall.
Before continuing, I had better define the modes. Which mode is which?
Mode 0 CLOCK_REALTIME // reference: http://linux.die.net/man/3/clock_gettime
Mode 1 CLOCK_MONOTONIC
Mode 2 CLOCK_PROCESS_CPUTIME_ID
Mode 3 CLOCK_THREAD_CPUTIME_ID
Conclusion: To me it doesn't make sense to talk about the resolution of the time intervals if the resolution is smaller than the length of time the function takes to get the time interval. For example, if we use mode 3, we know that the function completes within 700 nanoseconds 99% of the time. And we further know that the time interval we get back will be a multiple of 7 nanoseconds. So the 'resolution' of 7 nanoseconds, is 1/100th of the time to do the call to get the time. I don't see any value in the 7 nanosecond change interval. There are 3 different answers to the question of resolution: 1 ns, 7 or 70 ns, and finally 700 or 1500 ns. I favour the last figure.
After all is said and done, if you want to measure the performance of some operation, you need to keep in mind how long the clock_gettime() call takes, that is, 700 or 1500 ns. There is no point trying to measure something that takes 7 nanoseconds, for example. For the sake of argument, let's say you were willing to live with 1% error on your performance test conclusions. If using mode 3 (which I think I will be using in my project), the interval you are measuring needs to be at least 100 times 700 nanoseconds, or 70 microseconds. Otherwise your conclusions will have more than 1% error. So go ahead and measure your code of interest, but if the elapsed time in the code of interest is less than 70 microseconds, then you had better loop through the code of interest enough times that the interval is 70 microseconds or more.
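One way to sketch that looping in code, assuming Linux and mode 3 (CLOCK_THREAD_CPUTIME_ID). The names seconds_between and seconds_per_call are made up, and doubling the repetition count until the timed batch is long enough is just one possible strategy:

```cpp
#include <functional>
#include <time.h>

// Seconds between two timespec readings.
double seconds_between(const timespec& a, const timespec& b)
{
    return static_cast<double>(b.tv_sec - a.tv_sec)
         + static_cast<double>(b.tv_nsec - a.tv_nsec) / 1e9;
}

// Time `work` in batches, doubling the batch size until one timed batch
// lasts at least min_seconds, then return the average seconds per call.
double seconds_per_call(const std::function<void()>& work, double min_seconds)
{
    for (long reps = 1; ; reps *= 2) {
        timespec start, stop;
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
        for (long i = 0; i < reps; ++i) {
            work();
        }
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &stop);
        const double total = seconds_between(start, stop);
        // The cap on reps is only a safety net against a clock that never advances.
        if (total >= min_seconds || reps >= (1L << 30)) {
            return total / reps;
        }
    }
}
```

With the figures above, min_seconds would be 70e-6, so that the roughly 700 ns spent in the clock call is about 1% of the measured interval.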
Justification for these claims and some details:
Claim 3 first. This is simple enough. Just run clock_gettime() a large number of times and record the results in an array, then process the results. Do the processing outside the loop so that the time between clock_gettime() calls is as short as possible.
What does all that mean? See the graph attached. For mode 0 for example, the call to clock_gettime() takes less than 1.5 microseconds most of the time. You can see that mode 0 and mode 1 are basically the same. However, modes 2 and 3 are very different to modes 0 and 1, and slightly different to each other. Modes 2 and 3 take about half the wall-clock time for clock_gettime() compared to modes 0 and 1. Also note that mode 0 and 1 are slightly different to each other – unlike modes 2 and 3. Note that mode 0 and 1 differ by 70 nanoseconds – which is a number which we will come back to in claim #2.
The attached graph is range-limited to 2 microseconds. Otherwise the outliers in the data prevent the graph from conveying the previous point. Something the graph doesn't make clear, then, is that the outliers for modes 0 and 1 are much worse than the outliers for modes 2 and 3. In other words, not only are the average, the statistical 'mode' (the value which occurs the most) and the median (i.e. the 50th percentile) different for all these modes, so are their maximum values and their 99th percentiles.
The graph attached is for 100,001 samples for each of the four modes. Please note that the tests graphed were using a CPU mask of processor 0 only. Whether I used CPU affinity or not didn't seem to make any difference to the graph.
Claim 2: If you look closely at the samples collected when preparing the graph, you soon notice that the difference between the differences (i.e. the 2nd order differences) is relatively constant, at around 70 nanoseconds (for Modes 0 and 1 at least). To repeat this experiment, collect 'n' samples of clock time as before. Then calculate the differences between each sample. Now sort the differences into order (e.g. sort -g) and then derive the individual unique differences (e.g. uniq -c).
For example:
$ ./Exp03 -l 1001 -m 0 -k | sort -g | awk -f mergeTime2.awk | awk -f percentages.awk | sort -g
1.118e-06 8 8 0.8 0.8 // time,count,cumulative count, count%, cumulative count%
1.188e-06 17 25 1.7 2.5
1.257e-06 9 34 0.9 3.4
1.327e-06 570 604 57 60.4
1.397e-06 301 905 30.1 90.5
1.467e-06 53 958 5.3 95.8
1.537e-06 26 984 2.6 98.4
<snip>
The difference between the durations in the first column is often 7e-8 or 70 nanoseconds. This can become more clear by processing the differences:
$ <as above> | awk -f differences.awk
7e-08
6.9e-08
7e-08
7e-08
7e-08
7e-08
6.9e-08
7e-08
2.1e-07 // 3 lots of 7e-08
<snip>
Notice how all the differences are integer multiples of 70 nanoseconds? Or at least within rounding error of 70 nanoseconds.
This result may well be hardware dependent but I don't actually know what limits this to 70 nanoseconds at this time. Perhaps there is a 14.28 MHz oscillator somewhere?
Please note that in practise I use a much larger number of samples such as 100,000, not 1000 as above.
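For anyone who would rather do the sort and uniq -c step inside the program, here is a rough C++ sketch of the same experiment. The function name histogram_of_differences is hypothetical, and it works in integer nanoseconds throughout, so no floating point is involved:

```cpp
#include <cstdint>
#include <map>
#include <time.h>
#include <vector>

// Take `samples` back-to-back clock readings and count how often each
// difference between consecutive readings (in whole nanoseconds) occurs.
// Assumes samples is at least 1.
std::map<std::int64_t, int> histogram_of_differences(clockid_t clock_id, int samples)
{
    std::vector<timespec> t(samples);
    for (int i = 0; i < samples; ++i) {
        clock_gettime(clock_id, &t[i]);   // the sampling loop does nothing else
    }
    std::map<std::int64_t, int> counts;   // key: difference in ns, value: occurrences
    for (int i = 0; i + 1 < samples; ++i) {
        const std::int64_t ns =
            static_cast<std::int64_t>(t[i + 1].tv_sec - t[i].tv_sec) * 1000000000LL
            + (t[i + 1].tv_nsec - t[i].tv_nsec);
        ++counts[ns];
    }
    return counts;
}
```

Iterating over the returned map visits the differences in ascending order, so the gaps between neighbouring keys are the 2nd order differences discussed above.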
Relevant code (attached):
'Expo03' is the program which calls clock_gettime() as fast as possible. Note that typical usage would be something like:
./Expo03 -l 100001 -m 3
This would call clock_gettime() 100,001 times so that we can compute 100,000 differences. Each call to clock_gettime() in this example would be using mode 3.
MergeTime2.awk is a useful command which is a glorified 'uniq' command. The issue is that the 2nd order differences are often in pairs of 69 and 1 nanosecond, not 70 (for Mode 0 and 1 at least) as I have led you to believe so far. Because there is no 68 nanosecond difference or a 2 nanosecond difference, I have merged these 69 and 1 nanosecond pairs into one number of 70 nanoseconds. Why the 69/1 behaviour occurs at all is interesting, but treating these as two separate numbers mostly added 'noise' to the analysis.
Before you ask, I have repeated this exercise avoiding floating point, and the same problem still occurs. The resulting tv_nsec as an integer has this 69/1 behaviour (or 1/7 and 1/6) so please don't assume that this is an artefact caused by floating point subtraction.
Please note that I am confident with this 'simplification' for 70 ns and for small integer multiples of 70 ns, but this approach looks less robust for the 7 ns case especially when you get 2nd order differences of 10 times the 7 ns resolution.
percentages.awk and differences.awk attached in case.
Stop press: I can't post the graph as I don't have a 'reputation of at least 10'. Sorry 'bout that.
Rob Watson
21 Nov 2014
Expo03.cpp
/* Like Exp02.cpp except that here I am experimenting with
modes other than CLOCK_REALTIME
RW 20 Nov 2014
*/
/* Added CPU affinity to see if that had any bearing on the results
RW 21 Nov 2014
*/
#include <iostream>
using namespace std;
#include <iomanip>
#include <stdlib.h> // getopts needs both of these
#include <unistd.h>
#include <errno.h> // errno
#include <string.h> // strerror()
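#include <assert.h>
#include <sched.h> // cpu_set_t, sched_setaffinity(), sched_getaffinity()
#include <time.h> // clock_gettime(), clock_getcpuclockid()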
// #define MODE CLOCK_REALTIME
// #define MODE CLOCK_MONOTONIC
// #define MODE CLOCK_PROCESS_CPUTIME_ID
// #define MODE CLOCK_THREAD_CPUTIME_ID
int main(int argc, char ** argv)
{
int NumberOf = 1000;
int Mode = 0;
int Verbose = 0;
int c;
// l loops, m mode, h help, v verbose, k masK
int rc;
cpu_set_t mask;
int doMaskOperation = 0;
while ((c = getopt (argc, argv, "l:m:hkv")) != -1)
{
switch (c)
{
case 'l': // ell not one
NumberOf = atoi(optarg);
break;
case 'm':
Mode = atoi(optarg);
break;
case 'h':
cout << "Usage: <command> -l <int> -m <mode>" << endl
<< "where -l represents the number of loops and "
<< "-m represents the mode 0..3 inclusive" << endl
<< "0 is CLOCK_REALTIME" << endl
<< "1 CLOCK_MONOTONIC" << endl
<< "2 CLOCK_PROCESS_CPUTIME_ID" << endl
<< "3 CLOCK_THREAD_CPUTIME_ID" << endl;
break;
case 'v':
Verbose = 1;
break;
case 'k': // masK - sorry! Already using 'm'...
doMaskOperation = 1;
break;
case '?':
cerr << "XXX unimplemented! Sorry..." << endl;
break;
default:
abort();
}
}
if (doMaskOperation)
{
if (Verbose)
{
cout << "Setting CPU mask to CPU 0 only!" << endl;
}
CPU_ZERO(&mask);
CPU_SET(0,&mask);
rc = sched_setaffinity(0,sizeof(mask),&mask);
assert(rc == 0); // call kept outside assert so it still runs when NDEBUG is defined
}
if (Verbose) {
cout << "Verbose: Mode in use: " << Mode << endl;
}
if (Verbose)
{
rc = sched_getaffinity(0,sizeof(mask),&mask);
// cout << "getaffinity rc is " << rc << endl;
// cout << "getaffinity mask is " << mask << endl;
int numOfCPUs = CPU_COUNT(&mask);
cout << "Number of CPU's is " << numOfCPUs << endl;
for (int i=0;i<sizeof(mask);++i) // sizeof(mask) is 128 RW 21 Nov 2014
{
if (CPU_ISSET(i,&mask))
{
cout << "CPU " << i << " is set" << endl;
}
//cout << "CPU " << i
// << " is " << (CPU_ISSET(i,&mask) ? "set " : "not set ") << endl;
}
}
clockid_t cpuClockID;
int err = clock_getcpuclockid(0,&cpuClockID);
if (Verbose)
{
cout << "Verbose: clock_getcpuclockid(0) returned err " << err << endl;
cout << "Verbose: clock_getcpuclockid(0) returned cpuClockID "
<< cpuClockID << endl;
}
timespec timeNumber[NumberOf];
for (int i=0;i<NumberOf;++i)
{
err = clock_gettime(Mode, &timeNumber[i]);
if (err != 0) {
int errSave = errno;
cerr << "errno is " << errSave
<< " NumberOf is " << NumberOf << endl;
cerr << strerror(errSave) << endl;
cerr << "Aborting due to this error" << endl;
abort();
}
}
for (int i=0;i<NumberOf-1;++i)
{
cout << timeNumber[i+1].tv_sec - timeNumber[i].tv_sec
+ (timeNumber[i+1].tv_nsec - timeNumber[i].tv_nsec) / 1000000000.
<< endl;
}
return 0;
}
MergeTime2.awk
BEGIN {
PROCINFO["sorted_in"] = "@ind_num_asc"
}
{array[$0]++}
END {
lastX = -1;
first = 1;
for (x in array)
{
if (first) {
first = 0
lastX = x; lastCount = array[x];
} else {
delta = x - lastX;
if (delta < 2e-9) { # this is nasty floating point stuff!!
lastCount += array[x];
lastX = x
} else {
Cumulative += lastCount;
print lastX "\t" lastCount "\t" Cumulative
lastX = x;
lastCount = array[x];
}
}
}
print lastX "\t" lastCount "\t" Cumulative+lastCount
}
percentages.awk
{ # input is $1 a time interval $2 an observed frequency (i.e. count)
# $3 is a cumulative frequency
b[$1]=$2;
c[$1]=$3;
sum=sum+$2
}
END {
for (i in b) print i,b[i],c[i],(b[i]/sum)*100, (c[i]*100/sum);
}
differences.awk
NR==1 {
old=$1;next
}
{
print $1-old;
old=$1
}
As far as I know, Linux running on a PC will generally not be able to give you timer accuracy in the nanoseconds range. This is mainly due to the type of task/process scheduler used in the kernel. This is as much a result of the kernel as it is of the hardware.
If you need timing with nanosecond resolution I'm afraid that you're out of luck. However you should be able to get micro-second resolution which should be good enough for most scenarios - including your parallel port application.
If you need timing in the nano-seconds range to be accurate to the nano-second you will need a dedicated hardware solution most likely; with a really accurate oscillator (for comparison, the base clock frequency of most x86 CPUs is in the range of mega-hertz before the multipliers)
Finally, if you're looking to replace the functionality of an oscilloscope with your computer that's just not going to work beyond relatively low frequency signals. You'd be much better off investing in a scope - even a simple, portable, hand-held that plugs into your computer for displaying the data.
RDTSCP on your AMD Athlon 64 X2 will give you the time stamp counter, with a resolution that depends on your clock. However, accuracy is different from resolution: you need to lock thread affinity and disable interrupts (see IRQ routing).
This entails dropping down to assembler or, for Windows developers, using MSVC 2008 intrinsics.
RedHat with RHEL5 introduced user-space shims that replace gettimeofday with high resolution RDTSCP calls:
http://developer.amd.com/Resources/documentation/articles/Pages/1214200692_5.aspx
https://web.archive.org/web/20160812215344/https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html
Also, check your hardware: an AMD 5200 has a 2.6 GHz clock, which gives an interval of about 0.4 ns, and the cost of gettimeofday with RDTSCP is 221 cycles, which equals about 88 ns at best.
I'm having trouble getting anything useful from the clock() method in the ctime library in particular situations on my Mac. Specifically, if I'm trying to run VS2010 in Windows 7 under either VMWare Fusion or on Boot Camp, it always seems to return the same value. Some test code to test the issue:
#include <time.h>
#include <iostream>
using namespace std;
// Calculate the factorial of n recursively.
unsigned long long recursiveFactorial(int n) {
// Define the base case.
if (n == 1) {
return n;
}
// To handle other cases, call self recursively.
else {
return (n * recursiveFactorial(n - 1));
}
}
int main() {
int n = 60;
unsigned long long result;
clock_t start, stop;
// Mark the start time.
start = clock();
// Calculate the factorial of n;
result = recursiveFactorial(n);
// Mark the end time.
stop = clock();
// Output the result of the factorial and the elapsed time.
cout << "The factorial of " << n << " is " << result << endl;
cout << "The calculation took " << ((double) (stop - start) / CLOCKS_PER_SEC) << " seconds." << endl;
return 0;
}
Under Xcode 4.3.3, the function executes in about 2 μs.
Under Visual Studio 2010 in a Windows 7 virtual machine (under VMWare Fusion 4.1.3), the same code gives an execution time of 0; this machine is given 2 of the Mac’s 4 cores and 2GB RAM.
Under Boot Camp running Windows 7, again I get an execution time of 0.
Is this a question of being "too far from the metal"?
It could be that the resolution of the timer is not as high under the virtual machine. The compiler can easily convert the tail recursion into a loop; 60 multiplications don't tend to take a terribly long time. Try computing something significantly more costly, like Fibonacci numbers (recursively, of course) and you should see the timer go on.
From time.h included with MSVC,
#define CLOCKS_PER_SEC 1000
which means clock() only has a resolution of 1 millisecond when using the Visual C++ runtime libraries, so any set of operations that takes less than that will almost always be measured as having zero time elapsed.
For higher resolution timing on Windows that can help you, check out QueryPerformanceCounter
and this sample code.
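Separately from QueryPerformanceCounter, here is a portable sketch of the same idea, assuming a C++11 compiler with <chrono> (which, as far as I know, means something newer than VS2010). It repeats the calculation many times so that the total is well above the timer resolution, then divides. The names recursive_factorial and average_factorial_ns are made up for illustration:

```cpp
#include <chrono>

// Recursive factorial as in the question, with n <= 1 as the base case.
unsigned long long recursive_factorial(int n)
{
    return n <= 1 ? 1ULL : n * recursive_factorial(n - 1);
}

// Average time of one factorial call in nanoseconds, over `repeats` calls.
double average_factorial_ns(int n, int repeats)
{
    volatile int input = n;                  // re-read on every iteration
    volatile unsigned long long sink = 0;    // gives every result a visible use
    const auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        sink = recursive_factorial(input);
    }
    const auto end = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(end - begin).count() / repeats;
}
```

Something like average_factorial_ns(20, 1000000) gives a per-call figure even when a single call is far below the 1 ms resolution of clock(); 20 is used here because 20! still fits in an unsigned 64-bit integer, whereas 60! does not.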