Do I need to prevent preemption while measuring performance - c++

I want to measure the performance of a block of code using QueryPerformanceCounter on Windows. What I would like to know is whether, between different runs, I can do anything to get equal measurements for the same data (I want to measure the performance of different sorting algorithms on arrays of different sizes containing PODs or custom objects). I know that the current process can be interrupted by interrupts or I/O operations. I'm not doing any I/O, so only interrupts may affect my measurement. I also assume the kernel only grants my process a limited time slice, so the scheduler will move my process off the CPU as well.
How do people make accurate measurements when timing the execution of a specific piece of code?

Time measurements are tricky because you need to find out why your algorithm is slower. That depends on the input data (e.g. presorted data; see Why is it faster to process a sorted array than an unsorted array?) and on the data set size (whether it fits into the L1, L2, or L3 cache; see http://igoro.com/archive/gallery-of-processor-cache-effects/).
Both can hugely influence your measured times.
The order of measurements can also play a critical role. If you execute the sort algorithms in a loop and each of them allocates some memory, the first test will most likely lose. Not because the algorithm is inferior, but because the first time you access newly allocated memory it is soft-faulted into your process working set. After the memory is freed, the heap allocator returns pooled memory, which has entirely different access performance. That becomes very noticeable if you sort larger (many MB) arrays.
Below are the times to touch a 2 GB array with different numbers of threads, for the first and the second touch. Each page (4 KB) of memory is touched only once per pass.
Threads  Size_MB  Time_ms  us/Page  MB/s  Scenario
1        2000     355      0.693    5634  Touch 1
1        2000     11       0.021    N.a.  Touch 2
2        2000     276      0.539    7246  Touch 1
2        2000     12       0.023    N.a.  Touch 2
3        2000     274      0.535    7299  Touch 1
3        2000     13       0.025    N.a.  Touch 2
4        2000     288      0.563    6944  Touch 1
4        2000     11       0.021    N.a.  Touch 2
// From the compiler's point of view, Touch is a nop operation with no observable side effect.
// That is true as far as the data contents go, but performance-wise there is a huge
// difference. Turn optimizations off to prevent the compiler from outsmarting us.
#pragma optimize( "", off )
void Program::Touch(void *p, size_t N)
{
    char *pB = (char *)p;
    char tmp;
    for (size_t i = 0; i < N; i += 4096)
    {
        tmp = pB[i];   // read one byte per 4 KB page to fault it into the working set
    }
}
#pragma optimize("", on)
To truly judge the performance of an algorithm it is not sufficient to perform time measurements alone; you also need a profiler (e.g. the free Windows Performance Toolkit, or Intel VTune, which is not free) to ensure that you have measured the right thing and not something entirely different.

I just went to a conference talk by Andrei Alexandrescu on Fastware, and he was addressing this exact issue: how to measure speed. Apparently taking the mean is a bad idea, but measuring many times is a great idea. So with that in mind you measure a million times and remember the smallest measurement, because that is the one with the least amount of noise.
Means are awful because you're adding the noise's weight to the actual speed you're measuring. (These are not the only things you should consider when evaluating code speed, but they are a good start; there's even more horrid stuff regarding where the code will execute, and the overhead of starting execution on one core and finishing on another, but that's a different story and I don't think it applies to my sort.)
A good joke was: if you put Bill Gates into a bus, on average everybody in that bus is a millionaire :))
Cheers and thanks to all who provided input.
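A minimal sketch of that measure-many-times-and-keep-the-minimum idea, using the QueryPerformanceCounter API from the question; sort_under_test is a hypothetical stand-in for the code being measured:

#include <windows.h>
#include <algorithm>

// Hypothetical placeholder for the code under test.
void sort_under_test() { /* the sorting code being measured goes here */ }

double measure_min_ms(int runs = 1000)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    double best = 1e300;                 // keep the smallest (least noisy) sample
    for (int i = 0; i < runs; ++i)
    {
        LARGE_INTEGER t0, t1;
        QueryPerformanceCounter(&t0);
        sort_under_test();
        QueryPerformanceCounter(&t1);

        double ms = (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart;
        best = std::min(best, ms);
    }
    return best;
}

Keep in mind that a sort mutates its input, so regenerate or re-shuffle the data between runs; otherwise later runs time the already-sorted case.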

Related

How to test the problem size scaling performance of code

I'm running a simple kernel which adds two streams of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container contains different indices for different threads.
for (const auto& index : slice_indices)
{
    auto* tens1_data_stream = tens1.get_slice_data(index);
    const auto* tens2_data_stream = tens2.get_slice_data(index);
    #pragma omp simd safelen(8)
    for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
    {
        tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
        tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
    }
}
The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70 GHz with 24 cores, 32 KB L1 cache, 1 MB L2 cache and 33 MB L3 cache. The total memory bandwidth is 115 GB/s.
The following is how my code scales with problem size S = N x N x N.
Can anybody tell me with the information I've provided if:
it's scaling well, and/or
how I could go about finding out if it's utilizing all the resources which are available to it?
Thanks in advance.
EDIT:
Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes, the same processor). It looks like this:
And now the strong and weak scaling plots:
Note: I've measured the bandwidth and it turns out to be 105 GB/s.
Question: the meaning of the weird peak at 6 threads/problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody clarify this?
Your graph has roughly the right shape: tiny arrays should fit in the L1 cache, and therefore get very high performance. Arrays of a megabyte or so fit in L2 and get lower performance, beyond that you should stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (btw, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.
Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.
Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.
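A hedged sketch of such a sweep, assuming the slice loop above is wrapped in a hypothetical run_kernel() function and that OpenMP is available:

#include <omp.h>
#include <cstdio>

// Hypothetical wrapper around the slice-addition loop from the question.
void run_kernel() { /* the OpenMP slice loop goes here */ }

int main()
{
    const int max_threads = omp_get_max_threads();
    for (int t = 1; t <= max_threads; ++t)
    {
        omp_set_num_threads(t);   // fixed problem size, varying core count
        run_kernel();             // warm-up pass: pages faulted in, caches filled

        double t0 = omp_get_wtime();
        run_kernel();
        double t1 = omp_get_wtime();

        std::printf("%d threads: %.3f ms\n", t, (t1 - t0) * 1e3);
    }
}

Performance should rise roughly linearly with the thread count until memory bandwidth is saturated, after which the curve flattens.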
A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance, until -- as I indicated -- you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for cache effects as I indicated.
So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200Gb/sec is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.
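As a back-of-the-envelope check (my numbers, not the answer's): the inner loop performs 2 floating-point additions per complex element while moving roughly 48 bytes (16-byte reads of both operands plus a 16-byte write-back, ignoring write-allocate traffic), so at the quoted 115 GB/s you would expect at most about 115/48 * 2 ≈ 4.8 GFLOP/s once the data no longer fits in any cache level.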

Time measurement repeatedly makes mistake in specific places

I need to write a program which measures the performance of certain data structures. But I can't get reliable results. For example, when I measured performance 8 times for the same size of structure, every other result was different (for example: 15 ms, 9 ms, 15 ms, 9 ms, 15 ms, ...), although the measurements weren't dependent on each other (for every measurement I generated new data). I tried to extract the problem and here is what I have:
while (true) {
    auto start = high_resolution_clock::now();
    for (int j = 0; j < 500; j++)
        ;
    auto end = high_resolution_clock::now();
    cout << duration<double, milli>(end - start).count() << " ";
    _getch();
}
What happens when I run this code: in the first iteration of the loop the time is significantly higher than in the following ones. It is always higher in the first run, but from time to time also in other measurements.
Example output: 0.006842 0.002566 0.002566 0.002138 0.002993 0.002138 0.002139 ...
And that's the behaviour every time I start the program.
Here are some things I tried:
Compiling a Release vs. a Debug build makes a difference: measurements are still faulty, but in different places.
I turned off code optimization.
I tried using different clocks.
And what I think is quite important: while my Add function wasn't empty, the problem depended on the data size. For example, the program worked well for most data sizes, but for an element count of, say, 7500 the measurements were drastically different.
I just deleted the part of the code after the segment I posted here, and guess what: the first measurement is no longer faulty. I have no idea what's happening here.
I would be glad if someone explained to me what the possible cause of all of this could be.
In that code, it's likely that you're just seeing the effect of the instruction cache or the micro-op cache. The first time the test is run, more instructions have to be fetched and decoded; on subsequent runs the results of that are available in the caches. As for the alternating times you were seeing on some other code, that could be fluctuations in the branch prediction buffers, or something else entirely.
There are too many complex processes involved in execution on modern CPUs to expect a normal sequence of instructions to execute in a fixed amount of time. While it's possible to measure or at least account for these externalities when looking at individual instructions, for nontrivial code you basically have to accept empirical measurements including their variance.
Depending on what kind of operating system you're on, for durations this short, the scheduler can cause huge differences. If your thread is preempted, then you have the idle duration in your time. There are also many things that happen that you don't see: caches, pages, allocation. Modern systems are complex.
You're better off making the whole benchmark bigger, and then doing multiple runs on each thing you're testing, and then using something like ministat from FreeBSD to compare the runs of the same test, and then compare the ministat for the different things you're comparing.
To do this effectively, your benchmark should try to use the same amount of memory as the real workload, so that your memory access is a part of the benchmark.
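A rough sketch of that workflow, assuming a hypothetical run_benchmark() that exercises the structure under test on realistically sized data; it writes one duration sample per line so the files can be fed to a tool such as ministat:

#include <chrono>
#include <fstream>

// Hypothetical: one full benchmark pass over realistically sized data.
void run_benchmark() { /* the code under test goes here */ }

int main()
{
    std::ofstream out("variant_a.txt");   // one sample per line
    for (int run = 0; run < 30; ++run)
    {
        auto t0 = std::chrono::steady_clock::now();
        run_benchmark();
        auto t1 = std::chrono::steady_clock::now();
        out << std::chrono::duration<double, std::milli>(t1 - t0).count() << "\n";
    }
}

Running something like ministat variant_a.txt variant_b.txt then tells you whether the difference between two variants is statistically significant.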

How to set number of threads in C++

I have written the following multi-threaded program for multi-threaded sorting using std::sort. In my program grainSize is a parameter. Since grainSize, or the number of threads which can be spawned, is a system-dependent feature, I am not sure what the optimal value for grainSize should be. I work on Linux.
int compare(const char*, const char*)
{
    //some complex user defined logic
}

void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
    if(len < grainsize)
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        auto future = std::async(multThreadedSort, data, len/2, grainsize);
        multThreadedSort(data + len/2, len/2, grainsize); // No need to spawn another thread just to block the calling thread which would do nothing.
        future.wait();
        std::inplace_merge(data, data + len/2, data + len, compare);
    }
}

int main(int argc, char** argv) {
    vector<unsigned> items;
    int grainSize = 10;
    multThreadedSort(items.begin(), items.size(), grainSize);
    std::sort(items.begin(), items.end(), CompareSorter(compare));
    return 0;
}
I need to perform multi-threaded sorting. So, that for sorting large vectors I can take advantage of multiple cores present in today's processor. If anyone is aware of an efficient algorithm then please do share.
I don't know why the data returned by multThreadedSort() is not sorted. If you see some logical error in it, please let me know.
This gives you the optimal number of threads (such as the number of cores):
unsigned int nThreads = std::thread::hardware_concurrency();
As you wrote it, your effective thread count is not equal to grainSize: it will depend on the list size, and can potentially be much larger than grainSize.
Just replace grainSize with:
unsigned int grainSize = std::max<std::size_t>(items.size() / nThreads, 40);
The 40 is arbitrary but is there to avoid starting threads to sort too few items, which would be suboptimal (the time spent starting the thread would be larger than the time spent sorting the few items). It may be tuned by trial and error, and the optimum is probably larger than 40.
You have at least a bug there:
multThreadedSort(data + len/2, len/2, grainsize);
If len is odd (for instance 9), you do not include the last item in the sort. Replace by:
multThreadedSort(data + len/2, len-(len/2), grainsize);
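Putting those pieces together, a possible corrected sketch (the comparator here is a placeholder, since the posted compare(const char*, const char*) does not match vector<unsigned> anyway; whether this beats a plain std::sort still has to be measured, as the next answer stresses):

#include <algorithm>
#include <future>
#include <thread>
#include <vector>

// Placeholder comparator for unsigned elements.
bool compareUnsigned(unsigned a, unsigned b) { return a < b; }

void multThreadedSort(std::vector<unsigned>::iterator data, std::size_t len, std::size_t grainSize)
{
    if (len <= grainSize)
    {
        std::sort(data, data + len, compareUnsigned);
    }
    else
    {
        auto future = std::async(std::launch::async,
                                 multThreadedSort, data, len / 2, grainSize);
        multThreadedSort(data + len / 2, len - len / 2, grainSize); // handles odd len
        future.wait();
        std::inplace_merge(data, data + len / 2, data + len, compareUnsigned);
    }
}

int main()
{
    std::vector<unsigned> items = { /* fill with data */ };

    unsigned nThreads = std::thread::hardware_concurrency();
    if (nThreads == 0) nThreads = 1;   // the call is allowed to return 0
    std::size_t grainSize = std::max<std::size_t>(items.size() / nThreads, 40);

    multThreadedSort(items.begin(), items.size(), grainSize);
    return 0;
}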
Unless you use a compiler with a totally broken implementation (broken is the wrong word, a better match would be... shitty), several invocations of std::future should already do the job for you, without having to worry.
Note that a std::future (as obtained from std::async) represents something that conceptually runs asynchronously, i.e. the implementation may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a threadpool where the number of workers is set to something reasonable according to the system the code runs on.
Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is a genuine one or a bullshit value.
There also is no way of discriminating hyperthreaded cores, or any such thing as NUMA awareness, or anything the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
On a more general note
The problem "What is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). A couple of things to consider:
Work groups of 10 are certainly way, way too small. Spawning a thread is an immensely expensive thing (yes, contrary to popular belief that's true for Linux, too) and switching or synchronizing threads is expensive as well. Try "tens of thousands", not "tens".
Hyperthreaded cores only execute while the other core in the same group is stalled, most commonly on memory I/O (or, when spinning, by the explicit execution of an instruction such as e.g. REP-NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches, but will not run any faster. For something like sorting (which is all about accessing memory!), you're probably good to go as far as that one goes.
Memory bandwidth is usually saturated by one, sometimes 2 cores, rarely more (depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth but will heighten pressure on shared cache levels (such as L3 if present, and often L2 as well) and the system page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be the case. May, but needs not be.
Due to the above, for the general case "number of real cores" or "number of real cores + 1" is often a much better recommendation.
Accessing huge amounts of data with poor locality like with your approach will (single-threaded or multi-threaded) result in a lot of cache/TLB misses and possibly even page faults. That may not only undo any gains from thread parallelism, but it may indeed execute 4-5 orders of magnitude slower. Just think about what a page fault costs you. During a single page fault, you could have sorted a million elements.
Contrary to the above "real cores plus 1" general rule, for tasks that involve network or disk I/O which may block for a long time, even "twice the number of cores" may as well be the best match. So... there is really no single "correct" rule.
What's the conclusion of the somewhat self-contradicting points above? After you've implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there's no way of knowing with certitude what's best without having measured.
As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge so you seem to be aware that it's not just "split subranges and sort them".
But think about it, what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging -- which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you'll have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location between 20 and 30 times, possibly more often. That's a lot of traffic.
That is a lot of memory being read and written repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academics it probably is... but it isn't certain at all that this runs any faster on a real machine (it might quite possibly be many times slower).
Ideally, you should not assume more than n*2 threads running in your system, where n is the number of CPU cores.
Modern CPUs use hyper-threading, so one physical core can run 2 threads at a time.
As mentioned in another answer, in C++11 you can query the number of hardware threads using std::thread::hardware_concurrency();

How to demonstrate the impact of instruction cache limitations

My original idea was to give an elegant code example that would demonstrate the impact of instruction cache limitations. I wrote the following piece of code, which creates a large number of identical functions using template metaprogramming.
volatile int checksum;

void (*funcs[MAX_FUNCS])(void);

template <unsigned t>
__attribute__ ((noinline)) static void work(void) { ++checksum; }

template <unsigned t>
static void create(void) { funcs[t - 1] = &work<t - 1>; create<t - 1>(); }

template <> void create<0>(void) { }

int main()
{
    create<MAX_FUNCS>();
    for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
    {
        checksum = 0;
        for (unsigned i = 0; i < WORKLOAD; ++i)
        {
            funcs[i % range]();
        }
    }
    return 0;
}
The outer loop varies the amount of different functions to be called using a jump table. For each loop pass, the time taken to invoke WORKLOAD functions is then measured. Now what are the results? The following chart shows the average run time per function call in relation to the used range. The blue line shows the data measured on a Core i7 machine. The comparative measurement, depicted by the red line, was carried out on a Pentium 4 machine. Yet when it comes to interpreting these lines, I seem to be somehow struggling...
The only jumps of the piecewise constant red curve occur exactly where the total memory consumption of all functions within range exceeds the capacity of one cache level on the tested machine, which has no dedicated instruction cache. For very small ranges (below 4 in this case), however, run time still increases with the number of functions. This may be related to branch prediction efficiency, but since every function call reduces to an unconditional jump in this case, I'm not sure if there should be any branching penalty at all.
The blue curve behaves quite differently. Run time is constant for small ranges and increases logarithmically thereafter. Yet for larger ranges, the curve seems to be approaching a constant asymptote again. How exactly can the qualitative differences of both curves be explained?
I am currently using GCC MinGW Win32 x86 v.4.8.1 with g++ -std=c++11 -ftemplate-depth=65536 and no compiler optimization.
Any help would be appreciated. I am also interested in any idea on how to improve the experiment itself. Thanks in advance!
First, let me say that I really like how you've approached this problem, this is a really neat solution for intentional code bloating. However, there might still be several possible issues with your test -
You are also measuring the warm-up time. You didn't show where you placed your time checks, but if it's just around the internal loop, then until you reach range/2 you would still enjoy the warm-up of the previous outer iteration. Instead, measure only warm performance: run each internal iteration several times (add another loop in the middle) and take the timestamp only after 1-2 rounds (see the sketch after this list).
You claim to have measured several cache levels, but your L1 cache is only 32k, which is where your graph ends. Even assuming this counts in terms of "range", each function is ~21 bytes (at least on my gcc 4.8.1), so you'll reach at most 256KB, which is only then scratching the size of your L2.
You didn't specify your CPU model (the i7 has at least 4 generations on the market now: Haswell, Ivy Bridge, Sandy Bridge and Nehalem). The differences are quite large, for example an additional uop cache since Sandy Bridge with complicated storage rules and conditions. Your baseline is also complicating things: if I recall correctly, the P4 had a trace cache which might also cause all sorts of performance impacts. You should check for an option to disable them if possible.
Don't forget the TLB - even though it probably doesn't play a role here in such a tightly organized code, the number of unique 4k pages should not exceed the ITLB (128 entries), and even before that you may start having collisions if your OS did not spread the physical code pages well enough to avoid ITLB collisions.
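A possible shape for the warm-up fix from the first point above (my own sketch, using std::chrono; the inner repetition discards the cold round and keeps the best warm one):

// needs #include <algorithm>, <chrono>, <cstdio>
// ... replacing the body of the outer 'range' loop in main() ...
for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
{
    double best_ns_per_call = 1e300;
    const int rounds = 4;                     // round 0 is a warm-up and is discarded
    for (int round = 0; round < rounds; ++round)
    {
        checksum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (unsigned i = 0; i < WORKLOAD; ++i)
            funcs[i % range]();
        auto t1 = std::chrono::steady_clock::now();

        if (round == 0)
            continue;                         // caches and predictors are now warm
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / WORKLOAD;
        best_ns_per_call = std::min(best_ns_per_call, ns);
    }
    std::printf("range %u: %.2f ns/call\n", range, best_ns_per_call);
}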

The overhead of an OpenCL or CUDA call?

I'm writing a function that does a lot of BLAS gemv operations.
I would like to be able to do this on the GPU, and I've tried with cuBlas.
My problem is that my matrices and vectors are rather small: 100x100 matrices and 100-element vectors. cuBLAS takes ages compared to the CPU, and I see why: a mixture of fast cache on the CPU and a large overhead in making the calls to the GPU.
Therefore I'm trying to figure out a smart way of measuring the time it takes to communicate the call to the GPU.
That is, the time it takes CUDA to set up the call and send it to the graphics processor -- not counting the time it actually takes to do the matrix-vector multiplication.
How would I go about doing this?
Update: The following results are for a hand-written FFT GPU algorithm on 2005 hardware (nVidia 7800 GTX), but they show the principle of CPU-GPU transfer bottlenecks.
The overhead is not the call per-se but compilation of the GPU program and transferring the data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache and the latency of DDR3 memory is far lower than the PCI-Express bus which services the GPU. I have experienced this myself when writing GPU FFT routines (prior to CUDA). Please see this related question.
N        FFTw (ms)  GPUFFT (ms)  GPUFFT MFLOPS  GPUFFT Speedup
8        0          0.06         3.352705       0.006881
16       0.001      0.065        7.882117       0.010217
32       0.001      0.075        17.10887       0.014695
64       0.002      0.085        36.080118      0.026744
128      0.004      0.093        76.724324      0.040122
256      0.007      0.107        153.739856     0.066754
512      0.015      0.115        320.200892     0.134614
1024     0.034      0.125        657.735381     0.270512
2048     0.076      0.156        1155.151507    0.484331
4096     0.173      0.215        1834.212989    0.804558
8192     0.483      0.32         2664.042421    1.510011
16384    1.363      0.605        3035.4551      2.255411
32768    3.168      1.14         3450.455808    2.780041
65536    8.694      2.464        3404.628083    3.528726
131072   15.363     5.027        3545.850483    3.05604
262144   33.223     12.513       3016.885246    2.655183
524288   72.918     25.879       3079.443664    2.817667
1048576  173.043    76.537       2192.056517    2.260904
2097152  331.553    157.427      2238.01491     2.106081
4194304  801.544    430.518      1715.573229    1.861814
The table above shows timings of a GPU FFT implementation vs CPU implementation based on kernel size. For smaller sizes, the transfer of data to/from the GPU dominates. Smaller kernels can be performed on the CPU, some implementations/sizes entirely in the cache. This makes the CPU the best choice for small operations.
If on the other hand you need to perform large batches of work on data with minimal moves to/from the GPU then the GPU will beat the CPU hands down.
In so far as measuring the effect in your example, I would suggest performing an experiment like the above. Try to work out the FLOPS computed for each size of matrix and run the test on the CPU and GPU for varying sizes of matrix. Output to a CSV file the size, time and FLOPS for GPU vs CPU. For any profiling ensure you run several hundred iterations of your code and time the whole thing, then divide the total time by iterations to get the loop time. Try different shaped matrices also if your algorithm allows (e.g. 10x100 rather than 100x10).
Using this data you can get a feel for what the overheads are. To find out exactly repeat the same experiment but replace the inner shader code executed on the GPU with no-operation (simply copy from input to output).
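A loose sketch of that experiment loop; cpu_gemv() and gpu_gemv() are hypothetical wrappers around the CPU implementation and the cuBLAS call:

#include <chrono>
#include <cstdio>

// Hypothetical wrappers; each performs one n x n gemv.
void cpu_gemv(int n) { /* CPU implementation goes here */ }
void gpu_gemv(int n) { /* cuBLAS call goes here */ }

template <typename F>
double time_ms(F&& f, int iterations)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iterations;
}

int main()
{
    std::printf("n,cpu_ms,gpu_ms,cpu_gflops,gpu_gflops\n");
    for (int n = 16; n <= 2048; n *= 2)
    {
        const int iters = 500;          // time the whole batch, divide by iterations
        double cpu = time_ms([&]{ cpu_gemv(n); }, iters);
        double gpu = time_ms([&]{ gpu_gemv(n); }, iters);
        double flops = 2.0 * n * n;     // roughly 2*n*n FLOPs per gemv
        std::printf("%d,%.4f,%.4f,%.3f,%.3f\n",
                    n, cpu, gpu, flops / (cpu * 1e6), flops / (gpu * 1e6));
    }
    return 0;
}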
Hope this helps,
You can get the time in nanoseconds from the device when an event was queued, submitted, started, and finished by using clGetEventProfilingInfo on your buffer transfer event.
more info, and how to set it up here: http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
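A fragment along those lines, assuming the command queue was created with the CL_QUEUE_PROFILING_ENABLE property and evt is the event returned by the enqueued transfer (or kernel):

#include <CL/cl.h>
#include <cstdio>

// Prints the queue->submit, submit->start and start->end intervals of an event.
void print_event_times(cl_event evt)
{
    cl_ulong queued = 0, submitted = 0, started = 0, ended = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT, sizeof(submitted), &submitted, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(started),   &started,   nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(ended),     &ended,     nullptr);

    // All values are device timestamps in nanoseconds.
    std::printf("queue->submit: %llu ns, submit->start: %llu ns, start->end: %llu ns\n",
                (unsigned long long)(submitted - queued),
                (unsigned long long)(started - submitted),
                (unsigned long long)(ended - started));
}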
I think that for 100x100 matrices, you may be better off sticking to the CPU for the crunching. Unless you have many to multiply at the same time, the benefit of the GPU will be hardly noticeable due to the (small) transfer overhead and usually much lower clock speeds. Make sure you tweak your kernel to use as much of the local data as possible - on my hardware, there is 32KB per work group, and that should be plenty to hold two 100x100 matrices. The built-in dot product functions should also be very handy.
There was an awesome talk about this at AFDS last year (see sessionId: 2908)
http://developer.amd.com/afds/pages/OLD/sessions.aspx
They talk in detail about optimizing the kernel, and hard-coding the optimal sizes.
Are your matrices already on the GPU?
If not, CUBLAS might transfer them for you (known as thunking), which is an additional overhead.
Also, GPUs do not really shine for such small computations, i.e. it will probably be slower than CPUs since you have to transfer your result back.
If you can, use bigger matrices.
Otherwise you might want to use streams (cudaStream_t) to start multiple parallel computations on the GPU.
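A hedged sketch of the streams idea combined with cuBLAS; the device pointers d_A[i], d_x[i], d_y[i] are assumed to be already allocated and populated:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Launches 'batch' independent n x n gemv operations, one CUDA stream each,
// so the small calls can overlap instead of serializing their launch overhead.
void batchedGemv(cublasHandle_t handle,
                 double* const* d_A, double* const* d_x, double* const* d_y,
                 int n, int batch)
{
    const double alpha = 1.0, beta = 0.0;
    std::vector<cudaStream_t> streams(batch);

    for (int i = 0; i < batch; ++i)
    {
        cudaStreamCreate(&streams[i]);
        cublasSetStream(handle, streams[i]);    // route the next cuBLAS call to this stream
        cublasDgemv(handle, CUBLAS_OP_N, n, n, &alpha,
                    d_A[i], n, d_x[i], 1, &beta, d_y[i], 1);
    }

    for (int i = 0; i < batch; ++i)
    {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}

Whether this actually beats a well-threaded CPU gemv for 100x100 matrices still has to be measured, as the other answers stress.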
If you want to measure the execution time of a kernel in CUDA, you need to enclose that (or anything else that computes on the GPU) in events, like this when using the CUDA runtime API:
cudaEvent_t start, stop;
cudaEventCreate(&start);   // events must be created before use
cudaEventCreate(&stop);
cudaEventRecord(start);    // note: cudaEventRecord takes the event by value, not a pointer

struct timeval cpuStart, cpuEnd;
gettimeofday(&cpuStart, 0); // get start time on CPU

// Do something with CUDA on the GPU, e.g. call kernels, transfer memory, ...

gettimeofday(&cpuEnd, 0); // get end time on CPU
double seconds = cpuEnd.tv_sec - cpuStart.tv_sec;
double microseconds = cpuEnd.tv_usec - cpuStart.tv_usec;
double cpuDuration = (seconds * 1.0e6 + microseconds) / 1.0e3; // in milliseconds

cudaEventRecord(stop);

// Wait until the stop event occurred
cudaError_t eventResult;
do
{
    eventResult = cudaEventQuery(stop);
}
while (eventResult == cudaErrorNotReady);

// Assert there was no error; check the CUDA Toolkit Reference for further info
assert(cudaSuccess == eventResult); // requires #include <assert.h> or <cassert>

// Retrieve the time between the two events
float gpuDuration = 0.0f; // in milliseconds
cudaEventElapsedTime(&gpuDuration, start, stop);

// Release the event objects
cudaEventDestroy(stop);
cudaEventDestroy(start);
You might want to check the error code of every call to CUDA (at least with an assert), as you may get errors from previous calls, resulting in hours of debugging...
(Note: I mostly use the CUDA driver API, so this might not work out of the box. Sorry for that.)
EDIT: Just saw that you want to measure the call itself, not the duration of the kernel.
You can do that by simply measuring the time on the CPU for the call - see the updated code above.
This works only on Linux because gettimeofday is not available for Windows (AFAIK).
To find the call overhead, call a CUDA kernel that does as little as possible.
for (int i = 0; i < NLoops; i++) {
    gettimeofday(&cpuStart, 0); // get start time on CPU

    // Call minimal CUDA kernel

    gettimeofday(&cpuEnd, 0);   // get end time on CPU

    // save elapsed time
}
Follow the code of Alex P. above.
The less processing you do in the kernel, the more the time difference will be only the call overhead.
Do a little experimenting to find a good value for NLoops (maybe 1,000,000). Be sure that the elapsed time is longer than the resolution of your timer, or you'll end up with all zeros. If that happens, write some kernel code that executes in a fixed time interval that you can predict (n loops of x cycles each).
It's hard to remove all the non-CUDA computations that might occur between cpuStart and cpuEnd (like interrupt processing), but making several runs and averaging can give good results.