Measuring execution time of OpenCL kernels - profiling

I have the following loop that measures the time of my kernels:
double elapsed = 0;
cl_ulong time_start, time_end;
for (unsigned i = 0; i < NUMBER_OF_ITERATIONS; ++i)
{
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &event); checkErr(err, "Kernel run");
err = clWaitForEvents(1, &event); checkErr(err, "Kernel run wait for event");
err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL); checkErr(err, "Kernel run get time start");
err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL); checkErr(err, "Kernel run get time end");
elapsed += (time_end - time_start);
}
Then I divide elapsed by NUMBER_OF_ITERATIONS to get the final estimate. However, I am afraid the execution time of individual kernels is too small and hence can introduce uncertainty into my measurement. How can I measure the time taken by all NUMBER_OF_ITERATIONS kernels combined?
Can you suggest a profiling tool that could help with this, as I do not need to access this data programmatically? I use NVIDIA's OpenCL.

You need to follow these steps to measure the execution time of an OpenCL kernel:
Create a command queue; profiling needs to be enabled when the queue is created:
cl_command_queue command_queue;
command_queue = clCreateCommandQueue(context, devices[deviceUsed], CL_QUEUE_PROFILING_ENABLE, &err);
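If you are targeting OpenCL 2.0 or later, where clCreateCommandQueue is deprecated, the equivalent call would look roughly like this (a sketch reusing the same context, devices, deviceUsed and err variables from above):
cl_queue_properties props[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };
command_queue = clCreateCommandQueueWithProperties(context, devices[deviceUsed], props, &err);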
Attach an event when launching the kernel:
cl_event event;
err = clEnqueueNDRangeKernel(queue, kernel, work_dim, NULL, global_work_size, NULL, 0, NULL, &event);
Wait for the kernel to finish
clWaitForEvents(1, &event);
Wait for all enqueued tasks to finish
clFinish(queue);
Get profiling data and calculate the kernel execution time (returned by the OpenCL API in nanoseconds)
cl_ulong time_start;
cl_ulong time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double nanoSeconds = time_end - time_start;
printf("OpenCL execution time is: %0.3f milliseconds\n", nanoSeconds / 1000000.0);

The profiling counters are reported in nanoseconds and are very accurate (~50 ns); however, each execution takes a slightly different amount of time, depending on minor factors you cannot control.
This reduces your problem to deciding what exactly you want to measure:
Measuring the kernel execution time: your approach is correct, and the accuracy of the measured average execution time will increase as you increase N. This accounts only for the execution time itself; no overhead is taken into consideration.
Measuring the kernel execution time plus overhead: use events as well, but measure from CL_PROFILING_COMMAND_SUBMIT, to account for the extra queuing and submission overhead.
Measuring the real host-side execution time: use events as well, but measure from the first event's start to the last event's end (see the sketch below). Using CPU timing is another possibility. If you want to measure this, you should remove the clWaitForEvents call from the loop, to allow maximum throughput to the OpenCL system (and as little overhead as possible).
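For that last case, a minimal sketch could look like this (reusing the queue, kernel, global and NUMBER_OF_ITERATIONS from the question, and assuming an in-order queue created with CL_QUEUE_PROFILING_ENABLE):
cl_event events[NUMBER_OF_ITERATIONS];
for (unsigned i = 0; i < NUMBER_OF_ITERATIONS; ++i)
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &events[i]);
clFinish(queue);   // no per-kernel wait, so the queue runs at full throughput
cl_ulong batch_start, batch_end;
clGetEventProfilingInfo(events[0], CL_PROFILING_COMMAND_START, sizeof(batch_start), &batch_start, NULL);
clGetEventProfilingInfo(events[NUMBER_OF_ITERATIONS - 1], CL_PROFILING_COMMAND_END, sizeof(batch_end), &batch_end, NULL);
double total_ms = (batch_end - batch_start) / 1000000.0;   // device time for all N kernels, gaps included
for (unsigned i = 0; i < NUMBER_OF_ITERATIONS; ++i)
    clReleaseEvent(events[i]);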
Answering the tools question: I recommend the NVIDIA Visual Profiler. But since it is no longer available for OpenCL, you should use the Visual Studio add-on or an old version (CUDA 3.0) of the nvprofiler.

The measured time is returned in nanoseconds, but you're right: the actual resolution of the timer is lower. However, I wonder what the execution time of your kernel really is when you say it is too short to measure accurately (my gut feeling is that the resolution should be in the range of microseconds).
The most appropriate way of measuring the total time of multiple iterations depends on what "multiple" means here. Is NUMBER_OF_ITERATIONS=5 or NUMBER_OF_ITERATIONS=500000? If the number of iterations is "large", you may just use the system clock, possibly with OS-specific functions like QueryPerformanceCounter on Windows (also see, for example, "Is there a way to measure time up to micro seconds using C standard library?"). Of course, the precision of the system clock might be lower than that of the OpenCL device timer, so whether this makes sense really depends on the number of iterations.
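For a large NUMBER_OF_ITERATIONS, a host-side sketch on Windows might look like this (reusing the queue, kernel and global from the question; the clFinish makes sure all kernels have actually completed before the second timestamp):
#include <windows.h>   // QueryPerformanceCounter / QueryPerformanceFrequency
LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);
for (unsigned i = 0; i < NUMBER_OF_ITERATIONS; ++i)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
clFinish(queue);                          // wait for the whole batch to finish
QueryPerformanceCounter(&t1);
double total_ms = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
printf("Host-side time for %u kernels: %.3f ms\n", NUMBER_OF_ITERATIONS, total_ms);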
It's a pity that NVIDIA removed OpenCL support from their Visual Profiler, though...

On Intel's OpenCL GPU implementation I've been successful with your approach (timing per kernel) and prefer it to batching a stream of NDRanges.
An alternate approach is to run it N times and measure the time with marker events, as in the approach proposed in this question (the question, not the answer).
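A rough sketch of that marker-based variant (assuming OpenCL 1.2, an in-order queue created with profiling enabled, and the same queue, kernel and global as in the question; marker profiling semantics may vary slightly between implementations):
cl_event start_marker, end_marker;
clEnqueueMarkerWithWaitList(queue, 0, NULL, &start_marker);
for (unsigned i = 0; i < NUMBER_OF_ITERATIONS; ++i)
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
clEnqueueMarkerWithWaitList(queue, 0, NULL, &end_marker);
clFinish(queue);
cl_ulong t0, t1;
clGetEventProfilingInfo(start_marker, CL_PROFILING_COMMAND_END, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(end_marker, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
printf("Batch time: %.3f ms\n", (t1 - t0) / 1000000.0);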
Times for short kernels are generally at least in the microseconds realm in my experience.
You can check the timer resolution using clGetDeviceInfo with CL_DEVICE_PROFILING_TIMER_RESOLUTION (e.g. 80 ns on my setup).
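For example (a sketch, assuming device is the cl_device_id you are already using):
size_t resolution;   // reported in nanoseconds
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION, sizeof(resolution), &resolution, NULL);
printf("Profiling timer resolution: %zu ns\n", resolution);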

Related

Monitor task CPU utilization in VxWorks while program is running

I'm running a VxWorks 6.9 OS embedded system and I need to see when I'm starving low priority tasks. Ideally I'd like to have CPU utilization by task so I know what is eating up all my CPU time.
I know this is a built in feature in many operating systems but have been so far unable to find it for VxWorks 6.9.
If I can't measure by task, I'd like to at least see what percentage of time the CPU is idle.
To that end I've been trying to make a lowest priority task that will run the function below that would try to measure it indirectly.
float Foo::IdleTime(Foo* f)
{
    bool inIdleTask;
    float startTime, returnTime;
    float timeIdle;
    float totalTime;
    float percentIdle;
    while(true)
    {
        startTime = _time(); // get time before measurement starts
        inIdleTask = true;
        timeIdle = 0;
        while(inIdleTask) // I have no clue how to detect when the task left and set this to false
        {
            timeIdle += (amount_of_time_for_inner_loop); // measure idle time
        }
        returnTime = _time(); // get time after you return to the IdleTime task
        totalTime = returnTime - startTime;
        percentIdle = (timeIdle / totalTime) * 100; // calculate percentage of idle time
        // logic to report percentIdle
    }
}
The big problem with this concept is I don't know how I would detect when this task is left for a higher priority task.
If you are looking for a one-time measurement done during development, then spyLib is what you are looking for. Simply call spy from the command line to get a per-task CPU usage report in 10 s intervals. Call spyHelp to learn how to configure spy. (You might need to include spyLib in the kernel if it is not already included.)
If you want to go the extra mile, taskHookLib is what you need. Simply put, you hook a function to be called on every task switch. The call gives you the identities of the tasks going in and out of the CPU. You can either simply monitor the starvation of low-priority tasks, or take action and temporarily increase their priority.
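A very rough sketch of that approach (kernel-mode C using the standard taskHookLib, taskLib and tickLib routines; idleAccountingStart and the tick-based bookkeeping are hypothetical helpers, and the exact TCB handling may need adjusting for your VxWorks 6.9 configuration):
#include <vxWorks.h>
#include <taskLib.h>
#include <taskHookLib.h>
#include <tickLib.h>
static WIND_TCB *watchedTcb;      /* TCB of the low-priority task we watch */
static ULONG lastSwitchTick;      /* tick count at the previous task switch */
static ULONG watchedTicks;        /* accumulated ticks spent in the watched task */
static void switchHook (WIND_TCB *pOldTcb, WIND_TCB *pNewTcb)
{
    ULONG now = tickGet ();
    if (pOldTcb == watchedTcb)
        watchedTicks += now - lastSwitchTick;   /* time just spent in the watched task */
    lastSwitchTick = now;
}
STATUS idleAccountingStart (TASK_ID watchedTask)
{
    watchedTcb = taskTcb (watchedTask);
    lastSwitchTick = tickGet ();
    return taskSwitchHookAdd ((FUNCPTR) switchHook);
}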
From experience, spy adds a little performance overhead, especially if stdout goes to a slow I/O channel (e.g. a 9600-baud serial port), but it is fairly easy to use. Task hooking adds little to no overhead if you are not immediately printing the results to the terminal, but it takes a bit of programming to get running.
Another thing that might be of interest is Wind River's remote debugger. I haven't used that one personally; I imagine it would require setting up Workbench and the target properly.

Posix Timer on SCHED_RR Thread is using 100% CPU

I have the following code snippet:
#include <iostream>
#include <thread>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <cstdint> // for uint64_t
int main() {
std::thread rr_thread([](){
struct sched_param params = {5};
pthread_setschedparam(pthread_self(), SCHED_RR, &params);
struct itimerspec ts;
struct epoll_event ev;
int tfd, epfd;
uint64_t missed;
ts.it_interval.tv_sec = 0;
ts.it_interval.tv_nsec = 0;
ts.it_value.tv_sec = 0;
ts.it_value.tv_nsec = 20000; // 50 kHz timer
tfd = timerfd_create(CLOCK_MONOTONIC, 0);
timerfd_settime(tfd, 0, &ts, NULL);
epfd = epoll_create(1);
ev.events = EPOLLIN;
epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);
while (true) {
epoll_wait(epfd, &ev, 1, -1); // wait forever for the timer
read(tfd, &missed, sizeof(missed));
// Here i have a blocking function (dummy in this example) which
// takes on average 15ns to execute, less than the timer period anyways
func15ns();
}
});
rr_thread.join();
}
I have a POSIX thread using the SCHED_RR policy, and on this thread there is a POSIX timer running with a timeout of 20000 ns = 50 kHz = 50000 ticks/sec.
After the timer fires I execute a function that takes roughly 15 ns, so less than the timer period, but this doesn't really matter.
When I execute this I get 100% CPU usage, and the whole system becomes slow, but I don't understand why this is happening, and some things are confusing.
Why is there 100% CPU usage, when the thread is supposed to be sleeping while waiting for the timer to fire, so that other tasks can be scheduled, in theory, right? Even if this is a high-priority thread.
I checked the number of context switches using pidstat and it seems to be very small, close to 0, both voluntary and involuntary ones. Is this normal? While waiting for the timer to fire, the scheduler should schedule other tasks, right? I should see at least 20000 * 2 context switches/sec.
As presented, your program does not behave as you describe. This is because you program the timer as a one-shot, not a repeating timer. For a timer that fires every 20000 ns, you want to set a 20000-ns interval:
ts.it_interval.tv_nsec = 20000;
Having modified that, I get a program that produces a heavy load on one core.
Why is there 100% CPU usage, when the thread is supposed to be sleeping while waiting for the timer to fire, so that other tasks can be scheduled, in theory, right? Even if this is a high-priority thread.
Sure, your thread blocks in epoll_wait() to await timer ticks, if in fact it manages to loop back there before the timer ticks again. On my machine, your program consumes only about 30% of one core, which seems to confirm that such blocking will indeed happen. That you see 100% CPU use suggests that my computer runs the program more efficiently than yours does, for whatever reason.
But you have to appreciate that the load is very heavy. You are asking to perform all the processing of the timer itself, the epoll call, the read, and func15ns() once every 20000 ns. Yes, whatever time may be left, if any, is available to be scheduled for another task, but the task swap takes a bit more time again. 20000 ns is not very much time. Consider that just fetching a word from main memory costs about 100 ns (though reading one from cache is of course faster).
In particular, do not neglect the work other than func15ns(). If the latter indeed takes only 15 ns to run then it's the least of your worries. You're performing two system calls, and these are expensive. Just how expensive depends on a lot of factors, but consider that removing the epoll_wait() call reduces the load for me from 30% to 25% of a core (and note that the whole epoll setup is superfluous here because simply allowing the read() to block serves the purpose).
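A minimal sketch of that simplification (keeping the 20000 ns period from the modified example; uint64_t missed mirrors the counter read from the timerfd, and func15ns() is the dummy work from the question):
struct itimerspec ts {};
ts.it_value.tv_nsec = 20000;      // first expiration after 20 us
ts.it_interval.tv_nsec = 20000;   // then repeat every 20 us
int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
timerfd_settime(tfd, 0, &ts, nullptr);
uint64_t missed;                  // number of expirations since the last read
while (true) {
    read(tfd, &missed, sizeof(missed));   // blocks until at least one expiration
    func15ns();
}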
I checked the number of context switches using pidstat and it seems to be very small, close to 0, both voluntary and involuntary ones. Is this normal? While waiting for the timer to fire, the scheduler should schedule other tasks, right? I should see at least 20000 * 2 context switches/sec.
You're occupying a full CPU with a high priority task, so why do you expect switching?
On the other hand, I'm also observing a low number of context switches for the process running your (modified) program, even though it's occupying only 25% of a core. I'm not prepared at the moment to reason about why that is.

How to make comparable time measurements in CUDA and C++ code

I have a CUDA and a C++ implementation of the same algorithm. In CUDA I measure the time with events:
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0); // start time measurement
// some cuda stuff
cudaEventRecord(stop, 0); // stop time measurement
cudaEventSynchronize(stop); // sync results
cudaEventElapsedTime(&time, start, stop);
printf ("Elapsed time : %f ms\n", time);
In C++ I measure with gettimeofday:
struct timeval start, end;
long seconds, useconds;
float mseconds;
gettimeofday(&start, NULL);
// some work to do
gettimeofday(&end, NULL);
seconds = end.tv_sec - start.tv_sec;
useconds = end.tv_usec - start.tv_usec;
mseconds = (seconds * 1000 + useconds/1000.0) + 0.5;
printf ("Elapsed time : %f ms\n", mseconds);
Is this the correct way to get good, comparable results?
Thanks in advance!
Yes, this is a good way to get CPU-vs-GPU time comparisons.
There are multiple ways to get CPU timings, of course, ranging from high-resolution system timers to __rdtsc intrinsics. But for such a coarse comparison either should work just fine.
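For example, a portable alternative to gettimeofday on the CPU side could use std::chrono (just a sketch, not tied to the code above):
#include <chrono>
#include <cstdio>
auto t0 = std::chrono::steady_clock::now();
// some work to do
auto t1 = std::chrono::steady_clock::now();
double mseconds = std::chrono::duration<double, std::milli>(t1 - t0).count();
printf("Elapsed time : %f ms\n", mseconds);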
If you want to dig deeper into your GPU performance and look for potential areas of improvement, you may want to look at the command-line CUDA profiler nvprof, or at the Visual Profiler, which does the same thing but also has a GUI.
If you simply want to compare the whole execution time of your CUDA-related work, you can keep your C++ time measurements. Just ensure your device has finished every single task it had before checking the elapsed time:
gettimeofday(&start, NULL);
// some work to do
cudaDeviceSynchronize();
gettimeofday(&end, NULL);
This is a simple way to compare how much time your tasks take on the device side with CUDA versus on the CPU side.
As suggested by ApoorvaJ, if you need to go deeper into CUDA performance to check where the device bottlenecks are, you can use the Visual Profiler. If you are using Visual Studio, check these steps I wrote for another SO user who wanted to check the PTX code. You just have to explore the other data the Visual Profiler can provide, and there is a lot!
Check the Profiler section of the official CUDA documentation from NVIDIA.

Best option to profile CPU use in my program?

I am profiling CPU usage in a simple program I am writing. I have different algorithms I want to try, and I also want to know what the impact on total system performance is.
Currently, I am using ualarm() to execute some instructions at 30 Hz; every 15 of those interrupts (every 0.5 s) I record the CPU time with getrusage() (in microseconds), so I have an estimate of the total CPU consumption at that point in time. But to put that in context, I also need to know the total time elapsed in the system over that period, so I can compute the percentage of it that is used by my program.
/* Main Loop */
while(1)
{
alarm = 0;
/* Waiting Loop: */
for(i=0; !alarm; i++){
}
count++;
/* Do my things */
/* Check if it's time to store cpu log: */
if ((count%count_max) == 0)
{
getrusage(RUSAGE_SELF, &ru);
store_cpulog(f,
(int64_t) ru.ru_utime.tv_sec,
(int64_t) ru.ru_utime.tv_usec,
(int64_t) ru.ru_stime.tv_sec,
(int64_t) ru.ru_stime.tv_usec);
}
}
I have different options, but I don't know which one will provide the most accurate result:
Use ualarm for the timing. Currently it's programmed to signal every 0.5 seconds, so I can take those 0.5 seconds as the CPU time. It seems quite obvious to use, but is it the best option?
Use clock_gettime(CLOCK_MONOTONIC): it provides readings with nanosecond resolution (see the sketch after this list).
Use gettimeofday(): it provides readings with microsecond resolution. I've found opinions against using it.
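For reference, the clock_gettime option might look something like the following sketch (the getrusage bookkeeping mirrors the loop above; the measurement window is simply whatever elapses between the two samples):
#include <time.h>
#include <sys/resource.h>
struct timespec t0, t1;
struct rusage ru0, ru1;
clock_gettime(CLOCK_MONOTONIC, &t0);
getrusage(RUSAGE_SELF, &ru0);
/* ... the 0.5 s measurement window ... */
clock_gettime(CLOCK_MONOTONIC, &t1);
getrusage(RUSAGE_SELF, &ru1);
double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
double cpu = (ru1.ru_utime.tv_sec - ru0.ru_utime.tv_sec)
           + (ru1.ru_utime.tv_usec - ru0.ru_utime.tv_usec) / 1e6
           + (ru1.ru_stime.tv_sec - ru0.ru_stime.tv_sec)
           + (ru1.ru_stime.tv_usec - ru0.ru_stime.tv_usec) / 1e6;
double percent_cpu = 100.0 * cpu / wall;   /* share of wall-clock time spent on the CPU */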
Any recommendation? Thanks.
A possible solution is to use the system utility time and to avoid the busy loop in your program (as #Hasturkun says). Call in the console:
time /path/to/my/program
and after it finishes executing you will get something like:
real 0m1.465s
user 0m0.000s
sys 0m1.210s
I am not sure whether the precision is enough for you.
Callgrind is possibly the best application for profiling C/C++ code under Linux. Use it with pride :)

How to get an accurate 1ms Timer Tick under WinXP

I am trying to call a function every 1 ms. The problem is, I'd like to do this on Windows. So I tried the multimedia timer API.
Multimedia timer API
Source
idTimer = timeSetEvent(
1,
0,
TimerProc,
0,
TIME_PERIODIC|TIME_CALLBACK_FUNCTION );
My result was that most of the time the 1 ms was OK, but sometimes I got double the period. See the little bump at around 1.95 ms.
multimediatimerHistogram http://www.freeimagehosting.net/uploads/8b78f2fa6d.png
My first thought was that maybe my method was running too long. But I measured this already and this was not the case.
Queued Timers API
My next try was using the queued timer API with
hTimerQueue = CreateTimerQueue();
if(hTimerQueue == NULL)
{
printf("Error creating queue: 0x%x\n", GetLastError());
}
BOOL res = CreateTimerQueueTimer(
&hTimer,
hTimerQueue,
TimerProc,
NULL,
0,
1, // 1ms
WT_EXECUTEDEFAULT);
But the result was also not as expected. Now I get a 2 ms cycle time most of the time.
queuedTimer http://www.freeimagehosting.net/uploads/2a46259a15.png
Measurement
For measuring the times I used QueryPerformanceCounter and QueryPerformanceFrequency.
Question
So now my question is whether somebody has encountered similar problems under Windows and maybe even found a solution.
Thanks.
Without going to a real-time OS, you cannot expect to have your function called every 1 ms.
On Windows, which is NOT a real-time OS (it is similar on Linux), a program that repeatedly reads the current time with microsecond precision and stores consecutive differences in a histogram has a non-empty bin for >10 ms! This means that sometimes you will get 2 ms between your calls, but you can also get more.
You can try running timeBeginPeriod(1) at program start and timeEndPeriod(1) before quitting. This can probably improve timer precision.
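A minimal sketch of that suggestion (timeBeginPeriod/timeEndPeriod are part of the multimedia timer API; link against winmm.lib):
#include <windows.h>
#include <mmsystem.h>   // timeBeginPeriod / timeEndPeriod
timeBeginPeriod(1);      // request 1 ms timer granularity for the whole system
// ... create and run the multimedia timer or queue timer as above ...
timeEndPeriod(1);        // restore the previous granularity before exiting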
A call to NtQueryTimerResolution() will return a value for ActualResolution. In your case the actual resolution is almost certainly 0.9765625 ms. This is exactly what you show in the first plot.
The second occurrence, at about 1.95 ms, is more precisely Sleep(1) = 1.9531 ms = 2 x 0.9765625 ms.
I guess the interrupt period runs at something close to 1 ms (0.9765625 ms).
And now the trouble begins: The timer signals when the desired delay expires.
Say ActualResolution is set to 0.9765625 ms: the interrupt heartbeat of the system will run at 0.9765625 ms periods, or 1024 Hz, and a call to Sleep is made with a desired delay of 1 ms. Two scenarios have to be looked at:
The call was made < 1ms (ΔT) ahead of the next interrupt. The next interrupt will not confirm that the desired period of time has expired. Only the following interrupt will cause the call to return. The resulting sleep delay will be ΔT + 0.9765625 ms.
The call was made >= 1ms (ΔT) ahead of the next interrupt. The next interrupt will force the call to return. The resulting sleep delay will be ΔT.
So the result depends a lot on when the call was made and therefore you may observe 0.98ms events as well as 1.95ms events.
Edit: Using CreateTimerQueueTimer will push the observed delay to 1.95 ms because the timer tick (interrupt period) is 0.9765625 ms. On the first occurrence of the interrupt, the requested duration of 1 ms has not quite expired, thus the TimerProc will only be triggered after the second interrupt (2 x 0.9765625 ms = 1.953125 ms > 1 ms). Consequently, the queueTimer plot shows the peak at 1.953125 ms.
Note: This behavior strongly depends on the underlying hardware.
More details can be found at the Windows Timestamp Project