How to make a simple profiling tool for OpenCL programs? - profiling

I have a task to make a simple profiling tool (for Windows) for performance/timing/event analysis of OpenCL programs. Can someone give advice on how to start?

The simplest approach, and one that works accurately on all platforms, is event-based profiling:
cl_event perfEvent;
cl_ulong start = 0, end = 0;
float t_kernel;
/* Enqueue the kernel, attaching an event to it */
clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, &perfEvent);
clWaitForEvents(1, &perfEvent);
/* Get the execution time (the timestamps are in nanoseconds) */
clGetEventProfilingInfo(perfEvent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(perfEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
t_kernel = (end - start) / 1000000.0f; /* nanoseconds -> milliseconds */
std::cout << t_kernel << std::endl;
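Note that the profiling timestamps are only valid if the command queue was created with profiling enabled. A minimal sketch, assuming context and device are the objects commandQueue was created from:
cl_int err;
cl_command_queue commandQueue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);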

Take a look at AMD CodeXL. It's free and it might be just what you are looking for. Inside CodeXL, use the Application Timeline Trace mode (Profile -> Application Timeline Trace), which executes a program and generates a visual timeline that displays OpenCL events like kernel dispatch and data transfer operations.

Related

OpenCL event-based timer (c++ wrapper)

I am new to OpenCL and ran into some problems measuring kernel runtime.
I use a C++ wrapper to deal with events and profiling.
context = Context(device);
queue = CommandQueue(context, device);
/** do something */
cl_ulong time_end, time_start, time_total;
Event event;
// Launch the kernel
queue.enqueueNDRangeKernel(kernel, NULL, global_work_size, local_work_size, NULL, &event);
queue.finish();
// Get event info and print GPU runtime
event.wait();
event.getProfilingInfo(CL_PROFILING_COMMAND_START, &time_start);
event.getProfilingInfo(CL_PROFILING_COMMAND_END, &time_end);
time_total = time_end - time_start;
printf("%lu\t%lu\t", time_total);
printf("\nRendering time (ms): \t%lu\n\n", time_total);
I get results which are obviously not in the scale they should be:
6052843157020279026 140734592538400
Rendering time (ms): 12394041651281810990
Then I used the normal timer provided by sys/time.h and got 0.02 s.
Did I miss something critical in my code? Something like missing synchronization or invalid initialization? Thanks!
Command Queues need to be created with the express request to include profiling information, or their profiling commands won't work correctly.
I don't recognize the API you're using to create OpenCL objects, but this is what that would look like in the native OpenCL API:
// OpenCL 1.x (clCreateCommandQueue, deprecated since OpenCL 2.0)
cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
// OpenCL 2.0+ (clCreateCommandQueueWithProperties)
cl_queue_properties properties[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, properties, &err);
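If the wrapper in the question is the official OpenCL C++ bindings (cl.hpp / cl2.hpp), the equivalent fix would look roughly like this, passing the profiling flag to the CommandQueue constructor (a sketch reusing the names from the question):
queue = CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE);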

How to properly use a hardware accelerated Media Foundation Source Reader to decode a video?

I'm in the process of writing a hardware accelerated h264 decoder using Media Foundation's Source Reader, but have encountered a problem. I followed this tutorial and supported myself with Windows SDK Media Foundation samples.
My app seems to work fine when hardware acceleration is turned off, but it doesn't provide the performance I need. When I turn the acceleration on by passing an IMFDXGIDeviceManager to the IMFAttributes used to create the reader, things get complicated.
If I create the ID3D11Device using the D3D_DRIVER_TYPE_NULL driver, the app works fine and the frames are processed faster than in the software mode, but judging by the CPU and GPU usage it still does the majority of the processing on the CPU.
On the other hand, when I create the ID3D11Device using a D3D_DRIVER_TYPE_HARDWARE driver and run the app, one of these four things can happen.
I only get an unpredictable number of frames (usually 1-3) before the IMFMediaBuffer::Lock function returns 0x887a0005, which is described as "The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action". When I call ID3D11Device::GetDeviceRemovedReason, I get 0x887a0020, which is described as "The driver encountered a problem and was put into the device removed state", which isn't as helpful as I wish it to be.
The app crashes in an external dll on the IMFMediaBuffer::Lock call. It seems that the dll depends on the GPU used. For the Intel integrated GPU it's igd10iumd32.dll and for the Nvidia mobile GPU it's mfplat.dll. The message for this particular crash is as follows: "Exception thrown at 0x53C6DB8C (mfplat.dll) in decoder_tester.exe: 0xC0000005: Access violation reading location 0x00000024". The addresses are different between executions and sometimes it involves reading, sometimes writing.
The graphics driver stops responding, the system hangs for a short time and then the application crashes like in point 2 or finishes like in point 1.
The app works fine and processes all the frames with hardware acceleration.
Most of the time it's 1 or 2, seldom 3 or 4.
Here's what the CPU/GPU usage is like when processing without throttling in different modes on my machine (Intel Core i5-6500 with HD Graphics 530, Windows 10 Pro).
NULL - CPU: ~90%, GPU: ~15%
HARDWARE - CPU: ~15%, GPU: ~60%
SOFTWARE - CPU: ~40%, GPU: ~7%
I tested the app on three machines. All of them had Intel integrated GPUs (HD 4400, HD 4600, HD 530). One of them also had a switchable Nvidia dedicated GPU (GF 840M). It behaves identically on all of them; the only difference is that it crashes in a different dll when Nvidia's GPU is used.
I have no previous experience with COM or DirectX, but all of this is inconsistent and unpredictable, so it looks like memory corruption to me. Still, I don't know where I'm making the mistake. Could you please help me find what I'm doing wrong?
The minimal code example I could come up with is below. I'm using Visual Studio Professional 2015 to compile it as a C++ project. I prepared definitions to enable hardware acceleration and select the hardware driver. Comment them out to change the behavior. Also, the code expects this video file to be present in the project directory.
#include <iostream>
#include <string>
#include <atlbase.h>
#include <d3d11.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <windows.h>
#pragma comment(lib, "d3d11.lib")
#pragma comment(lib, "mf.lib")
#pragma comment(lib, "mfplat.lib")
#pragma comment(lib, "mfreadwrite.lib")
#pragma comment(lib, "mfuuid.lib")
#define ENABLE_HW_ACCELERATION
#define ENABLE_HW_DRIVER
void handle_result(HRESULT hr)
{
if (SUCCEEDED(hr))
return;
WCHAR message[512];
FormatMessage(FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS, nullptr, hr,
MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT), message, ARRAYSIZE(message), nullptr);
printf("%ls", message);
abort();
}
int main(int argc, char** argv)
{
handle_result(CoInitializeEx(nullptr, COINIT_APARTMENTTHREADED | COINIT_DISABLE_OLE1DDE));
handle_result(MFStartup(MF_VERSION));
{
CComPtr<IMFAttributes> attributes;
handle_result(MFCreateAttributes(&attributes, 3));
#if defined(ENABLE_HW_ACCELERATION)
CComPtr<ID3D11Device> device;
D3D_FEATURE_LEVEL levels[] = { D3D_FEATURE_LEVEL_11_1, D3D_FEATURE_LEVEL_11_0 };
#if defined(ENABLE_HW_DRIVER)
handle_result(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, D3D11_CREATE_DEVICE_SINGLETHREADED | D3D11_CREATE_DEVICE_VIDEO_SUPPORT,
levels, ARRAYSIZE(levels), D3D11_SDK_VERSION, &device, nullptr, nullptr));
#else
handle_result(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_NULL, nullptr, D3D11_CREATE_DEVICE_SINGLETHREADED,
levels, ARRAYSIZE(levels), D3D11_SDK_VERSION, &device, nullptr, nullptr));
#endif
UINT token;
CComPtr<IMFDXGIDeviceManager> manager;
handle_result(MFCreateDXGIDeviceManager(&token, &manager));
handle_result(manager->ResetDevice(device, token));
handle_result(attributes->SetUnknown(MF_SOURCE_READER_D3D_MANAGER, manager));
handle_result(attributes->SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, TRUE));
handle_result(attributes->SetUINT32(MF_SOURCE_READER_ENABLE_ADVANCED_VIDEO_PROCESSING, TRUE));
#else
handle_result(attributes->SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, TRUE));
#endif
CComPtr<IMFSourceReader> reader;
handle_result(MFCreateSourceReaderFromURL(L"Rogue One - A Star Wars Story - Trailer.mp4", attributes, &reader));
CComPtr<IMFMediaType> output_type;
handle_result(MFCreateMediaType(&output_type));
handle_result(output_type->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video));
handle_result(output_type->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32));
handle_result(reader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, nullptr, output_type));
unsigned int frame_count{};
std::cout << "Started processing frames" << std::endl;
while (true)
{
CComPtr<IMFSample> sample;
DWORD flags;
handle_result(reader->ReadSample(MF_SOURCE_READER_FIRST_VIDEO_STREAM,
0, nullptr, &flags, nullptr, &sample));
if (flags & MF_SOURCE_READERF_ENDOFSTREAM || sample == nullptr)
break;
std::cout << "Frame " << frame_count++ << std::endl;
CComPtr<IMFMediaBuffer> buffer;
BYTE* data;
handle_result(sample->ConvertToContiguousBuffer(&buffer));
handle_result(buffer->Lock(&data, nullptr, nullptr));
// Use the frame here.
buffer->Unlock();
}
std::cout << "Finished processing frames" << std::endl;
}
MFShutdown();
CoUninitialize();
return 0;
}
Your code is conceptually correct, with only one remark, and it's not quite obvious: the Media Foundation decoder is multithreaded, and you are feeding it a single-threaded version of the Direct3D device. You have to work around this, or you get what you are currently getting: access violations and freezes, that is, undefined behavior.
// NOTE: No single threading
handle_result(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr,
(0 * D3D11_CREATE_DEVICE_SINGLETHREADED) | D3D11_CREATE_DEVICE_VIDEO_SUPPORT,
levels, ARRAYSIZE(levels), D3D11_SDK_VERSION, &device, nullptr, nullptr));
// NOTE: Getting ready for multi-threaded operation
const CComQIPtr<ID3D11Multithread> pMultithread = device;
pMultithread->SetMultithreadProtected(TRUE);
Also note that this straightforward code sample has a performance bottleneck around the lines you added for getting a contiguous buffer. Presumably you added them to get access to the data; however, the behavior by design is that decoded data stays in video memory, and your transfer to system memory is an expensive operation, so you added a severe performance hit to the loop. Checking the validity of the data this way is fine, but for performance benchmarking you should rather comment it out.
The output types of H264 video decoder can be found here: https://msdn.microsoft.com/en-us/library/windows/desktop/dd797815(v=vs.85).aspx.
RGB32 is not one of them. In this case your app relies on the Video Processor MFT to do the conversion from any of MFVideoFormat_I420, MFVideoFormat_IYUV, MFVideoFormat_NV12, MFVideoFormat_YUY2, or MFVideoFormat_YV12 to RGB32. I suspect that it's the Video Processor MFT that acts strangely and causes your program to misbehave. By setting NV12 as the output subtype for the decoder you get rid of the Video Processor MFT, and the following lines of code become unnecessary as well:
handle_result(attributes->SetUINT32(MF_SOURCE_READER_ENABLE_ADVANCED_VIDEO_PROCESSING, TRUE));
and
handle_result(attributes->SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, TRUE));
Moreover, as you noticed, NV12 is the only format that works properly. I think the reason is that it is the only one used in accelerated scenarios by the D3D and DXGI device manager.
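For reference, switching the output subtype in the question's code to NV12 would look roughly like this (a sketch based on the code above; the frames delivered to the loop will then be planar NV12 rather than RGB32):
// Request NV12 directly from the H.264 decoder so no Video Processor MFT is inserted
handle_result(output_type->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video));
handle_result(output_type->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_NV12));
handle_result(reader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, nullptr, output_type));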

How to obtain handles for all children process of current process in Windows?

For the purposes of performance monitoring on Windows OS, I need a program which can report both user and kernel times for an arbitrary process. On POSIX systems, standard time utility is perfectly OK as it reports wall clock time, user time and kernel time.
For Windows, there is no such utility by default. I looked around and found at least three alternatives. As I explain below, none of them actually suits my needs.
timeit from Windows SDK (cannot recall what exact version). It is no longer distributed, supported, or guaranteed to work on modern systems. I was not able to test it.
Cygwin's time. Almost identical to POSIX counterpart with similar output formatting.
timep.exe by Johnson (John) Hart, available in source code and binaries for his book "Windows System Programming, 4th Edition". This is a pretty simple utility that uses WinAPI's GetProcessTimes() to obtain the very same three values. I suspect that Cygwin's time is no different in that regard.
Now the problem: GetProcessTimes() only reports times for the PID directly spawned by timep, but not its children. This makes both time and timep useless for me.
My target EXE application is typically spawned through a BAT file which invokes one more BAT file; both BATs are meant to tune environment or alter command line arguments:
timep.exe
|
+---wrapper.bat
|
+--- real-wrapper.bat
|
+--- application.exe
Times reported for wrapper.bat alone tell nothing about application.exe.
Obviously, process creation models of POSIX (fork-exec) and Win32 (CreateProcess) are very different, which makes my goal that hard to achieve on Windows.
I want to try to write my own variant of time. It has to sum up the times for a given process and all of its children, grandchildren, etc., recursively. So far I can imagine the following approach:
CreateProcess() and get its PID (root PID) and handle; add this handle to a list
Enumerate all processes in the system; for each process:
Compare its parent PID with the root PID. If they match, get its PID and a handle to it and add them to the list (see the Toolhelp sketch after this list).
For every new PID, repeat the process-scan phase to collect more child handles.
Recurse until no new process handles are added to the list.
Wait for all collected handles from the list to terminate.
For each handle, call GetProcessTimes() and sum them up
Report results
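For illustration, the process-scan step (comparing parent PIDs against a known PID) could be sketched with the Toolhelp API roughly as follows; FindChildPids is a hypothetical helper, and the race described next still applies:
#include <windows.h>
#include <tlhelp32.h>
#include <vector>
/* Collect the PIDs whose parent is parentPid (one snapshot; children spawned
   after the snapshot, or already terminated, are missed). */
std::vector<DWORD> FindChildPids(DWORD parentPid)
{
    std::vector<DWORD> children;
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (snap == INVALID_HANDLE_VALUE)
        return children;
    PROCESSENTRY32 pe = { sizeof(pe) };
    if (Process32First(snap, &pe)) {
        do {
            if (pe.th32ParentProcessID == parentPid)
                children.push_back(pe.th32ProcessID);
        } while (Process32Next(snap, &pe));
    }
    CloseHandle(snap);
    return children;
}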
This algorithm is bad because it is racy: child processes may be created late in the life of any process, or they may terminate before we get a chance to obtain their handles. In both cases, the reported times will be incorrect.
My question is: Is there a better solution?
EDIT: I was able to achieve my goal by using Job Objects. Below is a code snippet extracted from my application, relevant to obtaining kernel and user times for a process and all of its children. Hopefully it will save some time for someone.
I tested it with Windows 8.1 x64 and VS 2015, but it should be backwards-portable to at least Windows 7. Some fiddling might be required for 32-bit hosts (I am not sure) with regard to long long types; I am not familiar with CL.EXE's way of dealing with them on such platforms.
#include <windows.h>
#include <string>
#include <cassert>
#include <iostream>
/* ... */
STARTUPINFO startUp = { sizeof(startUp) }; /* zero-initialized, cb set */
PROCESS_INFORMATION procInfo = {};
/* Start the program in a suspended state */
if (!CreateProcess(NULL, CmdParams, NULL, NULL, TRUE,
CREATE_SUSPENDED | NORMAL_PRIORITY_CLASS, NULL, NULL, &startUp, &procInfo)) {
DWORD err = GetLastError();
// TODO format error message
std::cerr << "Unable to start the process: " << err << std::endl;
return 1;
}
HANDLE hProc = procInfo.hProcess;
/* Create job object and attach the process to it */
HANDLE hJob = CreateJobObject(NULL, NULL); // XXX no security attributes passed
assert(hJob != NULL);
int ret = AssignProcessToJobObject(hJob, hProc);
assert(ret);
/* Now run the process and allow it to spawn children */
ResumeThread(procInfo.hThread);
/* Block until the process terminates */
if (WaitForSingleObject(hProc, INFINITE) != WAIT_OBJECT_0) {
DWORD err = GetLastError();
// TODO format error message
std::cerr << "Failed waiting for process termination: " << err << std::endl;
return 1;
}
DWORD exitcode = 0;
ret = GetExitCodeProcess(hProc, &exitcode);
assert(ret);
/* Calculate the wall-clock interval. Note that FILETIME values are expressed
   in 100-nanosecond units. Ignore user and kernel times here
   (third and fourth output parameters). */
FILETIME createTime, exitTime, unusedTime;
ret = GetProcessTimes(hProc, &createTime, &exitTime, &unusedTime, &unusedTime);
assert(ret);
LONGLONG createTimeNs = (LONGLONG)createTime.dwHighDateTime << 32 | createTime.dwLowDateTime;
LONGLONG exitTimeNs = (LONGLONG)exitTime.dwHighDateTime << 32 | exitTime.dwLowDateTime;
LONGLONG wallclockTimeNs = exitTimeNs - createTimeNs;
/* Get total user and kernel times for all processes of the job object */
JOBOBJECT_BASIC_ACCOUNTING_INFORMATION jobInfo;
ret = QueryInformationJobObject(hJob, JobObjectBasicAccountingInformation,
&jobInfo, sizeof(jobInfo), NULL);
assert(ret);
if (jobInfo.ActiveProcesses != 0) {
std::cerr << "Warning: there are still "
<< jobInfo.ActiveProcesses
<< " alive children processes" << std::endl;
/* We may kill survived processes, if desired */
TerminateJobObject(hJob, 127);
}
/* Get kernel and user times for the job; these are also in 100-nanosecond units */
LONGLONG kernelTimeNs = jobInfo.TotalKernelTime.QuadPart;
LONGLONG userTimeNs = jobInfo.TotalUserTime.QuadPart;
/* Clean up a bit */
CloseHandle(hProc);
CloseHandle(hJob);
Yes, from timep.exe create a job, and use job accounting. Child processes (unless created in their own jobs) share the job with their parent process.
This pretty much skips your steps 2-4.
I've packaged the solution to this problem into a standalone program for Windows called chronos. It creates a job object and then spawns the requested process inside it. All children spawned later stay in the same job object and thus can be accounted for later.

Wait for kernel to finish OpenCL

My OpenCL program doesn't always finish before further host (C++) code is executed. The OpenCL code is only executed up to a certain point (which appears to be random). The code is shortened a bit, so there may be a few things missing.
cl::Program::Sources sources;
string code = ResourceLoader::loadFile(filename);
sources.push_back({ code.c_str(),code.length() });
program = cl::Program(OpenCL::context, sources);
if (program.build({ OpenCL::default_device }) != CL_SUCCESS)
{
exit(-1);
}
queue = CommandQueue(OpenCL::context, OpenCL::default_device);
kernel = Kernel(program, "main");
Buffer b(OpenCL::context, CL_MEM_READ_WRITE, size);
queue.enqueueWriteBuffer(b, CL_TRUE, 0, size, arg);
buffers.push_back(b);
kernel.setArg(0, this->buffers[0]);
vector<Event> wait{ Event() };
Version 1:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &wait[0]);
Version 2:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, &wait, NULL);
Then, in both versions, the host code continues with:
wait[0].wait();
queue.finish();
Version 1 just does not wait for the OpenCL program. Version 2 crashes the program (at queue.enqueueNDRangeKernel):
Exception thrown at 0x51D99D09 (nvopencl.dll) in foo.exe: 0xC0000005: Access violation reading location 0x0000002C.
How would one make the host wait for the GPU to finish here?
EDIT: queue.enqueueNDRangeKernel returns -1000, while it returns 0 for a rather small kernel.
Version 1 says to signal wait[0] when the kernel is finished - which is the right thing to do.
Version 2 is asking your clEnqueueNDRangeKernel() to wait for the events in wait before it starts that kernel [which clearly won't work].
On its own, queue.finish() [or clFinish()] should be enough to ensure that your kernel has completed.
Since you haven't called clCreateUserEvent, and you haven't passed the event into anything else that initializes it, the second variant doesn't work.
It is rather bad that it crashes [it should return "invalid event" or some such - but presumably the driver you are using doesn't have a way to check that the event hasn't been initialized]. I'm reasonably sure the driver I work with will issue an error for this case - but I try to avoid getting it wrong...
I have no idea where -1000 comes from - it is neither a valid error code, nor a reasonable return value from the CL C++ wrappers. Whether the kernel is small or large [and/or completes in short or long time] shouldn't affect the return value from the enqueue, since all that SHOULD do is to enqueue the work [with no guarantee that it starts until a queue.flush() or clFlush is performed]. Waiting for it to finish should happen elsewhere.
I do most of my work via the raw OpenCL API, not the C++ wrappers, which is why I'm referring to what they do, rather than the C++ wrappers.
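In other words, a minimal correct pattern with the C++ wrapper (a sketch, reusing the names from the question) would be:
Event done;
// Pass the event as the output (last) argument, not as the wait list
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &done);
done.wait();      // blocks the host until the kernel has finished
queue.finish();   // on its own, this would also be sufficient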
I faced a similar problem with OpenCL, where some packets of a data stream were not processed by OpenCL.
I realized it only happens while the notebook is plugged into a docking station.
Maybe this helps someone.
(No clFlush or clFinish calls.)

Measuring execution time of OpenCL kernels

I have the following loop that measures the time of my kernels:
double elapsed = 0;
cl_ulong time_start, time_end;
for (unsigned i = 0; i < NUMBER_OF_ITERATIONS; ++i)
{
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &event); checkErr(err, "Kernel run");
err = clWaitForEvents(1, &event); checkErr(err, "Kernel run wait for event");
err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL); checkErr(err, "Kernel run get time start");
err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL); checkErr(err, "Kernel run get time end");
elapsed += (time_end - time_start);
}
Then I divide elapsed by NUMBER_OF_ITERATIONS to get the final estimate. However, I am afraid the execution time of individual kernels is too small and hence can introduce uncertainty into my measurement. How can I measure the time taken by all NUMBER_OF_ITERATIONS kernels combined?
Can you suggest a profiling tool which could help with this, as I do not need to access this data programmatically? I use NVIDIA's OpenCL.
You need to follow these steps to measure the execution time of an OpenCL kernel:
Create a queue; profiling needs to be enabled when the queue is created:
cl_command_queue command_queue;
command_queue = clCreateCommandQueue(context, devices[deviceUsed], CL_QUEUE_PROFILING_ENABLE, &err);
Attach an event when launching the kernel:
cl_event event;
err = clEnqueueNDRangeKernel(queue, kernel, work_dim, NULL, global_work_size, NULL, 0, NULL, &event);
Wait for the kernel to finish
clWaitForEvents(1, &event);
Wait for all enqueued tasks to finish
clFinish(queue);
Get profiling data and calculate the kernel execution time (returned by the OpenCL API in nanoseconds)
cl_ulong time_start;
cl_ulong time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double nanoSeconds = time_end-time_start;
printf("OpenCl Execution time is: %0.3f milliseconds \n",nanoSeconds / 1000000.0);
The profiling counters are in nanoseconds and are very accurate (~50 ns resolution); however, each run has a different execution time, depending on minor factors you can't control.
This narrows the problem down to what exactly you want to measure:
Measuring the kernel execution time: Your approach is correct, and the accuracy of the measured average execution time will increase as you increase the number of iterations. This accounts only for the execution time, with no overhead taken into consideration.
Measuring the kernel execution time + overhead: You should use events as well, but measure from CL_PROFILING_COMMAND_SUBMIT, to account for the extra launch overhead (see the sketch after this list).
Measuring the real host-side execution time: You should use events as well, but measure from the first event's start to the last event's end. Using CPU timing is another possibility. If you want to measure this, you should remove the clWaitForEvents call from the loop, to allow maximum throughput on the OpenCL side (and as little overhead as possible).
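For example, the second case only requires querying CL_PROFILING_COMMAND_SUBMIT instead of CL_PROFILING_COMMAND_START (a sketch reusing the variables from the question's loop):
cl_ulong time_submit, time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT, sizeof(time_submit), &time_submit, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
elapsed += (time_end - time_submit); /* includes submit-to-start overhead, still in nanoseconds */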
Answering the tools question: I recommend the NVIDIA Visual Profiler. But since it is no longer available for OpenCL, you should use the Visual Studio add-on or an old version (CUDA 3.0) of the NVIDIA profiler.
The time that is measured is returned in nanoseconds, but you're right: The resolution of the timer is lower. However, I wonder what the actual execution time of your kernel is when you say that the time is too short to measure accurately (my gut feeling is that the resolution should be in the range of microseconds).
The most appropriate way of measuring the total time of multiple iterations depends on what "multiple" means here. Is NUMBER_OF_ITERATIONS=5 or NUMBER_OF_ITERATIONS=500000? If the number of iterations is "large", you may just use the system clock, possibly with OS-specific functions like QueryPerformanceCounter on Windows (also see, for example, Is there a way to measure time up to micro seconds using C standard library?), but of course, the precision of the system clock might be lower than that of the OpenCL device, so whether this makes sense really depends on the number of iterations.
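A host-side sketch of that idea on Windows, assuming queue and the enqueue loop from the question:
#include <windows.h>
LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);
/* ... enqueue NUMBER_OF_ITERATIONS kernels here, then ... */
clFinish(queue);                      /* make sure all of them have completed */
QueryPerformanceCounter(&t1);
double seconds = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;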
It's a pity that NVIDIA removed OpenCL support from their Visual Profiler, though...
On Intel's OpenCL GPU implementation I've been successful with your approach (timing per kernel) and prefer it to batching a stream of NDRanges.
An alternate approach is to run it N times and measure the time with marker events, as in the approach proposed in this question (the question, not the answer).
Times for short kernels are generally at least in the microseconds realm in my experience.
You can check the timer resolution using clGetDeviceInfo with CL_DEVICE_PROFILING_TIMER_RESOLUTION (e.g. 80 ns on my setup).
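For example (a sketch, assuming device is your cl_device_id):
size_t resolution_ns = 0;  /* CL_DEVICE_PROFILING_TIMER_RESOLUTION is reported in nanoseconds */
clGetDeviceInfo(device, CL_DEVICE_PROFILING_TIMER_RESOLUTION, sizeof(resolution_ns), &resolution_ns, NULL);
printf("Profiling timer resolution: %zu ns\n", resolution_ns);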