OpenCL event-based timer (C++ wrapper)

I am new to OpenCL and ran into some problems measuring kernel runtime.
I use a C++ wrapper to deal with events and profiling.
context = Context(device);
queue = CommandQueue(context, device);
/** do something */
cl_ulong time_end, time_start, time_total;
Event event;
// Launch the kernel
queue.enqueueNDRangeKernel(kernel, NULL, global_work_size, local_work_size, NULL, &event);
queue.finish();
// Get event info and print GPU runtime
event.wait();
event.getProfilingInfo(CL_PROFILING_COMMAND_START, &time_start);
event.getProfilingInfo(CL_PROFILING_COMMAND_END, &time_end);
time_total = time_end - time_start;
printf("%lu\t%lu\t", time_total);
printf("\nRendering time (ms): \t%lu\n\n", time_total);
I got a result which is obviously not on the scale it should be:
6052843157020279026 140734592538400
Rendering time (ms): 12394041651281810990
Then I used the normal timer provided by sys/time.h and got 0.02 s.
Did I miss something critical in my code, such as a missing synchronization or an invalid initialization? Thanks!

Command queues need to be created with an explicit request to include profiling information, or their profiling commands won't work correctly.
I don't recognize the API you're using to create OpenCL objects, but this is what that would look like in the native OpenCL API:
// OpenCL 1.x (deprecated in OpenCL 2.0)
cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
// OpenCL 2.0+
cl_queue_properties properties[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
cl_command_queue queue = clCreateCommandQueueWithProperties(context, device, properties, &err);

Related

How to make asynchronous read of a file with IOCP?

I have run into an implementation problem: I am puzzled about how to implement IOCP. I have read a lot about it on the Internet, but I am still missing one step.
So far what I have learnt is as follows:
In order to use IOCP:
on an init function:
iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0); // create the port (0 = one concurrent thread per CPU)
handler = CreateFile(filename, GENERIC_READ | GENERIC_WRITE, 0, 0, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, 0);
CreateIoCompletionPort(handler, iocp, 0, 0); // associate my handler with the existing IOCP (passing NULL here would create a new port instead)
In a read function I can do something like this:
ReadFile(..., &Overlapped); // this will return error == ERROR_IO_PENDING, which is what I want: an asynchronous read
Now I have difficulty understanding the next steps. Should I spawn a thread after ReadFile and wait inside that thread until GetQueuedCompletionStatus returns true?
So the answer to my question is here:
https://stackoverflow.com/a/680416/2788176
In very simplistic (and a little over-simplified) terms, you tell the IOCP about the IO jobs you want done. It will perform them asynchronously and maintain a queue of the results of each of those jobs. Your call to tell the IOCP about the job returns immediately (it does not block while the IO happens). You are returned an object that is conceptually like the .NET IAsyncResult ... it lets you block if you choose to, or you can provide a callback, or you can periodically poll to see if the job is complete.
An example IOCP implementation can be found in the Windows SDK.

Are OpenCL kernels executed asynchronously?

For CUDA, I know kernels are executed asynchronously after the launch commands are issued to the default stream (the null stream). How does this work in OpenCL? Sample code is as follows:
cl_context context;
cl_device_id device_id;
cl_int err;
...
cl_kernel kernel1;
cl_kernel kernel2;
cl_command_queue Q = clCreateCommandQueue(context, device_id, 0, &err);
...
size_t global_w_offset[3] = {0,0,0};
size_t global_w_size[3] = {16,16,1};
size_t local_w_size[3] = {16,16,1};
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
clFinish(Q);
Will kernel1 and kernel2 be executed asynchronously after the commands are enqueued (i.e., will their executions overlap)?
Update
According to the OpenCL reference, it seems that setting CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in clCreateCommandQueue can meet my need. But does out-of-order mean asynchronous execution?
Does out-of-order mean asynchronous execution?
An "out of order" queue means kernels may execute in a different order than the one in which they were enqueued (if their event/data dependencies allow it). They may also execute concurrently, but not necessarily.
Also, asynchronous execution is not the same as execution overlap (that is called parallel execution, or concurrency). Asynchronous execution means that kernel code on the device executes independently of host code, which is always true in OpenCL.
The simple way to get concurrency (execution overlap) is to use more than one queue on the same device. This works even on implementations that don't have the out-of-order queue capability. It does not guarantee execution overlap (OpenCL runs on many more kinds of devices than CUDA, and on some devices you simply can't execute more than one kernel at a time), but in my experience you get at least some overlap on most GPUs. You need to be careful about buffers used by kernels in separate queues, though.
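The asynchronous-vs-concurrent distinction above can be illustrated without any OpenCL at all, using std::async as a host-side stand-in (this is only an analogy, not OpenCL API usage): launching a task returns immediately, like an enqueue, and blocking happens later, like clFinish; whether the two tasks actually overlap is up to the runtime, just as with two OpenCL queues.

```cpp
#include <future>

// Host-side analogy only (no OpenCL): "enqueue" two jobs asynchronously,
// then block on completion the way clFinish would.
int square(int x) { return x * x; }

int run_two_tasks() {
    // Launching returns immediately; the host thread is free to continue.
    // The two tasks may (but need not) run concurrently.
    std::future<int> a = std::async(std::launch::async, square, 3);
    std::future<int> b = std::async(std::launch::async, square, 4);
    // get() blocks until each task completes, analogous to clFinish.
    return a.get() + b.get();
}
```

run_two_tasks() returns 25 whether or not the tasks actually overlapped; asynchrony changes when the work happens, not what it computes.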
In your current code:
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
kernel1 finishes first, and then kernel2 is executed.
Using
clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
you can execute multiple different kernels concurrently, though it isn't guaranteed.
Beware, though: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not supported by all OpenCL implementations. It also means that you have no guarantee that kernel1 will finish execution before kernel2 starts; if any objects output by kernel1 are required as input by kernel2, it may fail.
Also, multiple command queues can be created and have commands enqueued on them. The reason for their existence is that the problem you wish to solve might involve some, if not all, of the heterogeneous devices in the host. They can represent independent streams of computation, where no data is shared, or dependent streams of computation, where each subsequent task depends on the previous one (and data is often shared). However, these command queues will execute on the device without synchronization, provided that no data is shared. If data is shared, the programmer needs to ensure it is synchronized using the synchronization commands in the OpenCL specification.

Wait for kernel to finish OpenCL

My OpenCL program doesn't always finish before further host (C++) code is executed. The OpenCL code is only executed up to a certain point (which appears to be random). The code is shortened a bit, so a few things may be missing.
cl::Program::Sources sources;
string code = ResourceLoader::loadFile(filename);
sources.push_back({ code.c_str(),code.length() });
program = cl::Program(OpenCL::context, sources);
if (program.build({ OpenCL::default_device }) != CL_SUCCESS)
{
exit(-1);
}
queue = CommandQueue(OpenCL::context, OpenCL::default_device);
kernel = Kernel(program, "main");
Buffer b(OpenCL::context, CL_MEM_READ_WRITE, size);
queue.enqueueWriteBuffer(b, CL_TRUE, 0, size, arg);
buffers.push_back(b);
kernel.setArg(0, this->buffers[0]);
vector<Event> wait{ Event() };
Version 1:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &wait[0]);
Version 2:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, &wait, NULL);
wait[0].wait();
queue.finish();
Version 1 simply does not wait for the OpenCL program to finish. Version 2 crashes the program (at queue.enqueueNDRangeKernel):
Exception thrown at 0x51D99D09 (nvopencl.dll) in foo.exe: 0xC0000005: Access violation reading location 0x0000002C.
How would one make the host wait for the GPU to finish here?
EDIT: queue.enqueueNDRangeKernel returns -1000, while it returns 0 for a rather small kernel.
Version 1 says to signal wait[0] when the kernel is finished, which is the right thing to do.
Version 2 asks your clEnqueueNDRangeKernel() to wait for the events in wait before it starts that kernel [which clearly won't work].
On its own, queue.finish() [or clFinish()] should be enough to ensure that your kernel has completed.
Since you haven't called clCreateUserEvent, and you haven't passed the event into anything else that initializes it, the second variant doesn't work.
It is rather bad that it crashes [it should return "invalid event" or some such, but presumably the driver you are using has no way to check that the event hasn't been initialized]. I'm reasonably sure the driver I work with would issue an error for this case, but I try to avoid getting it wrong...
I have no idea where -1000 comes from; it is neither a valid error code nor a reasonable return value from the CL C++ wrappers. Whether the kernel is small or large [and/or completes in a short or long time] shouldn't affect the return value from the enqueue, since all that SHOULD do is enqueue the work [with no guarantee that it starts until a queue.flush() or clFlush() is performed]. Waiting for it to finish should happen elsewhere.
I do most of my work via the raw OpenCL API, not the C++ wrappers, which is why I'm referring to what they do, rather than the C++ wrappers.
I faced a similar problem with OpenCL: some packets of a data stream were not processed by OpenCL.
I realized it only happens while the notebook is plugged into a docking station.
Maybe this helps someone.
(There were no clFlush or clFinish calls.)

Measuring execution time of OpenCL kernels

I have the following loop that measures the time of my kernels:
double elapsed = 0;
cl_ulong time_start, time_end;
for (unsigned i = 0; i < NUMBER_OF_ITERATIONS; ++i)
{
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &event); checkErr(err, "Kernel run");
err = clWaitForEvents(1, &event); checkErr(err, "Kernel run wait for event");
err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL); checkErr(err, "Kernel run get time start");
err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL); checkErr(err, "Kernel run get time end");
elapsed += (time_end - time_start);
}
Then I divide elapsed by NUMBER_OF_ITERATIONS to get the final estimate. However, I am afraid the execution time of individual kernels is too small and could introduce uncertainty into my measurement. How can I measure the time taken by all NUMBER_OF_ITERATIONS kernels combined?
Can you also suggest a profiling tool that could help with this, since I do not need to access this data programmatically? I use NVIDIA's OpenCL.
You need to follow these steps to measure the execution time of an OpenCL kernel:
Create a queue; profiling needs to be enabled when the queue is created:
cl_command_queue command_queue;
command_queue = clCreateCommandQueue(context, devices[deviceUsed], CL_QUEUE_PROFILING_ENABLE, &err);
Attach an event when launching the kernel:
cl_event event;
err = clEnqueueNDRangeKernel(queue, kernel, work_dim, NULL, global_work_size, NULL, 0, NULL, &event);
Wait for the kernel to finish:
clWaitForEvents(1, &event);
Wait for all enqueued tasks to finish:
clFinish(queue);
Get the profiling data and calculate the kernel execution time (the OpenCL API returns times in nanoseconds):
cl_ulong time_start;
cl_ulong time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double nanoSeconds = time_end-time_start;
printf("OpenCl Execution time is: %0.3f milliseconds \n",nanoSeconds / 1000000.0);
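To pin down the arithmetic in the last step: the CL_PROFILING_COMMAND_* values are cl_ulong nanosecond timestamps, so their difference divided by 1e6 gives milliseconds. A minimal, OpenCL-free sketch of just that conversion (the timestamp values in the example below are made up for illustration):

```cpp
// Stand-in for cl_ulong, which is a 64-bit unsigned integer.
typedef unsigned long long ns_timestamp;

// Convert a start/end pair of nanosecond timestamps, as returned by
// clGetEventProfilingInfo, to milliseconds. Using double (rather than
// float) preserves precision for long-running kernels.
double elapsed_ms(ns_timestamp start, ns_timestamp end) {
    return (double)(end - start) / 1000000.0;
}
```

For example, elapsed_ms(1000000ULL, 3500000ULL) yields 2.5 ms.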
The profiling function returns nanoseconds and is very accurate (~50 ns); however, execution times vary from run to run, depending on minor issues you can't control.
This reduces your problem to deciding exactly what you want to measure:
Measuring the kernel execution time: your approach is correct, and the accuracy of the measured average execution time will increase as you increase N. This accounts only for the execution time, with no overhead taken into consideration.
Measuring the kernel execution time + overhead: you should use events as well, but measure from CL_PROFILING_COMMAND_SUBMIT, to account for the extra launch overhead.
Measuring the real host-side execution time: you should use events as well, but measure from the first event's start to the last event's end. Using CPU-side timing is another possibility. If you want to measure this, you should remove the clWaitForEvents call from the loop, to allow maximum throughput through the OpenCL system (and as little overhead as possible).
Answering the tools question: I recommend the NVIDIA Visual Profiler. But since it is no longer available for OpenCL, you should use the Visual Studio add-in or an old version (CUDA 3.0) of the nvprofiler.
The measured time is returned in nanoseconds, but you're right: the resolution of the timer is lower. However, I wonder what the actual execution time of your kernel is when you say it is too short to measure accurately (my gut feeling is that the resolution should be in the range of microseconds).
The most appropriate way of measuring the total time of multiple iterations depends on what "multiple" means here. Is NUMBER_OF_ITERATIONS=5 or NUMBER_OF_ITERATIONS=500000? If the number of iterations is "large", you may just use the system clock, possibly with OS-specific functions like QueryPerformanceCounter on Windows (also see, for example, Is there a way to measure time up to micro seconds using C standard library?), but of course the precision of the system clock might be lower than that of the OpenCL device, so whether this makes sense really depends on the number of iterations.
It's a pity that NVIDIA removed OpenCL support from their Visual Profiler, though...
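For the "large number of iterations" case mentioned above, a portable host-side measurement can use std::chrono instead of OS-specific APIs such as QueryPerformanceCounter. A sketch, with an arbitrary callable standing in for "enqueue all kernels, then clFinish(queue)":

```cpp
#include <chrono>

// Time an arbitrary host-side workload in milliseconds. In real use, the
// callable would enqueue NUMBER_OF_ITERATIONS kernels and end with clFinish.
template <typename F>
double time_workload_ms(F work) {
    auto t0 = std::chrono::steady_clock::now();
    work();  // e.g. the enqueue loop plus the final clFinish(queue)
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Divide the result by NUMBER_OF_ITERATIONS for a per-kernel average; this is only meaningful when the total time is much larger than the clock's resolution.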
On Intel's OpenCL GPU implementation I've been successful with your approach (timing per kernel) and prefer it to batching a stream of NDRanges.
An alternative approach is to run it N times and measure the time with marker events, as in the approach proposed in this question (the question, not the answer).
Times for short kernels are generally at least in the microsecond realm, in my experience.
You can check the timer resolution using clGetDeviceInfo with CL_DEVICE_PROFILING_TIMER_RESOLUTION (e.g. 80 ns on my setup).

How to make simple profiling tool for OpenCL programs?

I have a task to build a simple profiling tool (for Windows) for performance/timing/event analysis of OpenCL programs. Can someone give advice on how to start?
The simplest approach, which works accurately on all platforms:
cl_event perfEvent;
cl_ulong start=0, end=0;
float t_kernel;
/* Enqueue kernel */
clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, &perfEvent);
clWaitForEvents( 1, &perfEvent );
/* Get the execution time */
clGetEventProfilingInfo(perfEvent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(perfEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
t_kernel = (end-start)/1000000.0f;
std::cout << t_kernel << std::endl;
Take a look at AMD CodeXL. It's free and it might be just what you are looking for. Inside CodeXL, use the Application Timeline Trace mode (Profile -> Application Timeline Trace), which executes a program and generates a visual timeline that displays OpenCL events like kernel dispatch and data transfer operations.