Are OpenCL kernels executed asynchronously?

Are OpenCL kernels executed asynchronously? - c++

For CUDA, I know they are executed asynchronously after issuing the launch commands to the default stream(null stream), so how about that in OpenCL? Sample codes are as follows:
cl_context context;
cl_device_id device_id;
cl_int err;
...
cl_kernel kernel1;
cl_kernel kernel2;
cl_command_queue Q = clCreateCommandQueue(context, device_id, 0, &err);
...
size_t global_w_offset[3] = {0,0,0};
size_t global_w_size[3] = {16,16,1};
size_t local_w_size[3] = {16,16,1};
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
clFinish(Q);
Will kernel1 and kernel2 be executed asynchronously after commands enqueued?(i.e. executions overlap)
Update
According to the OpenCL Reference, It seems set properties as CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in clCreateCommandQueue can meet my need. But, Does out_of_order mean asynchronization?

Does out_of_order mean asynchronization
"Out of order" queue means kernels may execute in different order than they were queued (if their event/data dependencies allow it). They also may execute concurrently, but not necessarily.
Also, asynchronous execution means something else than execution overlap (that's called parallel execution or concurrency). Asynchronous execution means that kernel code on device executes independently of host code - which is always true in OpenCL.
The simple way to get concurrency (execution overlap) is by using >1 queues on the same device. This works even on implementations which don't have Out-of-order queue capability. It does not guarantee execution overlap (because OpenCL can be used on much more devices than CUDA, and on some devices you simply can't execute >1 kernel at a time), but in my experience with most GPUs you should get at least some overlap. You need to be careful about buffers used by kernels in separate queues, though.

In your current code:
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
kernel1 finishes first and then kernel2 is executed
Using
clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
you can execute multiple different kernels concurrently though it isn't guranteed.
Beware though, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not supported in all OpenCL implementations. This also means that you have no guarantee that kernel1 will finish execution before kernel2. If any objects that are output by kernel1 are required as input in kernel2, it may fail.
Also multiple command queues can be created and enqueued with commands and the reason for their existence is because the problem you wish to solve might involve some, if not all of the heterogeneous devices in the host. And they could represent independent streams of computation where no data is shared, or dependent streams of computation where each subsequent task depends on the previous task (often, data is shared). However, these command queues will execute on the device without synchronization, provided that no data is shared. If data is shared, then the programmer needs to ensure synchronization of the data using synchronization commands in the OpenCL specification.

Related

Does vkQueuePresentKHR prevent later commands from executing while it is waiting on the semaphore?

This is kind of a follow-up question for this question, and it is also based on the code provided by the same Vulkan tutorial.
Here is a simplified example:
// Vulkan handles defined and initialized elsewhere
VkDevice device;
VkQueue queue;
VkSempahore semaphore;
VkSwapchain swapchain;
VkCommandBuffer cmd_buffer;
// Renderer code
uint32_t image_index; // image acquisition is omitted
VkPresentInfoKHR present_info{};
present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present_info.waitSemaphoreCount = 1;
present_info.pWaitSemaphores = &semaphore;
present_info.swapchainCount = 1;
present_info.pSwapchains = &swapchain;
present_info.pImageIndices = &image_index;
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
// ... irrelevant code omitted
submit_info.pCommandBuffers = &cmd_buffer;
vkQueuePresentKHR(queue, &present_info);
vkQueueSubmit(queue, 1, &submit_info, VK_NULL_HANDLE);
In the above example, will the commands in cmd_buffer also have to wait until semaphore is signaled?
I am asking about this because a comment below the tutorial mentioned that:
However, if the graphics and present queue do end up being the same, then the renderFinished semaphore guarantees proper execution ordering. This is because the vkQueuePresentKHR command waits on that semaphore and it must begin before later commands in the queue begin (due to implicit ordering) and that only happens after rendering from the previous frame finished.

In the above example, will the commands in cmd_buffer also have to wait until semaphore is signaled?
Only if you use the semaphore as a waitSemaphore for the later submit.
This is because the vkQueuePresentKHR command waits on that semaphore and it must begin before later commands in the queue begin (due to implicit ordering) and that only happens after rendering from the previous frame finished.
I don't believe this is true.
Commands start in implicit order with respect to other commands in the queue, but this is pipelined on a stage-to-stage basis. Also note the spec wording says "start in order" not "complete in order", which is a specification sleight of hand. Hardware is perfectly free to overlap and out-of-order execution of individual commands that are otherwise sequential in the stream unless the stream contains synchronization primitives that stop it doing so.

Which way to synchronize vkQueueSubmit() to use?

I have a function that copies data from one buffer to another, I need to synchronize its execution.
I have such a bad option:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
//Start recording
vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);
//Run command buffer
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
//Waiting for completion
vkQueueWaitIdle(graphicsQueue);
vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
}
This option is bad because if I want to execute the copyBuffer() function several times, then all the buffers will be copied strictly one at a time.
I want to use a fence for each function call so that multiple calls can run in parallel.
So far, I have only such a solution:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
//Create fence
VkFenceCreateInfo fenceInfo{};
fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
fenceInfo.flags = VK_FENCE_CREATE_SIGNALED_BIT;
VkFence executionCompleteFence = VK_NULL_HANDLE;
if (vkCreateFence(logicalDevice, &fenceInfo, VK_NULL_HANDLE, &executionCompleteFence) != VK_SUCCESS) {
throw MakeErrorInfo("Failed to create fence");
}
//Start recording
vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);
//Run command buffer
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
vkWaitForFences(logicalDevice, 1, &executionCompleteFence, VK_TRUE, UINT64_MAX);
vkResetFences(logicalDevice, 1, &executionCompleteFence);
vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
vkDestroyFence(logicalDevice, executionCompleteFence, VK_NULL_HANDLE);
}
Which of these options is better?
Is the second option written correctly?

Both functions are bad in the same way. They both block the CPU from doing anything until the transfer is done. And they both could be used to potentially submit multiple CBs to the same queue in the same frame, but with different submit commands.
Neither is desirable if performance is something you care about.
Ultimately, what you need to do is have your copyBuffer function not actually perform the copy. You should have a function which builds a command buffer to do a copy. That CB is then stored in a place to be submitted later with other copying CBs. Or better yet, you can have just one copying CB that each command adds to (the first one called in a frame will create the CB).
At some point in the future, before you've submitted the work that will use this data, you need to submit the transfer operations. And the way this works depends on if you're submitting the transfer operations on the same queue as the work that will consume them or not.
If they're on the same queue, then all you need to do is have an event in a command buffer at the end of your batch that synchronizes the transfer operations with their receivers. If you want to be more clever, each transfer operation can have its own event, which the receiving operations will wait on.
And in same-queue transfers, you also want to make sure that you submit the transfers in the same vkQueueSubmit call as the rest of your work. Or to put it another way, you should never make more than one call to vkQueueSubmit for a particular queue in a particular frame.
If you're dealing with separate queues, then things change. A bit. If timeline semaphores aren't an option, you'll need to submit your transfer work before you submit the receiving operations. This is because the transfer batch will need to signal a semaphore that the receiving operation will wait on. And a binary semaphore cannot be waited on until the operation that signals it has been submitted to a queue.
But otherwise, everything else stays the same. Of course, you don't need events since you're synchronizing by semaphore.

The two functions are semantically identical and do exactly the same blocking behavior.
The second is slightly better. vkQueueWaitIdle is kind of a debug and out-of-hotspot feature. It might incur a hidden second submit to signal the implicit fence.
You don't need to reset fence that you subsequently destroy anyway. And you are creating it presignaled, which is a bug. Also you forgot to pass it to the vkQueueSubmit.

Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan?

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but still got trouble to understand how I should implement a simple render-loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() again even though the signal request from the previous call hasn't fired yet? I would think it's not, but, on the other hand, we're using the same queues (don't know if it matters where the graphics and presentation queue are actually the same) and operations inside a queue should be sequentially consumed ... But if I got it right, they might not be consumed "as a whole" and could be reordered ...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
std::uint32_t image_index;
switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
{
case VK_SUBOPTIMAL_KHR:
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkAcquireNextImageKHR");
}
static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.waitSemaphoreCount = 1;
submit_info.pWaitSemaphores = &m_image_available;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores = &m_rendering_finished;
submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
throw std::runtime_error("vkQueueSubmit");
VkPresentInfoKHR present_info{};
present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present_info.waitSemaphoreCount = 1;
present_info.pWaitSemaphores = &m_rendering_finished;
present_info.swapchainCount = 1;
present_info.pSwapchains = &swap_chain().handle();
present_info.pImageIndices = &image_index;
switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
{
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
case VK_SUBOPTIMAL_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkQueuePresentKHR");
}
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
In the same way, if j = a, then i = a and the ath frame in flight is using swap chain image a, for a= 2, 3
Now, if j = 3, then i = 3, but since the swap chain image only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls of vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question which remains is why I shouldn't take k as large as possible? What I theoretically could imagine is that if we are submitting work to the GPU in a quicker fashion than the GPU is able to consume, the used queue(s) might continually grow and eventually overflow (is there some kind of "max commands in queue" limit?).

If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphores have already been signalled before you use them again.
The specifications says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: You need to use a fence for that which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work to the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.

So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. This is here used to synchronize the rendering commands with the present command such that the presenting engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then on the CPU side wait for it before you want to proceed. This is usually done here such that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering of the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command of the queue or device.
However, in practice you do not want to stall the CPU and in the meantime record commands for the next frame. This is done via frames in flight. This simply means that for every frame in flight (i.e. number of frames that can be recorded in parallel to the execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So in essence for your render loop to work properly you need a pair of semaphores + fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and image index (swapchain) will generally not be the same except you use the same amount of swapchain images as frames in flight. This is because the presenting engine might give you swapchain images out of order depending on your presenting mode.

Wait for kernel to finish OpenCL

My OpenCL program doesn't always finish before further host (c++) code is executed. The OpenCL code is only executed up to a certain point (which apperears to be random). The code is shortened a bit, so there may be a few things missing.
cl::Program::Sources sources;
string code = ResourceLoader::loadFile(filename);
sources.push_back({ code.c_str(),code.length() });
program = cl::Program(OpenCL::context, sources);
if (program.build({ OpenCL::default_device }) != CL_SUCCESS)
{
exit(-1);
}
queue = CommandQueue(OpenCL::context, OpenCL::default_device);
kernel = Kernel(program, "main");
Buffer b(OpenCL::context, CL_MEM_READ_WRITE, size);
queue.enqueueWriteBuffer(b, CL_TRUE, 0, size, arg);
buffers.push_back(b);
kernel.setArg(0, this->buffers[0]);
vector<Event> wait{ Event() };
Version 1:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &wait[0]);
Version 2:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, &wait, NULL);
.
wait[0].wait();
queue.finish();
Version 1 just does not wait for the OpenCL program. Version 2 crashes the program (at queue.enqueueNDRangeKernel):
Exception thrown at 0x51D99D09 (nvopencl.dll) in foo.exe: 0xC0000005: Access violation reading location 0x0000002C.
How would one make the host wait for the GPU to finish here?
EDIT: queue.enqueueNDRangeKernel returns -1000. While it returns 0 on a rather small kernel

Version 1 says to signal wait[0] when the kernel is finished - which is the right thing to do.
Version 2 is asking your clEnqueueNDRangeKernel() to wait for the events in wait before it starts that kernel [which clearly won't work].
On it's own, queue.finish() [or clFinish()] should be enough to ensure that your kernel has completed.
Since you haven'd done clCreateUserEvent, and you haven't passed it into anything else that initializes the event, the second variant doesn't work.
It is rather bad that it crashes [it should return "invalid event" or some such - but presumably the driver you are using doesn't have a way to check that the event hasn't been initialized]. I'm reasonably sure the driver I work with will issue an error for this case - but I try to avoid getting it wrong...
I have no idea where -1000 comes from - it is neither a valid error code, nor a reasonable return value from the CL C++ wrappers. Whether the kernel is small or large [and/or completes in short or long time] shouldn't affect the return value from the enqueue, since all that SHOULD do is to enqueue the work [with no guarantee that it starts until a queue.flush() or clFlush is performed]. Waiting for it to finish should happen elsewhere.
I do most of my work via the raw OpenCL API, not the C++ wrappers, which is why I'm referring to what they do, rather than the C++ wrappers.

I faced a similar problem with OpenCL that some packages of a data stream we're not processed by OpenCL.
I realized it just happens while the notebook is plugged into a docking station.
Maybe this helps s.o.
(No clFlush or clFinish calls)

Send same data to multiples kernel in OpenCL

I have multiple kernels,in the first of them i send some entries, the output I have from the first kernel is the input for the next. My queue of kernels repeat this behavior 8 times until the last kernel that sends me the real output what I need.
This is an example of what i did:
cl::Kernel kernel1 = cl::Kernel(OPETCL::program, "forward");
//agrego los argumetnos del kernel
kernel1.setArg(0, cl_rowCol);
kernel1.setArg(1, cl_data);
kernel1.setArg(2, cl_x);
kernel1.setArg(3, cl_b);
kernel1.setArg(4, sizeof(int), &largo);
//ejecuto el kernel
OPETCL::queue.enqueueNDRangeKernel(kernel1, cl::NullRange, global, local, NULL, &profilingApp);
/********************************/
/** ejecuto las simetrias de X **/
/********************************/
cl::Kernel kernel2 = cl::Kernel(OPETCL::program, "forward_symmX");
//agrego los argumetnos del kernel
kernel2.setArg(0, cl_rowCol);
kernel2.setArg(1, cl_data);
kernel2.setArg(2, cl_x);
kernel2.setArg(3, cl_b);
kernel2.setArg(4, cl_symmLOR_X);
kernel2.setArg(5, cl_symm_Xpixel);
kernel2.setArg(6, sizeof(int), &largo);
//ejecuto el kernel
OPETCL::queue.enqueueNDRangeKernel(kernel2, cl::NullRange, global, local, NULL, &profilingApp);
OPETCL::queue.finish();
OPETCL::queue.enqueueReadBuffer(cl_b, CL_TRUE, 0, sizeof(float) * lors, b, NULL, NULL);
In this case cl_b is the output what i need.
My question is if the arguments i send to kernels are the same to all kernel, but only one is different.
Is correct what i did to set arguments??
The arguments are keeping in the device during the all kernels execution??

Since you are using the same queue and OpenCL-context this is OK and your kernels can use the data (arguments) calculated by previous kernel and the data will be kept on the device.
I suggest you to use clFinish after each kernel execution to assure the previous kernel finished the calculation, before next one starts. Alternatively, you can use events, to assure that.

I think you get this behaviour for free, as long as you don't specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when you create your command queue.
It looks like you're doing it correctly. In general, this is the process:
create your buffer(s)
queue a buffer copy to the device
queue the kernel execution
repeat #3 for as many kernels as you need to run, passing the buffer as the correct parameter. Use setArg to change/add params. The buffer will still exist on the device -- and modified by the previous kernels
queue a copy of the buffer back to the host
If you do specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, you will have to use events to control the execution order of the kernels. This seems unnecessary for your example though.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js