OpenCL / OpenGL Implicit Synchronization on AMD Tahiti - c++

I'm having a problem with the "implicit synchronization" of OpenCL and OpenGL on an AMD Tahiti (AMD Radeon HD 7900 Series) device. The device has the CL/GL interop extensions cl_khr_gl_sharing and cl_khr_gl_event.
When I run the program, which is just a simple VBO-update kernel whose result is drawn as a white line with a simple shader, it hiccups badly, stalling for what looks to be 2-4 frames on every update. I can confirm that it isn't the CL kernel or GL shader I'm using to update and draw the buffer, because if I put glFinish() and commandQueue.finish() before and after the acquire and release of the GL objects for the CL update, everything works as it should.
So, I figured that I needed to "enable" the event extension...
#pragma OPENCL EXTENSION cl_khr_gl_event : enable
...in the CL program, but that throws errors. I assume this isn't a client-facing extension and is supposed to just work as "expected", which is why I can't enable it.
The third behavior I noticed: if I take out the glFinish() and commandQueue.finish() and run it under CodeXL debug, the implicit synchronization works. That is, without any changes to the code base, like forcing synchronization with finish, CodeXL allows for implicit synchronization. So implicit sync clearly works, but I can't get it to work by just running the application regularly through Visual Studio without forcing synchronization.
Clearly I'm missing something, but I honestly can't see it. Any thoughts or explanations would be greatly appreciated, as I'd love to keep the synchronization implicit.

I'm guessing you're not using the GLsync/cl_event synchronization (the GL_ARB_cl_event and cl_khr_gl_event extensions), which is why adding clFinish/glFinish and the overhead from CodeXL are helping.
My guess is your code looks like:
A1. clEnqueueNDRangeKernel
A2. clEnqueueReleaseGLObjects
[here is where you inserted clFinish]
B1. glDraw*
B2. wgl/glXSwapBuffers
[here is where you inserted glFinish]
C1. clEnqueueAcquireGLObjects
[repeat from A1]
Instead, you should:
CL->GL synchro: have clEnqueueReleaseGLObjects create an (output) event to be passed to glCreateSyncFromCLeventARB, then use glWaitSync (NOT glClientWaitSync, which in this case would be the same as clFinish).
GL->CL synchro: have clEnqueueAcquireGLObjects take an (input) event, which will be created with clCreateEventFromGLsyncKHR, taking a sync object from glFenceSync.
Overall, it should be:
A1. clEnqueueNDRangeKernel
[Option 1.1:]
A2. clEnqueueReleaseGLObjects(..., 0, NULL, &eve1)
[Option 1.2:]
A2. clEnqueueReleaseGLObjects(..., 0, NULL, NULL)
A2'. clEnqueueMarker(&eve1)
A3. sync1 = glCreateSyncFromCLeventARB(eve1)
* clReleaseEvent(eve1)
A4. glWaitSync(sync1)
* glDeleteSync(sync1)
B1. glDraw*
B2. wgl/glXSwapBuffers
B3. sync2 = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
B4. eve2 = clCreateEventFromGLsyncKHR(sync2)
* glDeleteSync(sync2)
[Option 2.1:]
C1. clEnqueueAcquireGLObjects(..., 1, &eve2, NULL)
* clReleaseEvent(eve2)
[Option 2.2:]
B5. clEnqueueWaitForEvents(1, &eve2)
* clReleaseEvent(eve2)
C1. clEnqueueAcquireGLObjects(..., 0, NULL, NULL)
[Repeat from A1]
(Options 1.2 / 2.2 are better if you don't know exactly, in advance, which enqueue will be the last one before handing control over to the other API.)
As a side note, I assumed you're not using an out-of-order queue for OpenCL (there really shouldn't be a need for one in this case); if you did, you would of course also have to synchronize clEnqueueAcquireGLObjects -> clEnqueueNDRangeKernel -> clEnqueueReleaseGLObjects.
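Put together as code, one frame could look roughly like this. This is only a sketch: the queue/kernel/buffer names are made up, error handling is omitted, and you have to fetch glCreateSyncFromCLeventARB / clCreateEventFromGLsyncKHR through your extension loader first:
cl_event eve1, eve2;
GLsync sync1, sync2;
size_t gws = vertex_count;                    // illustrative global work size

// A. update the shared VBO in CL
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
clEnqueueReleaseGLObjects(queue, 1, &cl_vbo, 0, NULL, &eve1);
clFlush(queue);                               // make sure the CL work is submitted

// CL -> GL handover: GPU-side wait, no CPU stall
sync1 = glCreateSyncFromCLeventARB(cl_ctx, eve1, 0);
clReleaseEvent(eve1);
glWaitSync(sync1, 0, GL_TIMEOUT_IGNORED);     // NOT glClientWaitSync
glDeleteSync(sync1);

// B. draw and present with GL
glDrawArrays(GL_LINE_STRIP, 0, vertex_count);
// wgl/glXSwapBuffers(...)
sync2 = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();                                    // make sure the fence is submitted

// GL -> CL handover
eve2 = clCreateEventFromGLsyncKHR(cl_ctx, sync2, NULL);
clEnqueueAcquireGLObjects(queue, 1, &cl_vbo, 1, &eve2, NULL);
clReleaseEvent(eve2);
glDeleteSync(sync2);
// repeat from A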

Related

Strange OpenXR Behaviour on xrEndFrame

Firstly: Context is VR with HP Reverb G2, WMR runtime, DX12.
We're seeing some unexplained behaviour across developer machines when working with OpenXR. It looks as though the OpenXR runtime is changing the way it presents depending on the machine's setting for preferred GPU.
More specifically, we noticed that depending on the machine's setting for preferred GPU, a different method is used when xrEndFrame is called. This is a big deal, as the different method results in a blank screen being drawn into our current renderTarget!
The difference is that when the preferred device is an Nvidia GPU, xrEndFrame looks like this in PIX (in a graphics queue that is separate from our main render):
Index Global ID Name EOP to EOP Duration (ns) Execution Start Time (ns)
2 8063 Signal(pFence:obj#20, Value:62)
3 8064 Wait(pFence:obj#36, Value:31)
5 8065 CopyTextureRegion(pDst:{pResource:obj#4083, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:0}, DstX:0, DstY:0, DstZ:0, pSrc:{pResource:obj#4084, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:0}, pSrcBox:{left:0, top:0, front:0, right:2088, bottom:2036, back:1})
6 8066 CopyTextureRegion(pDst:{pResource:obj#4083, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:1}, DstX:0, DstY:0, DstZ:0, pSrc:{pResource:obj#4085, Type:D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX, SubresourceIndex:0}, pSrcBox:{left:0, top:0, front:0, right:2088, bottom:2036, back:1})
8 8067 Signal(pFence:obj#20, Value:63)
9 8068 Signal(pFence:obj#21, Value:31)
and when it isn't (i.e. somehow maybe picking up the Intel onboard GPU?), it looks like this:
Index Global ID Name EOP to EOP Duration (ns) Execution Start Time (ns)
0 8064 Wait(pFence:obj#45, Value:21)
2 8065 ClearRenderTargetView(RenderTargetView:res#4008, ColorRGBA:{Element:0, Element:0, Element:0, Element:0}, NumRects:0, pRects:nullptr)
15 8066 DrawIndexedInstanced(IndexCountPerInstance:4, InstanceCount:2, StartIndexLocation:0, BaseVertexLocation:0, StartInstanceLocation:0)
17 8067 Signal(pFence:obj#22, Value:23)
18 8068 Signal(pFence:obj#23, Value:21)
The latter is clearing the current renderTargetView and drawing a quad over the top that is the dimensions of the headset display.
Yet we've checked the rendering code, and it is definitely not selecting the Intel graphics device. However, the second behaviour goes away if we set 'preferred graphics processor' to the Nvidia GPU in the Nvidia Control Panel.
We can also see that the above behaviour is the result of a call to xrEndFrame, and that our rendering code is otherwise identical.
Any clue as to what part of the runtime might be looking at or influenced by this setting?
Unfortunately (fortuitously?) we found we would need to rework the rendering code before we could swap runtimes to, say, SteamVR, so right now we can't swap out the runtime.
Obviously we have a workaround, which is to set the preferred device. But understanding how/why this issue is occurring would be great.
So this was finally tracked down to an error on our part.
In our case we were using xrGetD3D12GraphicsRequirementsKHR to get the minimum graphics requirements for OpenXR.
The structure it fills in, XrGraphicsRequirementsD3D12KHR, has an adapterLuid identifier which we should have been using to select the GPU in the graphics API, but weren't.
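For anyone hitting the same thing, a rough sketch of the fix (instance/systemId come from your own OpenXR setup, ComPtr is Microsoft::WRL::ComPtr, and error handling is omitted):
// xrGetD3D12GraphicsRequirementsKHR is an extension function, so load it first.
PFN_xrGetD3D12GraphicsRequirementsKHR xrGetD3D12GraphicsRequirementsKHR = nullptr;
xrGetInstanceProcAddr(instance, "xrGetD3D12GraphicsRequirementsKHR",
    reinterpret_cast<PFN_xrVoidFunction*>(&xrGetD3D12GraphicsRequirementsKHR));

XrGraphicsRequirementsD3D12KHR reqs{XR_TYPE_GRAPHICS_REQUIREMENTS_D3D12_KHR};
xrGetD3D12GraphicsRequirementsKHR(instance, systemId, &reqs);

// Create the D3D12 device on exactly the adapter OpenXR reported,
// instead of whatever adapter the system considers "preferred".
ComPtr<IDXGIFactory4> factory;
CreateDXGIFactory2(0, IID_PPV_ARGS(&factory));
ComPtr<IDXGIAdapter1> adapter;
factory->EnumAdapterByLuid(reqs.adapterLuid, IID_PPV_ARGS(&adapter));
ComPtr<ID3D12Device> device;
D3D12CreateDevice(adapter.Get(), reqs.minFeatureLevel, IID_PPV_ARGS(&device));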

Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan?

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but still have trouble understanding how I should implement a simple render loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() even though the signal request from the previous call hasn't fired yet? I would think it's not, but, on the other hand, we're using the same queues (I don't know whether it matters when the graphics and presentation queues are actually the same) and operations inside a queue should be consumed sequentially... But if I got it right, they might not be consumed "as a whole" and could be reordered...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
    throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
    std::uint32_t image_index;
    switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
        std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
    {
    case VK_SUBOPTIMAL_KHR:
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkAcquireNextImageKHR");
    }

    static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submit_info{};
    submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit_info.waitSemaphoreCount = 1;
    submit_info.pWaitSemaphores = &m_image_available;
    submit_info.signalSemaphoreCount = 1;
    submit_info.pSignalSemaphores = &m_rendering_finished;
    submit_info.pWaitDstStageMask = &wait_destination_stage_mask;

    if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
        throw std::runtime_error("vkQueueSubmit");

    VkPresentInfoKHR present_info{};
    present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present_info.waitSemaphoreCount = 1;
    present_info.pWaitSemaphores = &m_rendering_finished;
    present_info.swapchainCount = 1;
    present_info.pSwapchains = &swap_chain().handle();
    present_info.pImageIndices = &image_index;

    switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
    {
    case VK_SUCCESS:
        break;
    case VK_ERROR_OUT_OF_DATE_KHR:
    case VK_SUBOPTIMAL_KHR:
        on_resized();
        return;
    default:
        throw std::runtime_error("vkQueuePresentKHR");
    }
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
In the same way, if j = a, then i = a and the (a+1)th frame in flight is using swap chain image a, for a = 1, 2
Now, if j = 3, then i = 3, but since the swap chain only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls of vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question which remains is why I shouldn't take k as large as possible? What I theoretically could imagine is that if we are submitting work to the GPU in a quicker fashion than the GPU is able to consume, the used queue(s) might continually grow and eventually overflow (is there some kind of "max commands in queue" limit?).
If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphore operations from the previous frame have already completed before you use the semaphores again.
The specification says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: You need to use a fence for that which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work to the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.
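As a rough sketch (the array names follow the EDIT above; everything else is illustrative, not your actual code base), the beginning of render() would then become:
static std::size_t constexpr max_frames_in_flight = 2;
std::size_t current_frame = 0;   // advanced at the end of every render()

void render()
{
    // Wait until the submit that last used this slot has finished. This also
    // guarantees the slot's semaphores are unsignalled and safe to reuse.
    vkWaitForFences(device(), 1, &m_fence[current_frame], VK_TRUE,
        std::numeric_limits<std::uint64_t>::max());
    vkResetFences(device(), 1, &m_fence[current_frame]);

    std::uint32_t image_index;
    vkAcquireNextImageKHR(device(), swap_chain().handle(),
        std::numeric_limits<std::uint64_t>::max(),
        m_image_available[current_frame], VK_NULL_HANDLE, &image_index);

    // ... submit and present exactly as before, but with
    // m_image_available[current_frame], m_rendering_finished[current_frame],
    // and m_fence[current_frame] passed as the last argument of vkQueueSubmit ...

    current_frame = (current_frame + 1) % max_frames_in_flight;
}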
So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. This is used here to synchronize the rendering commands with the present command, such that the presentation engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then on the CPU side wait for it before you want to proceed. This is usually done here such that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence with a waitIdle command on the queue or device.
However, in practice you do not want to stall the CPU; you want to record commands for the next frame in the meantime. This is done via frames in flight. This simply means that for every frame in flight (i.e. every frame that can be recorded in parallel with execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So in essence, for your render loop to work properly you need a pair of semaphores plus a fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and the image index (swapchain) will generally not be the same, unless you use the same number of swapchain images as frames in flight. This is because the presentation engine might give you swapchain images out of order, depending on your present mode.
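Because of that mismatch, the Vulkan Tutorial additionally remembers which frame's fence last used each swapchain image and waits on it before reusing that image. A minimal sketch (again with illustrative names):
std::vector<VkFence> image_in_flight(swapchain_image_count, VK_NULL_HANDLE);

// right after vkAcquireNextImageKHR has returned image_index:
if (image_in_flight[image_index] != VK_NULL_HANDLE)
    vkWaitForFences(device(), 1, &image_in_flight[image_index], VK_TRUE,
        std::numeric_limits<std::uint64_t>::max());
image_in_flight[image_index] = m_fence[current_frame];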

WSI synchronization subpass dependency and link to color attachment output

I think I understand how Vulkan synchronization works, but I have a problem understanding the synchronization with the WSI.
In the Synchronization Examples, we can find this code:
/* Only need a dependency coming in to ensure that the first
layout transition happens at the right time.
Second external dependency is implied by having a different
finalLayout and subpass layout. */
VkSubpassDependency dependency = {
.srcSubpass = VK_SUBPASS_EXTERNAL,
.dstSubpass = 0,
// .srcStageMask needs to be a part of pWaitDstStageMask in the WSI semaphore.
.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
.srcAccessMask = 0,
.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
.dependencyFlags = 0};
According to me, it should be something like that :
VkSubpassDependency dependency = {
.srcSubpass = VK_SUBPASS_EXTERNAL,
.dstSubpass = 0,
// .srcStageMask needs to be a part of pWaitDstStageMask in the WSI semaphore.
.srcStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
.srcAccessMask = 0,
.dstAccessMask = 0,
.dependencyFlags = 0};
Indeed, since we are going to WRITE into the attachment, there is no need to use a WRITE_BIT (meaning make writes available) in the dstAccessMask.
But the true issue is in the srcStageMask.
I understand why the dstStageMask is COLOR_ATTACHMENT_OUTPUT_BIT. It is because it is okay to have the prior stages working since we don't touch the attachment.
However, for the srcStageMask, I did not see anything about the link between WSI and the COLOR_ATTACHMENT_OUTPUT_BIT. To me, the layout transition must appear at the end of the presentation, and just before the beginning of the COLOR_ATTACHMENT_OUTPUT stage.
And the end of presentation, for me, should be represented by BOTTOM_OF_PIPE and not COLOR_ATTACHMENT_OUTPUT
Where am I mistaken?
Indeed, since we are going to WRITE into the attachment, there is no need to use a WRITE_BIT (meaning make writes available) in the dstAccessMask.
I am not exactly sure of your logic here. It should be WRITE exactly because we are going to write to the attachment.
It would perhaps help describe what is happening here (from krOoze/Hello_Triangle/doc):
The VkSubpassDependency chains off the semaphore (via pWaitDstStageMask). Then it performs its layout transition. Then it syncs with the Load Op. (Then the Load Op happens in the subpass. And the subpass vkCmdDraw()s some stuff into the swapchain image.)
Now presumably (as is typical for the first use of the swapchain image) our Load Op is VK_ATTACHMENT_LOAD_OP_CLEAR. That means VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT + VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT access.
So:
Presentation is gonna read the swapchain image, and submit a semaphore signal.
We chain our VkSubpassDependency to the signal (via pWaitDstStageMask). Semaphore signal already covers all memory accesses, therefore our srcAccessMask = 0.
The Dependency performs its implicit dependency to the Layout Transition (takes your src, and invents some internal dst), then Layout Transition happens, which reads the image and writes it back, then the dependency performs another implicit Dependency (invents some src that covers the layout transition, and uses your dst).
The Load Op happens in the subpass, and you have to make sure explicitly it happens-after everything above. So your dst in your Dependency must be VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT + VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT to cover the Load Op.
Now, there are three types of memory hazards: read→write, write→write, and write→read (read→read is a non-hazard). In a write→write hazard, something writes a resource while some other write also writes it. It is not clear which write happens last, so the memory contents become garbage. In read→write and write→read hazards, the read and the write may happen at the same time. It is not clear whether the read sees the unmodified memory or the written one (i.e. it may read garbage).
Without an explicit dependency, the Layout Transition and the subsequent Load Op form a write→write hazard. Therefore dstAccessMask must not be 0 to resolve the hazard and make sure one write happens-before the second one.
(It is perhaps worth noting we introduce the VkSubpassDependency solely for the sake of the layout transition. Otherwise the semaphore wait would already be all that is needed. The default is srcStageMask = TOP_OF_PIPE, which means without explicit Dependency the layout transition could happen before the semaphore wait, i.e. before presentation finishes reading it; that would be a read→write hazard).
To me, the layout transition must appear at the end of the presentation, and just before the beginning of the COLOR_ATTACHMENT_OUTPUT stage. And the end of presentation, for me, should be represented by BOTTOM_OF_PIPE and not COLOR_ATTACHMENT_OUTPUT
We have a little bit of choice here: pWaitDstStageMask = dependency.srcStageMask = ?
Now our situation is this:
vkBeginCommandBuffer();
[possibly vkCmdDispatch()?]
vkBeginRenderPass();
vkCmdDraw(); // does vertex shading + fragment shading, then color writes
vkEndRenderPass();
vkEndCommandBuffer();
vkQueueSubmit(.pWaitDstStageMask = ?);
If we use VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, then the semaphore wait does not block starting the hypothetical vkCmdDispatch() (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT). And the Subpass Dependency src does not force it to finish either. Great!
The Stage used should not be any earlier than VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT. E.g. with VK_PIPELINE_STAGE_ALL_COMMANDS_BIT the semaphore wait would unnecessarily block the vkCmdDispatch() as well as the vertex and fragment shaders of the vkCmdDraw(). Meanwhile, we need the swapchain image only when we actually need to write it at the Load Op (which you can imagine happens when the vkCmdDraw() gets to the point where it needs to perform color writes).
Now, we could choose VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT. The semaphore blocks nothing (dstStage/pWaitDstStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT means the same as "nothing"). Great! But the Subpass Dependency now blocks everything (srcStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT means the same as VK_PIPELINE_STAGE_ALL_COMMANDS_BIT). That means our vkCmdDispatch() has to finish before our subpass starts. Not so great...
So, the best practice is to use pWaitDstStageMask = dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT.
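For illustration, here is how the two halves meet at the submit (a sketch with placeholder handles). The key point is that the same stage flag appears both as the semaphore's wait stage and as the dependency's srcStageMask, so the dependency chains off the semaphore wait:
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

VkSubmitInfo submit{};
submit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.waitSemaphoreCount = 1;
submit.pWaitSemaphores    = &image_available;  // signalled by vkAcquireNextImageKHR
submit.pWaitDstStageMask  = &wait_stage;       // matches dependency.srcStageMask
submit.commandBufferCount = 1;
submit.pCommandBuffers    = &cmd;
vkQueueSubmit(queue, 1, &submit, fence);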

Wait for kernel to finish OpenCL

My OpenCL program doesn't always finish before further host (C++) code is executed. The OpenCL code is only executed up to a certain point (which appears to be random). The code is shortened a bit, so there may be a few things missing.
cl::Program::Sources sources;
string code = ResourceLoader::loadFile(filename);
sources.push_back({ code.c_str(),code.length() });
program = cl::Program(OpenCL::context, sources);
if (program.build({ OpenCL::default_device }) != CL_SUCCESS)
{
    exit(-1);
}
queue = CommandQueue(OpenCL::context, OpenCL::default_device);
kernel = Kernel(program, "main");
Buffer b(OpenCL::context, CL_MEM_READ_WRITE, size);
queue.enqueueWriteBuffer(b, CL_TRUE, 0, size, arg);
buffers.push_back(b);
kernel.setArg(0, this->buffers[0]);
vector<Event> wait{ Event() };
Version 1:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &wait[0]);
Version 2:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, &wait, NULL);
followed in both versions by:
wait[0].wait();
queue.finish();
Version 1 just does not wait for the OpenCL program. Version 2 crashes the program (at queue.enqueueNDRangeKernel):
Exception thrown at 0x51D99D09 (nvopencl.dll) in foo.exe: 0xC0000005: Access violation reading location 0x0000002C.
How would one make the host wait for the GPU to finish here?
EDIT: queue.enqueueNDRangeKernel returns -1000, while it returns 0 for a rather small kernel.
Version 1 says to signal wait[0] when the kernel is finished - which is the right thing to do.
Version 2 is asking your clEnqueueNDRangeKernel() to wait for the events in wait before it starts that kernel [which clearly won't work].
On its own, queue.finish() [or clFinish()] should be enough to ensure that your kernel has completed.
Since you haven't done clCreateUserEvent, and you haven't passed it into anything else that initializes the event, the second variant doesn't work.
It is rather bad that it crashes [it should return "invalid event" or some such - but presumably the driver you are using doesn't have a way to check that the event hasn't been initialized]. I'm reasonably sure the driver I work with will issue an error for this case - but I try to avoid getting it wrong...
I have no idea where -1000 comes from - it is neither a valid error code, nor a reasonable return value from the CL C++ wrappers. Whether the kernel is small or large [and/or completes in short or long time] shouldn't affect the return value from the enqueue, since all that SHOULD do is to enqueue the work [with no guarantee that it starts until a queue.flush() or clFlush is performed]. Waiting for it to finish should happen elsewhere.
I do most of my work via the raw OpenCL API, not the C++ wrappers, which is why I'm referring to what the raw functions do rather than their C++ equivalents.
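For reference, Version 1 with error checking added, using the same C++ wrapper names as the question (a sketch; adjust to your setup):
Event done;
cl_int err = queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, nullptr, &done);
if (err != CL_SUCCESS)
    throw std::runtime_error("enqueueNDRangeKernel failed: " + std::to_string(err));
done.wait();       // blocks the host until this kernel has finished
// or, equivalently for a single in-order queue:
queue.finish();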
I faced a similar problem with OpenCL, where some packets of a data stream were not processed by OpenCL.
I realized it only happens while the notebook is plugged into a docking station.
Maybe this helps someone.
(No clFlush or clFinish calls.)

DirectX12 - ExecuteCommandLists and Present function

I found that in the Microsoft sample example:
void D3D12HelloTriangle::OnRender()
{
    // Record all the commands we need to render the scene into the command list.
    PopulateCommandList();

    // Execute the command list.
    ID3D12CommandList* ppCommandLists[] = { m_commandList.Get() };
    m_commandQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

    // Present the frame.
    ThrowIfFailed(m_swapChain->Present(1, 0));

    WaitForPreviousFrame();
}
How does this actually work? ExecuteCommandLists is an asynchronous function call, so code execution continues and hits the Present function.
What happens after the Present call? Let's say the GPU is still drawing and working when Present is called. Is Present a synchronous call? It cannot present the buffer while the GPU is still drawing, correct? Could someone explain what's happening here?
Present is also an asynchronous command that tells the GPU to start scanning out (displaying) from the next buffer in the swap chain. You don't have to worry about the GPU not having finished executing all previously issued work (on the graphics command queue) before the 'Flip' takes place.
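For completeness, the actual CPU/GPU synchronization in that sample happens in WaitForPreviousFrame(), which (roughly, paraphrased from the same HelloTriangle sample) signals a fence on the command queue and then blocks the CPU on it:
void D3D12HelloTriangle::WaitForPreviousFrame()
{
    // Signal the fence from the GPU side and remember the value we expect.
    const UINT64 fence = m_fenceValue;
    ThrowIfFailed(m_commandQueue->Signal(m_fence.Get(), fence));
    m_fenceValue++;

    // Block the CPU until the GPU has passed the signal for this frame.
    if (m_fence->GetCompletedValue() < fence)
    {
        ThrowIfFailed(m_fence->SetEventOnCompletion(fence, m_fenceEvent));
        WaitForSingleObject(m_fenceEvent, INFINITE);
    }

    m_frameIndex = m_swapChain->GetCurrentBackBufferIndex();
}
So Present queues the flip asynchronously, and the sample then simply waits for the whole frame to complete before recording the next one; real applications keep several frames in flight instead of serializing like this.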