Wait for kernel to finish OpenCL - c++

My OpenCL program doesn't always finish before further host (C++) code is executed. The OpenCL code is only executed up to a certain point (which appears to be random). The code is shortened a bit, so there may be a few things missing.
cl::Program::Sources sources;
string code = ResourceLoader::loadFile(filename);
sources.push_back({ code.c_str(),code.length() });
program = cl::Program(OpenCL::context, sources);
if (program.build({ OpenCL::default_device }) != CL_SUCCESS)
{
exit(-1);
}
queue = CommandQueue(OpenCL::context, OpenCL::default_device);
kernel = Kernel(program, "main");
Buffer b(OpenCL::context, CL_MEM_READ_WRITE, size);
queue.enqueueWriteBuffer(b, CL_TRUE, 0, size, arg);
buffers.push_back(b);
kernel.setArg(0, this->buffers[0]);
vector<Event> wait{ Event() };
Version 1:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &wait[0]);
Version 2:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, &wait, NULL);
Then, in both versions:
wait[0].wait();
queue.finish();
Version 1 just does not wait for the OpenCL program. Version 2 crashes the program (at queue.enqueueNDRangeKernel):
Exception thrown at 0x51D99D09 (nvopencl.dll) in foo.exe: 0xC0000005: Access violation reading location 0x0000002C.
How would one make the host wait for the GPU to finish here?
EDIT: queue.enqueueNDRangeKernel returns -1000, while it returns 0 for a rather small kernel.

Version 1 says to signal wait[0] when the kernel is finished - which is the right thing to do.
Version 2 is asking your clEnqueueNDRangeKernel() to wait for the events in wait before it starts that kernel [which clearly won't work].
On its own, queue.finish() [or clFinish()] should be enough to ensure that your kernel has completed.
Since you haven't called clCreateUserEvent, and you haven't passed the event into anything else that initializes it, the second variant doesn't work.
It is rather bad that it crashes [it should return "invalid event" or some such - but presumably the driver you are using doesn't have a way to check that the event hasn't been initialized]. I'm reasonably sure the driver I work with will issue an error for this case - but I try to avoid getting it wrong...
I have no idea where -1000 comes from - it is not a core error code, nor a reasonable return value from the CL C++ wrappers (the only standard use of -1000 is CL_INVALID_GL_SHAREGROUP_REFERENCE_KHR, from the cl_khr_gl_sharing extension). Whether the kernel is small or large [and/or completes in a short or long time] shouldn't affect the return value from the enqueue, since all that SHOULD do is enqueue the work [with no guarantee that it starts until a queue.flush() or clFlush() is performed]. Waiting for it to finish should happen elsewhere.
I do most of my work via the raw OpenCL API, not the C++ wrappers, which is why I'm referring to what they do, rather than the C++ wrappers.
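For reference, a minimal sketch of the corrected Version 1 pattern using the cl.hpp C++ wrappers; the variable names and the global size are assumptions, not taken from the question's code:
cl::Event done;                      // default-constructed; filled in by the enqueue
cl_int err = queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,                   // no offset
    cl::NDRange(global_size),        // global work size (assumed variable)
    cl::NullRange,                   // let the runtime choose a local size
    NULL,                            // no wait list
    &done);                          // OUT: event signalled when the kernel completes
if (err != CL_SUCCESS) { /* handle the error */ }
done.wait();                         // block the host until the kernel finishes
// or, equivalently for this purpose, drain the whole queue:
queue.finish();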

I faced a similar problem with OpenCL where some packets of a data stream were not processed by OpenCL.
I realized it only happens while the notebook is plugged into a docking station.
Maybe this helps someone.
(No clFlush or clFinish calls)


Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan?

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but I still have trouble understanding how I should implement a simple render loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() even though the signal request from the previous call hasn't fired yet? I would think it's not; but, on the other hand, we're using the same queues (I don't know if it matters whether the graphics and presentation queues are actually the same) and operations inside a queue should be consumed sequentially... But if I got it right, they might not be consumed "as a whole" and could be reordered...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
std::uint32_t image_index;
switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
{
case VK_SUBOPTIMAL_KHR:
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkAcquireNextImageKHR");
}
static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.waitSemaphoreCount = 1;
submit_info.pWaitSemaphores = &m_image_available;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores = &m_rendering_finished;
submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
throw std::runtime_error("vkQueueSubmit");
VkPresentInfoKHR present_info{};
present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present_info.waitSemaphoreCount = 1;
present_info.pWaitSemaphores = &m_rendering_finished;
present_info.swapchainCount = 1;
present_info.pSwapchains = &swap_chain().handle();
present_info.pImageIndices = &image_index;
switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
{
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
case VK_SUBOPTIMAL_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkQueuePresentKHR");
}
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
In the same way, if j = a, then i = a and the a-th frame in flight is using swap chain image a, for a = 1, 2
Now, if j = 3, then i = 3, but since the swap chain only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls to vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question which remains is why I shouldn't take k as large as possible? What I theoretically could imagine is that if we are submitting work to the GPU in a quicker fashion than the GPU is able to consume, the used queue(s) might continually grow and eventually overflow (is there some kind of "max commands in queue" limit?).
If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphores have already been signalled before you use them again.
The specification says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: You need to use a fence for that which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work to the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.
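As a concrete illustration of "multiplying" the synchronization objects, here is a minimal creation sketch; MAX_FRAMES_IN_FLIGHT and the container names are assumptions, not part of the question's code. Creating the fences pre-signalled means the very first wait in the render loop returns immediately:
static constexpr std::uint32_t MAX_FRAMES_IN_FLIGHT = 3; // assumed value
std::vector<VkSemaphore> image_available(MAX_FRAMES_IN_FLIGHT);
std::vector<VkSemaphore> rendering_finished(MAX_FRAMES_IN_FLIGHT);
std::vector<VkFence> in_flight(MAX_FRAMES_IN_FLIGHT);
VkSemaphoreCreateInfo sem_info{};
sem_info.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;
VkFenceCreateInfo fence_info{};
fence_info.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
fence_info.flags = VK_FENCE_CREATE_SIGNALED_BIT; // first wait must not block
for (std::uint32_t i = 0; i < MAX_FRAMES_IN_FLIGHT; ++i)
{
vkCreateSemaphore(device(), &sem_info, nullptr, &image_available[i]);
vkCreateSemaphore(device(), &sem_info, nullptr, &rendering_finished[i]);
vkCreateFence(device(), &fence_info, nullptr, &in_flight[i]);
}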
So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. This is here used to synchronize the rendering commands with the present command such that the presenting engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then, on the CPU side, wait for it before you proceed. This is used here so that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering of the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command of the queue or device.
However, in practice you do not want to stall the CPU; you want to record commands for the next frame in the meantime. This is done via frames in flight. It simply means that for every frame in flight (i.e. every frame that can be recorded in parallel to the execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So in essence, for your render loop to work properly you need a pair of semaphores plus a fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and the image index (swapchain) will generally not be the same, unless you use the same number of swapchain images as frames in flight. This is because the presentation engine might give you swapchain images out of order, depending on your present mode.
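Putting both answers together, a sketch of how the per-frame objects are used in the loop - again assuming the names from the creation sketch above; the elided parts are exactly the body of the question's render():
std::uint32_t current_frame = 0; // index of the frame in flight, not the image
void render()
{
// Block until the GPU has finished the submit that last used this slot;
// only then are its semaphores guaranteed to be unsignalled again.
vkWaitForFences(device(), 1, &in_flight[current_frame], VK_TRUE,
    std::numeric_limits<std::uint64_t>::max());
vkResetFences(device(), 1, &in_flight[current_frame]);
// ... vkAcquireNextImageKHR signalling image_available[current_frame],
//     vkQueueSubmit waiting on it, signalling rendering_finished[current_frame]
//     and fencing in_flight[current_frame],
//     vkQueuePresentKHR waiting on rendering_finished[current_frame] ...
current_frame = (current_frame + 1) % MAX_FRAMES_IN_FLIGHT;
}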

waveOutWrite buffers are never returned to application

I have a problem with Microsoft's WaveOut API:
edit1: added a link to a sample project
edit2: removed the link; it's not representative of the issue
After playing some audio, when I want to terminate a given playback stream, I call the function:
waveOutClose(hWaveOut_);
However, even after waveOutClose() is called, sometimes the library will still access memory previously passed to it by waveOutWrite(), causing an invalid memory access.
I then tried to ensure all the buffers are marked as done before freeing the buffer:
PcmPlayback::~PcmPlayback()
{
if(hWaveOut_ == nullptr)
return;
waveOutReset(hWaveOut_); // infinite-loops, never returns
for(auto it = buffers_.begin(); it != buffers_.end(); ++it)
waveOutUnprepareHeader(hWaveOut_, &it->wavehdr_, sizeof(WAVEHDR));
while( buffers_.empty() == false ) // infinite loops
removeCompletedBuffers();
waveOutClose(hWaveOut_);
//Unhandled exception at 0x75629E80 (msvcrt.dll) in app.exe:
// 0xC0000005: Access violation reading location 0xFEEEFEEE.
}
void PcmPlayback::removeCompletedBuffers()
{
for(auto it = buffers_.begin(); it != buffers_.end();)
{
if( it->wavehdr_.dwFlags & WHDR_DONE )
{
waveOutUnprepareHeader(hWaveOut_, &it->wavehdr_, sizeof(WAVEHDR));
it = buffers_.erase(it);
}
else
++it;
}
}
However, this never happens - the buffer list never becomes empty. There will be 4-5 blocks remaining with wavehdr_.dwFlags == 18, i.e. WHDR_PREPARED | WHDR_INQUEUE (I believe this means the blocks are still queued for playback).
How can I resolve this issue?
# Martin Schlott ("Can you provide the loop where you write the buffer to waveOutWrite?")
It's not quite a loop; instead I have a function that is called whenever I receive an audio packet over the network:
void PcmPlayback::addData(const std::vector<short> &rhs)
{
removeCompletedBuffers();
if(rhs.empty())
return;
// add new data
buffers_.push_back(Buffer());
Buffer & buffer = buffers_.back();
buffer.data_ = rhs;
ZeroMemory(&buffers_.back().wavehdr_, sizeof(WAVEHDR));
buffer.wavehdr_.dwBufferLength = buffer.data_.size() * sizeof(short);
buffer.wavehdr_.lpData = (char *)(buffer.data_.data());
waveOutPrepareHeader(hWaveOut_, &buffer.wavehdr_, sizeof(WAVEHDR)); // prepare block for playback
waveOutWrite(hWaveOut_, &buffer.wavehdr_, sizeof(WAVEHDR));
}
The described behavior can happen if you do not call
waveOutUnprepareHeader
for every buffer you used before you call
waveOutClose
The flag field dwFlags seems to indicate that the buffers are still enqueued (WHDR_INQUEUE | WHDR_PREPARED). Try:
waveOutReset
before unpreparing the buffers.
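A minimal sketch of that teardown order, assuming the Buffer/buffers_ members shown in the question:
waveOutReset(hWaveOut_);                // returns all queued buffers, marking them WHDR_DONE
for (auto &buf : buffers_)              // now unpreparing succeeds
    waveOutUnprepareHeader(hWaveOut_, &buf.wavehdr_, sizeof(WAVEHDR));
buffers_.clear();
waveOutClose(hWaveOut_);                // safe: no buffers are outstanding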
After analyzing your code, I found two problems/bugs which are not related to waveOut (funny, you use C++11 but the oldest media interface). You use a vector as a buffer. During some call operations, the vector is copied! One bug I found is:
typedef std::function<void(std::vector<short>)> CALLBACK_FN;
instead of:
typedef std::function<void(std::vector<short>&)> CALLBACK_FN;
which forces a copy of the vector.
Try to avoid using vectors if you expect to use them mostly as raw buffers. Better to use std::unique_ptr as the buffer pointer.
Your callback in the recorder is not guarded by a mutex, nor does it check whether the destructor was already called. The destruction happens (mostly) during the callback, which leads to an exception.
For your test program, go back and use raw pointers and static callbacks before blaming waveOut. Your code is not bad, but the first bug already shows that a small bug will lead to unpredictable errors. As you also organize your buffers in a std::array, I would search for bugs there. I guess you make an unintentional copy of your whole buffer array, unpreparing the wrong buffers.
I did not have the time to dig deeper, but I guess those are the problems.
I managed to find my problem in the end; it was caused by multiple bugs and a deadlock. I will document what happened here so people can learn from this in the future.
I was clued in to what was happening when I fixed the bugs in the sample:
call waveInStop() before waveInClose() in ~Recorder.cpp
wait for all buffers to have the WHDR_DONE flag before calling waveOutClose() in ~PcmPlayback.
After doing this, the sample worked fine and did not display the behavior of the WHDR_DONE flag never being marked.
In my main program, that behavior was caused by a deadlock that occurs in the following situation:
I have a vector of objects representing each peer I am streaming audio with
Each Object owns a Playback class
This vector is protected by a mutex
Recorder callback:
mutex.lock()
send audio packet to each peer.
Remove Peer:
mutex.lock()
~PcmPlayback
wait for WHDR_DONE flags to be marked
A deadlock occurs when I remove a peer, locking the mutex and the recorder callback tries to acquire a lock too.
Note that this will happen often, because the playback buffer is usually ~4 × 20 ms long while the recorder has a cadence of 20 ms.
In ~PcmPlayback, the buffers will never be marked as WHDR_DONE and any calls to the WaveOut API will never return because the WaveOut API is waiting for the Recorder callback to complete, which is in turn waiting on mutex.lock(), causing a deadlock.
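One way to break this cycle, sketched under the assumption of a peers vector guarded by peers_mutex (the names are illustrative, not from the original code): take the playback object out of the shared structure under the lock, but run its destructor - and therefore the wait for WHDR_DONE - only after the lock has been released, so the recorder callback can still acquire the mutex.
std::unique_ptr<PcmPlayback> doomed;
{
    std::lock_guard<std::mutex> lock(peers_mutex);  // hypothetical mutex
    doomed = std::move(peers[i].playback);          // detach under the lock
    peers.erase(peers.begin() + i);
}
// Lock released: the recorder callback can proceed, buffers drain,
// and ~PcmPlayback's wait for WHDR_DONE can complete.
doomed.reset();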

multi-threading limit?

I am writing a program using threads in C++ on Linux.
Currently, I am just keeping an array of threads, and every time one second has elapsed, I check to see which have finished and restart them. Is this bad? I need to keep this program running for a long time. As it is now, I am getting error code 11 after so many loops of restarting threads (the 100th loop in the last trial). I figured that by reusing threads and making sure only a small number of them are running at any one time, I would not hit the limit. The array I am using only has a size of 8 (and of course I am not starting 8 each time, just those that have stopped).
Any ideas?
My code is below:
if ( loop_times == 0 || pthread_kill(threads[t],0) != 0 )
{
rc = pthread_create(&threads[t], NULL, thread_stall, (void *)NULL);
if (rc){
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
thread_count++;
}
The loop_times variable is just so that I can get into the loop and start the threads the first time. Otherwise, I get a SEGFAULT because the threads haven't been started before.
Also, I have been wanting to see the value of PTHREAD_THREADS_MAX, but I can't print it (even when including limits.h)
If you want to use multiple threads, it is better to go for a thread pool.
Start a set of threads as detached ones, and then through a queue you can send work to each thread, so that it can process it and wait for your next input.
As it turns out, my problem was that I needed to pthread_join my threads before restarting them each time. After this, I stopped getting error code 11 (EAGAIN) and stopped having "still reachable" memory when running the program through Valgrind.
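A sketch of that fix applied to the loop from the question - note that probing liveness with pthread_kill(t, 0) is fragile, but it is kept here to stay close to the original code:
if ( loop_times == 0 || pthread_kill(threads[t], 0) != 0 )
{
if (loop_times != 0)
    pthread_join(threads[t], NULL); // reclaim the finished thread's resources
rc = pthread_create(&threads[t], NULL, thread_stall, (void *)NULL);
if (rc){
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
thread_count++;
}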

CreateThread failure on a longterm run

I'm writing a program in C++ using WINAPI to monitor a certain directory for new files arriving, and to send them in a certain order. The files are derived from a live video stream, so a unit consists of 2 files - an audio file and a video file - and units should be sent in sequence, e.g. (1.mp3, 1.avi); (2.mp3, 2.avi)... The architecture is:
1) detect a new file added to the folder, insert the file name into the input queue
2) organize files into units, insert units into the unit queue
3) send unit by unit
But since I have to monitor the directory for files added there, I need to make sure that a file is complete, i.e. ready to send: the notification arrives when the file is created, but it has yet to be filled with data and closed. So I pop a file name from the input queue either when the queue has more than 1 file (i.e. the notification came for the next file created, which means the previous file is ready to send) or on a timeout (10 sec), so within 10 seconds any file should be done.
In general this program runs and works properly. But if the send procedure takes too long, the unit queue grows, and after some number of units have been buffered in the unit queue the bug appears.
time[END] = 0;
time[START] = clock();
HANDLE hIOMutex2= CreateMutex (NULL, FALSE, NULL);
WaitForSingleObject( hIOMutex2, INFINITE );
hTimer = CreateThread(NULL, 0, Timer, time, 0, &ThreadId1);
if(hTimer == NULL)
printf("Timer Error\n");
ReleaseMutex(hIOMutex2);
ReadDirectoryChangesW(hDir, szBuffer, sizeof(szBuffer) / sizeof(TCHAR), FALSE, FILE_NOTIFY_CHANGE_FILE_NAME, &dwBytes, NULL, NULL);
HANDLE hIOMutex= CreateMutex (NULL, FALSE, NULL);
WaitForSingleObject( hIOMutex, INFINITE );
time[END] = clock();
TerminateThread(hTimer, 0);
ReleaseMutex( hIOMutex);
After around 800 units are buffered in the queue, my program gives me the "Timer Error" message; if I'm right, that means the program can't create the thread. But in this code the program terminates the timer thread exactly after the file was created in the directory, so I'm kind of confused by this bug. Also interesting is that even with this error, my program continues to send units as usual, so it doesn't look like an OS problem or something unrelated - it is a wrong thread creation/termination, at least it seems like that to me.
Also providing Timer code below if it is helpful.
DWORD WINAPI Timer(LPVOID in){
clock_t* time = (clock_t*) in;
while(TRUE){
if(((clock() - time[START])/CLOCKS_PER_SEC >= 10) && (!time[END]) && (!output.empty())){
Send();
if(output.empty()){
ExitThread(0);
}
}
else if((output.empty()) || (time[END])){
break;
}
else{
Sleep(10);
}
}
ExitThread(0);
return 0;
}
Could anyone here please give me some advice on how to solve this bug? Thanks in advance.
Using TerminateThread is a bad idea in many ways. In your case, it makes your program fail because it doesn't release the memory for the thread stack. Failure comes when your program has consumed all available virtual memory and CreateThread() cannot reserve enough memory for another thread. Only ever use TerminateThread while exiting a program.
You'll have to do this in a smarter way: either ask the thread to exit nicely by signaling an event, or just don't consume such an expensive system resource only for handling a file. A simple timer and one thread can do this too.
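A minimal sketch of the event-based approach, with illustrative names (hStopEvent is not part of the original code):
HANDLE hStopEvent = CreateEvent(NULL, TRUE, FALSE, NULL); // manual-reset, non-signalled
DWORD WINAPI Timer(LPVOID in)
{
// Poll every 10 ms, but exit promptly once the main thread signals stop.
while (WaitForSingleObject(hStopEvent, 10) == WAIT_TIMEOUT)
{
// ... the existing periodic send logic ...
}
return 0; // a normal return releases the thread's stack
}
// Instead of TerminateThread(hTimer, 0):
SetEvent(hStopEvent);                  // ask the thread to exit
WaitForSingleObject(hTimer, INFINITE); // wait until it has actually finished
CloseHandle(hTimer);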

Win32 Overlapped Readfile on COM Port returning ERROR_OPERATION_ABORTED

Ok, one for the SO hive mind...
I have code which has - until today - run just fine on many systems and is deployed at many sites. It involves threads reading and writing data from a serial port.
Trying to check out a new device, my code was swamped with 995 ERROR_OPERATION_ABORTED errors when calling GetOverlappedResult after the ReadFile. Sometimes the read would work, other times I'd get this error. Just ignoring the error and retrying would - amazingly - work without dropping any data. No ClearCommError required.
Here's the snippet.
if (!ReadFile(handle,&c,1,&read, &olap))
{
if (GetLastError() != ERROR_IO_PENDING)
{
logger().log_api(LOG_ERROR,"ser_rx_char:ReadFile");
throw Exception("ser_rx_char:ReadFile");
}
}
WaitForSingleObjectEx(r_event, INFINITE, true); // alertable, so, thread can be closed correctly.
if (GetOverlappedResult(handle,&olap,&read, TRUE) != 0)
{
if (read != 1)
throw Exception("ser_rx_char: no data");
logger().log(LOG_VERBOSE,"read char %d ( read = %d) ",c, read);
}
else
{
DWORD err = GetLastError();
if (err != 995) //Filters our ERROR_OPERATION_ABORTED
{
logger().log_api(LOG_ERROR,"ser_rx_char: GetOverlappedResult");
throw Exception("ser_rx_char:GetOverlappedResult");
}
}
My first guess is to blame the COM port driver, which I haven't used before (it's an RS422 port on a Blackmagic Decklink, FYI), but that feels like a cop-out.
Oh, and Vista SP1 Business 32-bit, for my sins.
Before I just put this down to "Someone else's problem", does anyone have any ideas of what might cause this?
How are you setting up the OVERLAPPED structure before the ReadFile? I always zero them (other than the hEvent, obviously), which is perhaps part superstition, but I have a feeling that not doing so has caused me a problem in the past.
I'm afraid blaming the driver (if it's non-MS and not just a tiny tweak from the reference) is not completely unrealistic. To write a COM driver is an incredibly complex thing, and the difficulty with testing it is that every application ever written uses the serial ports and their IOCTLs slightly differently.
Another common problem is not to set the whole port up - for example not calling SetCommTimeouts or SetupComm. I've no idea if you're making this sort of mistake, but I have met people who say they're not using timeouts when they actually mean that they didn't call SetCommTimeouts so they're using them but don't have a notion what they're set to...
This kind of stuff can be murder for 3rd-party COM drivers, because people have often got away with any old crap with the MS driver, and it doesn't always work the same with another device.
In addition to zeroing the OVERLAPPED, you might also check how you're setting olap.hEvent, that is, what your arguments to CreateEvent are. If you're creating an event that's pre-signalled (i.e. the third argument to CreateEvent is TRUE), I would expect an immediate return. Also, don't forget that if you specify manualReset (the second argument to CreateEvent) as FALSE, GetOverlappedResult() will helpfully clear the event for you - which might explain why it works the second time around.
Can't really tell from your snippet whether either of these affect you - hope this helps.
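For reference, a sketch of the setup both answers are recommending - zero the OVERLAPPED and give it a manual-reset, initially non-signalled event. The snippet itself is illustrative; the idea is to check the calls you already have against it:
OVERLAPPED olap;
ZeroMemory(&olap, sizeof(olap));  // zero everything, including the Offset fields
olap.hEvent = CreateEvent(NULL,
                          TRUE,   // manual reset: GetOverlappedResult
                                  // won't clear it behind your back
                          FALSE,  // initially non-signalled
                          NULL);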