Pooling PBOs and textures? - c++

I have an application which does a lot of GPGPU using Opengl and Pixel Buffer Objects to transfer and process data.
Currently I employ a pooling of these resources, basically I have a pool for every buffer dimensions and usage that my application uses. When the usage of resource finishes it returns to its respective pool for re-use. However, I'm starting to have seconds thoughts whether there is any is in this since I need "orphan" the PBOs before re-use to not interfere with ongoing transfers.
My question is whether there is any merit is in pooling resources such as PBOs and textures, or would it be just a good to simply allocate from OpenGL directly when needed?
Here is an example of what I am doing. Vice versa with textures.
std::shared_ptr<pbo> create_pbo(int size, bool write)
{
auto pool = pbo_pools[write][size];
std::shared_ptr<pbo> buffer;
if(!pool->try_pop(buffer))
buffer = ogl_thread_.invoke([=]{return new pbo(size, write);});
return spl::shared_ptr<pbo>(buffer.get(), [=](pbo*) mutable
{
ogl_thread_.begin_invoke([=]() mutable
{
if(write)
buffer->map();
else // read
buffer->unmap();
pool->push(buffer);
});
});
}

I'm starting to have seconds thoughts whether there is any is in this since I need "orphan" the PBOs before re-use to not interfere with ongoing transfers.
No you don't have to. That's the nice thing about PBOs: You can submit new data into them, while a call to glTex(Sub)Image may still be reading from them, without the read operation being corrupted.

Related

Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan?

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but still got trouble to understand how I should implement a simple render-loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() again even though the signal request from the previous call hasn't fired yet? I would think it's not, but, on the other hand, we're using the same queues (don't know if it matters where the graphics and presentation queue are actually the same) and operations inside a queue should be sequentially consumed ... But if I got it right, they might not be consumed "as a whole" and could be reordered ...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
std::uint32_t image_index;
switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
{
case VK_SUBOPTIMAL_KHR:
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkAcquireNextImageKHR");
}
static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.waitSemaphoreCount = 1;
submit_info.pWaitSemaphores = &m_image_available;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores = &m_rendering_finished;
submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
throw std::runtime_error("vkQueueSubmit");
VkPresentInfoKHR present_info{};
present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present_info.waitSemaphoreCount = 1;
present_info.pWaitSemaphores = &m_rendering_finished;
present_info.swapchainCount = 1;
present_info.pSwapchains = &swap_chain().handle();
present_info.pImageIndices = &image_index;
switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
{
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
case VK_SUBOPTIMAL_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkQueuePresentKHR");
}
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
In the same way, if j = a, then i = a and the ath frame in flight is using swap chain image a, for a= 2, 3
Now, if j = 3, then i = 3, but since the swap chain image only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls of vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question which remains is why I shouldn't take k as large as possible? What I theoretically could imagine is that if we are submitting work to the GPU in a quicker fashion than the GPU is able to consume, the used queue(s) might continually grow and eventually overflow (is there some kind of "max commands in queue" limit?).
If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphores have already been signalled before you use them again.
The specifications says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: You need to use a fence for that which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work to the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.
So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. This is here used to synchronize the rendering commands with the present command such that the presenting engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then on the CPU side wait for it before you want to proceed. This is usually done here such that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering of the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command of the queue or device.
However, in practice you do not want to stall the CPU and in the meantime record commands for the next frame. This is done via frames in flight. This simply means that for every frame in flight (i.e. number of frames that can be recorded in parallel to the execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So in essence for your render loop to work properly you need a pair of semaphores + fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and image index (swapchain) will generally not be the same except you use the same amount of swapchain images as frames in flight. This is because the presenting engine might give you swapchain images out of order depending on your presenting mode.

Changing VkBuffers used in command with out re-building the command buffers?

This might be an X Y problem, so here's my issue:
I'm trying to send a command buffer to the GPU that adds values to a shader buffer, eg:
#version 450
#define INPUT_ARRAY_SIZE 1024
layout(std430, binding = 1) buffer InputArray{
float array[ ];
}input_array;
void main()
{
uint index = gl_GlobalInvocationID.x;
if (index >= INPUT_ARRAY_SIZE){
return;
}
InputArray.input_array[index] += 3;
}
I would like to be able to swap out the VkBuffer I use to back the shader buffer with other buffers. ie:
void addValue(device, queue, command_buffer, buffer);
or
void addValue(device, queue, command_buffer, descriptor_set);
where I would swap out buffer for other buffers I want to add values to.
Unfortunately I don't see a way to do that with out re-recording my command buffer. As far as I can tell my only options for minimizing the command buffer impact (which is large when my invocations take nano seconds), is to use secondary command buffers, and use pipeline cache some how. Otherwise I would have to create a command buffer for every single new buffer, which is not feasible when I have more than 100 commands. It doesn't seem to be possible to use VkUpdateDescriptorSets with out re-recording as well.
Is there a way to use pre-recorded command buffers, and change the VkBuffer used behind the shader buffer at will with out re-recording the command?
Not without the EXT_descriptor_index extension. Descriptor values (the location of the GPU resources they represent) are supposed to be baked into the CB at write time, not read from some external source.
Even with descriptor index, you need to ensure that the CB is not being executed before you can update the descriptor. So that would require a GPU/CPU sync (which may or may not be bad, depending on your semaphore/submission code structure).
Otherwise I would have to create a command buffer for every single new buffer, which is not feasible when I have more than 100 commands.
You should not put each command in its own buffer. You should bundle as many commands together as possible.
In general, the cost of building command buffers is pretty low. Coupled with threading their construction, they shouldn't be your primary concern here. Especially when the number of commands is as low as "more than 100 commands;" Vulkan users routinely issue thousands of commands into CBs repeatedly, every frame.

AT command response parser

I am working on my own implementation to read AT commands from a Modem using a microcontroller and c/c++
but!! always a BUT!! after I have two "threads" on my program, the first one were I am comparing the possible reply from the Moden using strcmp which I believe is terrible slow
comparing function
if (strcmp(reply, m_buffer) == 0)
{
memset(buffer, 0, buffer_size);
buffer_size = 0;
memset(m_buffer, 0, m_buffer_size);
m_buffer_size = 0;
return 0;
}
else
return 1;
this one works fine for me with AT commands like AT or AT+CPIN? where the last response from the Modem is "OK" and nothing in the middle, but it is not working with commands like AT+CREG?, wheres it responses:
+REG: n,n
OK
and I am specting for "+REG: n,n" but I believe strncpy is very slow and my buffer data is replaced for "OK"
2nd "thread" where it enables a UART RX interruption and replaces my buffer data every time it receives new data
Interruption handle:
m_buffer_size = buffer_size;
strncpy(m_buffer, buffer, buffer_size + m_buffer_size);
Do you know any out there faster than strcmp? or something to improve the AT command responses reading?
This has the scent of an XY Problem
If you have seen the buffer contents being over written, you might want to look into a thread safe queue to deliver messages from the RX thread to the parsing thread. That way even if a second message arrives while you're processing the first, you won't run into "buffer overwrite" problems.
Move the data out of the receive buffer and place it in another buffer. Two buffers is rarely enough, so create a pool of buffers. In the past I have used linked lists of pre-allocated buffers to keep fragmentation down, but depending on the memory management and caching smarts in your microcontroller, and the language you elect to use, something along the lines of std::deque may be a better choice.
So
Make a list of free buffers.
When a the UART handling thread loop looks something like,
Get a buffer from the free list
Read into the buffer until full or timeout
Pass buffer to parser.
Parser puts buffer in its own receive list
Parsing sends a signal to wake up its thread.
Repeat until terminated. If the free list is emptied, your program is probably still too slow to keep up. Perhaps adding more buffers will allow the program to get through a busy period, but if the data flow is relatively constant and the free list empties out... Well, you have a problem.
Parser loop also repeats until terminated looks like:
If receive list not empty,
Get buffer from receive list
Process buffer
Return buffer to free list
Otherwise
Sleep
Remember to protect the lists from concurrent access by the different threads. C11 and C++11 have a number of useful tools to assist you here.

Concatenate data in an array in C ++

I'm working on software for processing audio in real time in C++ with Qt. I need that requirements are minimized.
Defining a temporary buffer 40ms, launching our device with a sampling frequency Fs = 8000Hz, every 320 samples entered a feature called Data Processing ().
The idea is to have a global buffer that stores the 10s last recorded, 80000 samples.
This Buffer in each iteration eliminates the initial 320 samples and looped at the end, 320 new samples. Thus the buffer is updated and the user can observe the real-time graphical representation of the recorded signal.
At first I thought of using QVector (equivalent to std::vector but for Qt) for this deployment, thus we reduce the process a few lines of code
int NUM_POINTS=320;
DatosTemporales.erase(DatosTemporales.begin(),DatosTemporales.begin()+NUM_POINTS);
DatosTemporales+= (DatosNuevos); // Datos Nuevos con un tamaƱo de NUM_POINTS
In each iteration we create a vector of 80000 samples in addition to free some positions so requires some processing time. An alternative for opting was the use of * double, and iterations a loop:
for(int i=0;i<80000;i++){
if(i<80000-NUM_POINTS){
aux=DatosTemporales[i];
DatosTemporales[i+NUM_POINTS]=aux;
}else{
DatosTemporales[i]=DatosNuevos[i-NUN_POINTS];
}
}
Does fails. I think the best way is to use dynamic memory. Implementing this process by pointers. Could anyone give me some idea how to implement it?
It sounds like what you are looking for is a circular buffer.
https://www.google.com/search?q=qcircularbuffer
https://qt.gitorious.org/qt/qtbase/merge_requests/60
And it looks like you only need the header file and you should be good to go.
A similar tool that is already in the Qt data set is found here:
http://doc.qt.io/qt-5/qcontiguouscache.html#details
The advantage of using a system like these presented, is that they don't need to have dynamic memory, it just needs to move the head and the tail pointers.
Hope that helps.

Restarting Streaming OpenAL Source?

Why does my streaming OpenAL source somtimes go to AL_STOPPED state, forcing me to call alSourcePlay? This usually happens when I do not call send fast enough, i.e. in debug mode. Does the oal source automatically stop when it doesn't have enough queue buffers? How do I avoid that?
void send(audio_buffer audio) override
{
ALenum state;
alGetSourcei(source_, AL_SOURCE_STATE,&state);
if(state != AL_PLAYING)
alSourcePlay(source_); // This happens sometimes, usually when "send" is not called fast enough.
ALuint buffer = 0;
alSourceUnqueueBuffers(source_, 1, &buffer);
if(buffer)
{
alBufferData(buffer, AL_FORMAT_STEREO16, audio.data(), static_cast<ALsizei>(audio.size()*sizeof(int16_t)), 48000);
alSourceQueueBuffers(source_, 1, &buffer);
}
else
LOG << "Dropped audio.";
}
It sounds like your basic problem is that your audio stream is starved. There are a few options you can use to mitigate this, but they all have their own side effects:
(1) You can configure it to play from a looping buffer, to which you are supplying the relevant data. The downside to this is that it will audibly repeat itself if you starve the buffer too long, but it will have some better performance characteristics (fragmentation, etc).
(2) You can increase the send buffer size. This will only cover up small problems, and potentially increases the latency in dynamic content.
(3) Finally, you can thread the audio send operation, that way so long as the audio thread isn't starved, it can continue to send data in the background.
The high production / quality solution probably involes all three of these. Sorry for the lack of OpenAL specific terminology, but every audio system I've seen has these capabilities.