Vulkan/GLSL shader: best way to allocate a buffer with previously unknown output length

I'm writing a compute shader that will output an unknown amount of data into a storage buffer (there is a theoretical upper bound, but it is huge compared to the expected values).
I have found a way to achieve this: first run a simplified version of the shader that just counts the number of output values, then use that result to allocate a buffer that is large enough, and run the real shader writing into this buffer.
But that feels inefficient to me: there is a rather big overhead from recording and submitting two command buffers in succession, both doing essentially the same work, and from retrieving the counter from the GPU in between. Is there a better way to do this? From what I have read, maybe Dynamic Storage Buffers could be a solution, but I can't find much information on how they work (an example would be nice) or what they are really intended for.
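For reference, the two passes can at least be driven from a single shader source. The following is a minimal GLSL sketch of that idea, assuming an SSBO whose first member serves as an atomic counter; COUNT_ONLY, producesValue, and value are illustrative names, not part of any real API:

#version 430
// Pass 1 is compiled with COUNT_ONLY defined; pass 2 without it,
// after the host has read back `count` and sized the buffer accordingly.
layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer Output {
    uint count;      // atomically incremented in both passes
    float results[]; // written only in pass 2
};

void main() {
    // producesValue/value are illustrative placeholders for the real logic.
    bool producesValue = (gl_GlobalInvocationID.x % 2u) == 0u;
    float value = float(gl_GlobalInvocationID.x);
    if (producesValue) {
        uint slot = atomicAdd(count, 1u);
#ifndef COUNT_ONLY
        results[slot] = value; // safe: buffer was sized from the counting pass
#endif
    }
}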

How to suballocate buffers in Vulkan

A recommended approach to memory management in Vulkan is sub-allocation of buffers, for instance see the image below.
I'm trying to implement "the good" approach. I have a system in place that can tell me which regions within a memory allocation are available, so I can bind a sub-area of a single large buffer.
However, I can't find the mechanism to do this, or I am just misunderstanding what is happening, as the bind functions take a buffer and an offset as input. I can't see how to specify the size of the binding other than through the existing buffer.
So I have a few questions I guess:
are the dotted rectangles in the image below just bindings, or are they additional buffers?
if they are bindings, how do I tell Vulkan (ideally using VMA) to use that subsection of the buffer?
if they are additional buffers, how do I create them?
if neither, what are they?
I have read up on a few custom allocators, but they seem to follow the "bad" approach, returning offsets into large allocations for binding, so there are still plenty of buffers, just lower allocation counts.
To be clear, I am not using custom allocator callbacks other than through VMA; the "system" to which I refer above sits on top of the VMA calls.
Any pointers much appreciated!
are the dotted rectangles in the image below just bindings, or are they additional buffers?
They represent the actual data. So the "Index" block is the range of storage that contains vertex indices.
if they are bindings, how do I tell Vulkan (ideally using VMA) to use that subsection of the buffer?
That depends on the particular way you're using that VkBuffer as a resource. Generally speaking, every function that uses a VkBuffer as a resource takes a byte offset that represents where to start reading from. Many such functions also take a size which, coupled with the offset, represents the full quantity of data that can be read through that particular resource.
For example, vkCmdBindVertexBuffers takes an array of VkBuffers, and for each VkBuffer it also takes a byte offset that represents the starting point for that vertex buffer. VkDescriptorBufferInfo, the structure that represents a buffer used by a descriptor, takes a VkBuffer, a byte offset, and a size.
The vertex buffer (and index buffer) bindings don't have a size, but they don't need one. Their effective size is defined by the rendering command used with them (and the index data being read by it). If you render using 100 32-bit indices, then the expectation is that the index buffer's size, minus the starting offset, should be at least 400 bytes. And if it isn't, undefined behavior results.
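To make the descriptor case concrete, here is a minimal C sketch of pointing a storage-buffer descriptor at a sub-range of one large VkBuffer; bigBuffer, indexBlockOffset, indexBlockSize, and descriptorSet are illustrative names:

/* Describe a sub-range of one large VkBuffer; offset/range are illustrative. */
VkDescriptorBufferInfo bufferInfo = {0};
bufferInfo.buffer = bigBuffer;        /* the single large VkBuffer */
bufferInfo.offset = indexBlockOffset; /* byte offset of the block within it */
bufferInfo.range  = indexBlockSize;   /* size of the block visible to the shader */

VkWriteDescriptorSet write = {0};
write.sType           = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
write.dstSet          = descriptorSet;
write.dstBinding      = 0;
write.descriptorCount = 1;
write.descriptorType  = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
write.pBufferInfo     = &bufferInfo;
vkUpdateDescriptorSets(device, 1, &write, 0, NULL);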

glMapBufferRange() downloads full buffer?

I noticed a 15 ms slowdown when calling some OpenGL functions. After some tests I believe I have narrowed down the issue. I have a buffer of a couple of MBytes (containing mostly particles). I sometimes need to add some particles. To do so I bind the buffer, read the current number of particles to know the offset to write to, then write the particles. As expected, the slowdown is in the reading part. (For this problem, assume that keeping track of the number of particles on the CPU side is impossible.)
glBindBuffer(GL_ARRAY_BUFFER, m_buffer);
// Map only the 4 bytes that hold the particle counter, for reading.
GLvoid* rangePtr = glMapBufferRange( // This call takes 15 ms to return
    GL_ARRAY_BUFFER,
    m_offsetToCounter,  // byte offset of the counter within the buffer
    sizeof(GLuint),     // map a single GLuint
    GL_MAP_READ_BIT);   // the original literal 1 happens to equal GL_MAP_READ_BIT
if (rangePtr != NULL)
    value = *(GLuint*) rangePtr;
glUnmapBuffer(GL_ARRAY_BUFFER); // the range must be unmapped before the buffer is used again
m_functions->glBindBuffer(GL_ARRAY_BUFFER, 0);
I assumed that by providing a really limited size (here a single GLuint), only a GLuint would be downloaded. However, by drastically reducing the size of my buffer to 200 KBytes, the execution time of the function drops to 8 ms.
Two questions:
Do glMapBufferRange and glGetBufferSubData download the full buffer even though the user only asks for a portion of it?
The math doesn't add up; any idea why? There are still 8 ms to download a really small buffer. The execution time looks like y = ax + b, where b is 7-8 ms. When I was trying to find the source of the problem, before suspecting the buffer size, I also found that the glUniform* functions took ~10 ms as well, but only on the first call. If there are multiple glUniform* calls one after the other, only the first one takes a lot of time; the others are instantaneous. And when the buffer is later accessed for reading, there is no download time either. Is glUniform* triggering something?
I'm using the Qt 5 API. I would like to be sure first that I'm using OpenGL properly before suspecting that it is Qt's layer that causes the slowdown and re-implementing the whole program with glu/glut.
8 ms sounds like an awful lot of time… How do you measure that time?
Do glMapBufferRange and glGetBufferSubData download the full buffer even though the user only asks for a portion of it?
The OpenGL specification does not define how buffer mapping is to be implemented by the implementation. It may be a full download of the buffer's contents. It may be a single-page I/O memory mapping. It may be anything that makes the contents of the buffer object appear in the host process's address space.
The math doesn't add up, any idea why?
For one thing, the smallest size of a memory map is the system's page size. Whether it's done by a full object copy, by an I/O memory mapping, or by something entirely different, you're always dealing with memory chunks at least a few kiB in size.
I'm using the Qt 5 API
Could it be that you're using the Qt5 OpenGL functions class? AFAIK this class loads function pointers on demand, so the first invocation of a function may trigger a chain of actions that takes a few moments to complete.
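As an aside on the measurement itself: wall-clock timing around a GL call measures driver synchronization as much as the call. Here is a minimal C sketch using a GL timer query (OpenGL 3.3+ / ARB_timer_query) to time the GPU side instead; the placeholder comment stands in for whatever call is being measured:

GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
/* ... the call being measured ... */
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsedNs = 0;
/* Blocks until the result is available, so fetch it well after glEndQuery. */
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);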

Buffer drawing in OpenGL

In this question I'm interested in buffer-drawing in OpenGL, specifically in the tradeoff of using one buffer per data set vs one buffer for more than one data set.
Context:
Consider a data set of N vertices each represented by a set of attributes (e.g. color, texture, normals).
Each attribute is represented by a type (e.g. GLfloat, GLint) and a number of components (2, 3, 4). We want to draw this data. Schematically,
(non-interleaved representation)
data set
<-------------->
a_1 a_2 a_3
<---><---><---->
a_i = attribute; e.g. a_2 = (3 GLfloats representing color, thus 3*N GLfloats)
We want to map this into the GL state, using glBufferSubData.
Problem
When mapping, we have to keep track of the data in our own memory because glBufferSubData requires a start and a size. This sounds to me like an allocation problem: we want to allocate memory and keep track of its position. Since we want fast access to it, we would like the data to stay in one contiguous memory region, e.g. in a std::vector<char>. Schematically,
data set 1 data set 2
<------------><-------------->
(both have same buffer id)
We commit to the GL state as:
// id is bound to one std::vector<char>, "data".
glBindBuffer(target, id);
// for each data_set (AFTER calling glBindBuffer):
//   for each attribute:
//     "start": the start point of the attribute within "data".
//     "size": (sizeof type * components of the attribute) * N.
glBufferSubData(target, start, size, &data[start]); // source must point at the chunk being committed
(Non-interleaved for the sake of the code.)
The problem arises when we want to add or remove vertices, e.g. when the LOD changes. Because each data set must be a contiguous chunk, for instance to allow interleaved drawing (even in the non-interleaved case, each attribute is a chunk), we will end up with fragmentation in our std::vector<char>.
On the other hand, we can also use one chunk per buffer: instead of assigning chunks to the same buffer, we assign each chunk, now its own std::vector<char>, to a different buffer. Schematically,
data set 1 (buffer id1)
<------------>
data set 2 (buffer id2)
<-------------->
We commit data to the GL state as:
// for each data_set (BEFORE calling glBindBuffer):
//   "data" is the std::vector<char> of this data_set;
//   id is now bound to this specific std::vector<char>.
glBindBuffer(target, id);
//   for each attribute:
//     "start": the start point of the attribute within "data".
//     "size": (sizeof type * components of the attribute) * N.
glBufferSubData(target, start, size, &data[start]); // same source fix as above
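For concreteness, here is a minimal C sketch of the first (shared-buffer) variant for one data set with two non-interleaved attribute blocks; N, posData, and colorData are illustrative names:

/* N vertices, 3 GLfloats of position and 3 GLfloats of color each (illustrative). */
GLsizeiptr posBytes   = N * 3 * sizeof(GLfloat);
GLsizeiptr colorBytes = N * 3 * sizeof(GLfloat);

GLuint id;
glGenBuffers(1, &id);
glBindBuffer(GL_ARRAY_BUFFER, id);
/* Allocate the whole store once... */
glBufferData(GL_ARRAY_BUFFER, posBytes + colorBytes, NULL, GL_STATIC_DRAW);
/* ...then commit each attribute block at its own byte offset. */
glBufferSubData(GL_ARRAY_BUFFER, 0,        posBytes,   posData);
glBufferSubData(GL_ARRAY_BUFFER, posBytes, colorBytes, colorData);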
Questions
I'm learning this, so, before any of the below: is this reasoning correct?
Assuming yes,
Is it a problem to have an arbitrary number of buffers?
Is "glBindBuffer" expected to scale with the number of buffers?
What are the major points to take into consideration in this decision?
It is not quite clear if you are asking about performance trade-offs, but I will answer in that key.
Is it a problem to have an arbitrary number of buffers?
It is a problem that came from the dark medieval times when pipelines were fixed, and it remains for backward-compatibility reasons. glBind* is considered (one of) the performance bottlenecks in modern OpenGL drivers, caused by bad locality of reference and cache misses. Simply speaking, the cache is cold and for a huge part of the time the CPU just waits in the driver for data transferred from main memory. There is nothing driver implementers can do about it with the current API. Read Nvidia's short article about it and their bindless extension proposals.
2. Is "glBindBuffer" expected to scale with the number of buffers?
Surely: the more objects (buffers, in your case), the more bind calls, and the more performance lost in the driver. But merged, huge resource objects are less manageable.
What are the major points to take into consideration in this decision?
Only one. Profiling results ;)
"Premature optimization is the root of all evil", so try to stay as much objective as possible and believe only in numbers. When numbers will go bad, we can think of:
"Huge", "all in one" resources:
less bind calls
less context changes
harder to manage and debug, need some additional code infrastructure (to update resource data for example)
resizing (reallocation) very slow
Separate resources:
more bind calls, loosing time in driver
more context changes
easier to manage, less error-prone
easy to resize, allocate, reallocate
In the end, we have a performance-complexity trade-off and different behavior when updating data. To stick with one approach or the other, you must:
decide whether you would like to keep things simple and manageable, or add complexity and gain additional FPS (profile in graphics profilers to know how much; is it worth it?)
know how often you resize/reallocate buffers (trace API calls in graphics debuggers).
Hope it helps somehow ;)
If you like theoretical assertions like this one, you will probably be interested in another one, about interleaving (a DirectX one).

Fastest way to transfer vertex data to GPU in OpenGL / CUDA

I have to upload just specific elements (several thousand) of the vertex array on every frame,
or the whole region between the first and last changed value; however, that is pretty inefficient, since it is likely to re-upload almost the whole array, and many unchanged values will be uploaded anyway.
The question also includes what the fastest ways are to upload vertex data to the GPU.
There are several ways to do it:
glBufferData() / glBufferSubData() // Standard upload to buffer
glBufferData() // glBufferData with double buffer
glMapBuffer() // Mapping video memory
cudaMemcpy() // CUDA memcopy from host to device vertex buffer
Which will be the fastest one? I'm especially interested in the CUDA way and how it differs from the standard OpenGL methods. Is it faster than glBufferData() or glMapBuffer()?
The speed of copying the same data from host to device should be similar no matter which copy API you use.
However the size of the data block to be copied matters a lot. Here is a benchmark showing the relationship between the data size and the copy speed using CUDA's cudaMemcpy().
CUDA - how much slower is transferring over PCI-E?
You could simply estimate the average speed from the figure above if you know the number of copy API calls you will invoke and the data size of each copy.
When the element size is small and the number of elements is large, copying only the changed elements individually from host to device by invoking the copy API thousands of times is definitely not a good idea.
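Regarding the CUDA path specifically: cudaMemcpy() can only target a GL vertex buffer after it has been mapped through the CUDA-OpenGL interop API, which also makes the "one large copy instead of thousands of small ones" advice concrete. A minimal C sketch; vbo, hostData, and uploadBytes are illustrative, and error checking is omitted:

#include <cuda_gl_interop.h>

/* One-time: register the existing GL vertex buffer object with CUDA. */
cudaGraphicsResource_t resource;
cudaGraphicsGLRegisterBuffer(&resource, vbo, cudaGraphicsRegisterFlagsWriteDiscard);

/* Per frame: map, do ONE large copy instead of thousands of small ones, unmap. */
cudaGraphicsMapResources(1, &resource, 0);
void*  devPtr = NULL;
size_t mappedBytes = 0;
cudaGraphicsResourceGetMappedPointer(&devPtr, &mappedBytes, resource);
cudaMemcpy(devPtr, hostData, uploadBytes, cudaMemcpyHostToDevice);
cudaGraphicsUnmapResources(1, &resource, 0);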

How to get a list (or subset) from an OpenCL Kernel?

I have a large array with 2^20 ulongs in it. This little OpenCL kernel flows through it like a charm. Yet I have absolutely no idea (and Google hasn't helped here) how to return a small number of items (2^10) from it.
What I'm looking for is a fixed-size list with at most 1024 items that have a Hamming distance (popcount) smaller than a given number. The list order doesn't matter, so perhaps I should be asking for a subset of these 2^20 items.
Since the output is expected to be much smaller than the input, using a global index into the output through atomic access will not be too inefficient. You need to pass a buffer containing a single uint, initially set to 0:
__kernel void K(..., __global uint * outIndex, ...)
{
    ...
    if (selected)
    {
        uint index = atomic_inc(outIndex); // or atom_inc if using the OpenCL 1.0 extension
        out[index] = value;
    }
}
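On the host side, the counter buffer is created pre-initialized to zero, and after the kernel runs you read back the counter first so you know exactly how many results to fetch. A C sketch under those assumptions; ctx, queue, kernel, outBuf, globalSize, and results are illustrative names, and error checking is omitted:

cl_uint zero = 0;
cl_int  err;
/* Counter buffer, initialized to 0 before the kernel runs. */
cl_mem counterBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof(cl_uint), &zero, &err);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &counterBuf); /* argument index is illustrative */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

/* Read back only the count, then exactly that many results. */
cl_uint count = 0;
clEnqueueReadBuffer(queue, counterBuf, CL_TRUE, 0, sizeof(cl_uint), &count, 0, NULL, NULL);
clEnqueueReadBuffer(queue, outBuf, CL_TRUE, 0, count * sizeof(cl_ulong), results, 0, NULL, NULL);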
A list as such is not supported in OpenCL. OpenCL is a kind of standard C with some extensions and some limitations. You can only operate on buffers (i.e., arrays).
What you might look for is a global memory buffer which you need to allocate before you run the kernel. You can put your results into it, and with clEnqueueReadBuffer you can retrieve them.
Well, there is a way, through some hacks. I forked pyopencl.algorithm and created a new method, sparse_copy_if(), that returns the exact-sized buffer I need, as if it were a list that items had been appended to. I will document it and submit a patch to Andreas.
If your buffers are too large, though, there is a way to improve performance even more: I followed Rick's suggestion above, created a hash table, and threw the desired results in there. (Note that there is always a risk of collision, so the hash table buffer/array has to be orders of magnitude larger than your expected output.)
Then, I run sparse_copy_if() on the hash table buffer and receive nothing but a perfectly-sized buffer.
In conclusion:
I have a kernel scanning a 1,000,000-element buffer. It computes results for all of them but doesn't separate out the results I want.
These desired results are then thrown into a ~25,000-element buffer (the hash table, significantly smaller than the original data).
Then, by running sparse_copy_if() on the hash table buffer, you get the desired output, almost as if it were a list to which items had been appended.
sparse_copy_if(), of course, has the overhead of creating the perfectly-sized buffers and copying data to them. But I've found that this overhead generally pays off, as you are now making (low-latency) transfers of small buffers/arrays from device back to host.
Code for testing sparse_copy_if() performance versus copy_if().