Will OpenGL release part of a BO after glInvalidateBufferSubData()?

For example, I have a BO called bo_vertex holding 4*4 vertices. First I render all the vertices, then I only need the first four vertices for the next frame, so I invalidate the data after the first four vertices. What will OpenGL do with the part specified in glInvalidateBufferSubData()? If I need a BO for 16 vertices again, can I reuse bo_vertex directly, or should I call glBufferData() to reallocate the storage for bo_vertex?

It's worth reading the original extension specification for this ARB_invalidate_subdata. Invalidating a buffer object (or a region of it) as well as a framebuffer object attachment or a texture region will not "shrink" the size of the respective resource. It is only for signaling to the OpenGL implementation that a particular (sub)region of that resource is not to survive across multiple memory barriers that the OpenGL driver inserts automatically between OpenGL calls writing to and/or reading from those resources. It is very much like the explicit Vulkan memory barriers, however OpenGL was always thought to do this automatically and in most cases OpenGL had to be conservative when guessing whether the user still needed a resource across certain boundaries (like a swapbuffers).
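With ARB_invalidate_subdata it is intended that a user/client application can hint to the OpenGL driver that certain memory barriers do not need to be inserted into the command stream to e.g. move memory from on-chip framebuffer or local memory to other memory regions intended for different accesses to those resources. So bo_vertex keeps its full storage after the call; only the contents of the invalidated range become undefined, and you can write 16 vertices into it again without calling glBufferData().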
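To make the "no shrinking" point concrete, here is a toy model in plain C. It is not GL code: ModelBuffer and the model_* functions are hypothetical names that only mimic the bookkeeping one would expect from glBufferData(), glInvalidateBufferSubData() and a full rewrite of the store. It shows that invalidation changes which bytes have defined contents, not how many bytes are allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a buffer object's data store (hypothetical, not a GL type). */
typedef struct {
    size_t size;            /* allocated size in bytes */
    size_t undefined_begin; /* start of the range whose contents are undefined */
    size_t undefined_end;   /* one past the end of that range */
} ModelBuffer;

/* Models glBufferData(target, size, NULL, usage): new store, contents undefined. */
ModelBuffer model_buffer_data(size_t size) {
    ModelBuffer b = { size, 0, size };
    return b;
}

/* Models rewriting the whole store: every byte is defined again. */
void model_fill(ModelBuffer *b) {
    b->undefined_begin = 0;
    b->undefined_end = 0;
}

/* Models glInvalidateBufferSubData(buffer, offset, length). */
void model_invalidate_sub_data(ModelBuffer *b, size_t offset, size_t length) {
    assert(offset + length <= b->size);
    b->undefined_begin = offset;
    b->undefined_end = offset + length;
    /* b->size is deliberately left untouched: nothing is released. */
}

/* True if the byte lies inside the store and its contents are defined. */
bool model_is_defined(const ModelBuffer *b, size_t byte) {
    bool in_undefined = byte >= b->undefined_begin && byte < b->undefined_end;
    return byte < b->size && !in_undefined;
}
```

With 16 vertices of an assumed 16 bytes each, invalidating everything after the first four vertices would be model_invalidate_sub_data(&bo, 4 * 16, 12 * 16); bo.size stays at 256, and a later model_fill(&bo) makes the whole range usable again.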

Related

What does ID3D11Device::CreateBuffer do under the hood?

I know this function creates a "buffer." But what exactly is a buffer? Is it a COM object in memory? If it is, then in my understanding, this function takes in a descriptor and some initial data to create this COM object in memory, and then sets the ID3D11Buffer pointer pointed to by the input ID3D11Buffer** to the interface in the newly created COM object. Once the COM object is created, the initializing data is not needed any more and we can delete it. And once we call ID3D11Buffer::Release(), the underlying COM object will be destroyed. Is my understanding correct?
CreateBuffer returns a COM interface object ID3D11Buffer*. As long as it has a non-zero reference count (it starts at 1; each call to AddRef adds 1, each call to Release subtracts 1) then whatever resources it controls are active.
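The counting rule can be sketched with a small stand-in in plain C. This is not the Direct3D API: ModelComBuffer and the model_* functions are hypothetical and only mirror how AddRef and Release adjust the count and when the controlled resource goes away.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted COM object. */
typedef struct {
    unsigned long ref_count; /* number of outstanding references */
    bool resource_alive;     /* whether the controlled resource still exists */
} ModelComBuffer;

/* Models CreateBuffer: the new object starts with one reference. */
ModelComBuffer model_create_buffer(void) {
    ModelComBuffer b = { 1, true };
    return b;
}

/* Models AddRef: adds one reference and returns the new count. */
unsigned long model_add_ref(ModelComBuffer *b) {
    assert(b->resource_alive);
    return ++b->ref_count;
}

/* Models Release: drops one reference; the last one frees the resource. */
unsigned long model_release(ModelComBuffer *b) {
    assert(b->ref_count > 0);
    if (--b->ref_count == 0) {
        b->resource_alive = false;
    }
    return b->ref_count;
}
```

So a single Release() destroys the object only if nobody else has called AddRef() on it in the meantime.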
As to where exactly the resources are allocated, it really depends. You may find this article interesting as it covers different ways Direct3D allocates resources.
UPDATE: You should also read this Microsoft Docs introduction to the subset of COM used by DirectX.
In the general case, a buffer is a contiguous, managed area of memory.
Memory is a large set of addresses of read/writable elements (one element per address, of course), say 2^30 addresses of 8-bit elements make a 1GiB memory.
If there is only a single program and it uses these addresses statically (e.g. addresses from 0x1000 to 0x2000 are used to store the list of items) then memory doesn't need to be managed and in this context a buffer is just a contiguous range of addresses.
However, if there are multiple programs or a program's memory usage is dynamic (e.g. it depends on how many items it's been asked to read from input) then memory must be managed.
You must keep track of which ranges are already in use and which are not. So a buffer becomes a contiguous range of addresses with their attributes (e.g. whether it's in use or not).
The attributes of a buffer can vary a lot between the different memory allocators. In general, we say that a buffer is managed because we let the memory allocator handle it: find a suitable free range, mark it used, tell it if it can move the buffer afterward, mark it free when we are finished.
This is true for every memory that is shared, so it is certainly true for the main memory (RAM) and the graphics memory.
The latter is the memory inside the graphics card, which is accessed just like the main memory (from the CPU's point of view).
What CreateBuffer returns is a COM object in the main memory that contains the metadata necessary to handle the buffer just allocated.
It doesn't contain the buffer itself because this COM object is always in main memory while the buffer usually is not (it is in the graphics memory).
CreateBuffer asks the graphics driver to find a suitable range of free addresses, in the memory asked for, and fill in some metadata.
Before the CPU can access the main memory it is necessary to set up some metadata tables (the page tables) as part of its protection mechanism.
This is also true if the CPU needs to access the graphics memory (with possibly a few extra steps, for managing the MMIO if necessary).
The GPU also has page tables, so if the main memory has to be accessed by the GPU these page tables must also be set up.
You see that it's important to know how the buffer will be used.
Another thing to consider is that GPUs use highly optimized memory formats - for example, the buffer used for a surface can be pictured as a rectangular area of memory.
The same is true for the buffer used by a texture.
However the two are stored differently: the surface is stored linearly, one row after another, while the texture buffer is tiled (it's as if it were made of many, say, 16x16 surfaces stored linearly one after the other).
This makes sampling and filtering faster.
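As a rough illustration of the difference, the two layouts can be written as index computations. This is only a sketch: the 16x16 tile size comes from the example above, the row-major order of tiles and of texels inside a tile is an assumption, and real GPUs use vendor-specific (often more elaborate) tilings. The function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 16 }; /* tile edge length in texels, taken from the example above */

/* Linear layout: rows stored one after another. */
size_t linear_index(size_t x, size_t y, size_t width) {
    return y * width + x;
}

/* Tiled layout: the image is cut into TILE x TILE blocks; each block is
 * stored linearly, and the blocks themselves follow each other row by row. */
size_t tiled_index(size_t x, size_t y, size_t width) {
    assert(width % TILE == 0); /* keep the sketch simple: whole tiles only */
    size_t tiles_per_row = width / TILE;
    size_t tile = (y / TILE) * tiles_per_row + (x / TILE);
    size_t within_tile = (y % TILE) * TILE + (x % TILE);
    return tile * (TILE * TILE) + within_tile;
}
```

In a 64-texel-wide image, the texel directly below (0, 0) is 64 elements away in the linear layout but only 16 elements away in the tiled one, which is why a small 2D neighbourhood tends to stay close together in memory.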
Also, some GPUs may need to have texture images in a specific area of memory, vertex buffers in another and so on.
So it's important to give the graphics driver all the information it needs to make the best choice when allocating a buffer.
Once the buffer has been found, the driver (or the D3D runtime) will initialize the buffer if requested.
It can do this by copying the data or by aliasing through the page tables (if the pitch allows for it), possibly using some form of Copy-On-Write.
However it does that, the source data are not needed anymore (see this).
The COM object returned by CreateBuffer is a convenient proxy: when it is disposed of, through the usual COM AddRef/Release mechanism, it also asks the graphics driver to deallocate the buffer.

How do UBOs/SSBOs differ from Vulkan's shader memory bindings?

In the article on Imagination's website, I've read the following paragraph:
For example, there are no glUniform*() equivalent entry points in Vulkan; instead, writing to GPU memory is the only way to pass data to shaders.
When you call glUniform*(), the OpenGL ES driver typically needs to allocate a driver managed buffer and copy data to it, the management of which incurs CPU overhead. In Vulkan, you simply map the memory address and write to that memory location directly.
Is there any difference between that and using Uniform Buffers? They are also allocated explicitly and can carry arbitrary data. Since Uniform Buffers are quite limited in size, perhaps Shader Storage Buffers are a better analogy.
From what I understand, this is not glUniform*() specific: glUniform*() is merely an example used by the author of the article to illustrate the way Vulkan works with regards to communication between the host and the GPU.
When you call glUniform*(), the OpenGL ES driver typically needs to allocate a driver managed buffer and copy data to it, the management of which incurs CPU overhead.
In this scenario, when a user calls glUniform*() with some data, that data is first copied to a buffer owned by the OpenGL implementation. This buffer is probably pinned, and can then be used by the driver to transfer the data through DMA to the device. That's two steps:
Copy user data to driver buffer;
Transfer buffer contents to GPU through DMA.
In Vulkan, you simply map the memory address and write to that memory location directly.
In this scenario, there is no intermediate copy of the user data. You ask Vulkan to map a region into the host's virtual address space, which you directly write to. The data gets to the device through DMA in a completely transparent way for the user.
From a performance standpoint, the benefits are obvious: zero copy. It also means the Vulkan implementation can be simpler, as it does not need to manage an intermediate buffer.
As the specs have not been released yet, here's a fictitious example of what it could look like:
// Assume Lights is some kind of handle to your buffer/data
float4* lights = vkMap(Lights);
for (int i = 0; i < light_count; ++i) {
    // Goes directly to the device
    lights[i] = make_light(/* stuff */);
}
vkUnmap(lights);

Passing buffer memory mapped pointer to glTex(Sub)Image2D. Is texture upload asynchronous?

Suppose I map a buffer, with
map_ptr = glMapBuffer (..) (The target shouldn't matter, but let's say it's GL_TEXTURE_BUFFER)
Next I upload texture data with:
glTexImage2D(..., map_ptr), passing map_ptr as my texture data. (I don't have a GL_PIXEL_UNPACK_BUFFER bound)
Semantically, this involves copying the data from the buffer's data store to the texture object's data store, and the operation can be accomplished with a GPU DMA copy.
But what actually happens? Is the data copied entirely on the GPU, or does the CPU read and cache the mapped memory, and then write back to GPU at a separate GPU memory location? I.e. is the copy asynchronous, or does the CPU synchronously coordinate the copy, utilizing CPU cycles?
Is the answer to that implementation dependent? Does it depend on whether the OpenGL driver is intelligent enough to recognize that the data pointer passed to glTexImage2D is a GPU memory mapped pointer, and that a round-trip to the CPU is unnecessary? If so, how common is this feature in prevalent drivers today?
Also, what about the behaviour for an OpenCL buffer whose memory was mapped, i.e:
map_ptr = clEnqueueMapBuffer(..) (OpenCL buffer mapped memory)
and map_ptr was passed to glTexImage2D?
What you do there is simply undefined behavior as per the spec.
Pointer values returned by MapBufferRange may not be passed as parameter
values to GL commands. For example, they may not be used to specify array
pointers, or to specify or query pixel or texture image data; such
actions produce undefined results, although implementations may not
check for such behavior for performance reasons.
Let me quote from the GL_ARB_vertex_buffer_object extension spec, which originally introduced buffer objects and mapping operations (emphasis mine):
Are any GL commands disallowed when at least one buffer object is mapped?
RESOLVED: NO. In general, applications may use whatever GL
commands they wish when a buffer is mapped. However, several
other restrictions on the application do apply: the
application must not attempt to source data out of, or sink
data into, a currently mapped buffer. Furthermore, the
application may not use the pointer returned by Map as an
argument to a GL command.
(Note that this last restriction is unlikely to be enforced in
practice, but it violates reasonable expectations about how
the extension should be used, and it doesn't seem to be a very
interesting usage model anyhow. Maps are for the user, not
for the GL.)

OpenGL Texture and Object Streaming

I have a need to stream a texture (essentially a camera feed).
With object streaming, the following scenarios seem to arise:
1. Is the new object's data store larger, smaller or the same size as the old one?
2. Is a subset of the texture or the whole texture being updated?
3. Are we streaming a buffer object or texture object (any difference?)
Here are the following approaches I have come across:
1. Allocate the object data store (either BufferData for buffers or TexImage2D for textures) and then, each frame, update a subset of the data with BufferSubData or TexSubImage2D.
2. Nullify/invalidate the object after the last call (e.g. draw) that uses the object, either with:
   Nullify: glTexSubImage2D( ..., NULL), glBufferSubData( ..., NULL)
   Invalidate: glInvalidateBufferData(), glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT, glDeleteTextures ?
3. Simply reinvoke BufferData or TexImage2D with the new data.
4. Manually implement object multi-buffering / buffer ping-ponging.
Most immediately, my problem scenario is: the entire texture is being replaced with a new one of the same size. How do I implement this? Will (1) implicitly synchronize? Does (2) avoid the synchronization? Will (3) synchronize, or will a new data store for the object be allocated, where our update can be uploaded without waiting for all drawing using the old object state to finish? This passage from the Red Book V4.3 makes me believe so:
Data can also be copied between buffer objects using the
glCopyBufferSubData() function. Rather than assembling chunks of data
in one large buffer object using glBufferSubData(), it is possible to
upload the data into separate buffers using glBufferData() and then
copy from those buffers into the larger buffer using
glCopyBufferSubData(). Depending on the OpenGL implementation, it may
be able to overlap these copies because each time you call
glBufferData() on a buffer object, it invalidates whatever contents
may have been there before. Therefore, OpenGL can sometimes just
allocate a whole new data store for your data, even though a copy
operation from the previous store has not completed yet. It will then
release the old storage at a later opportunity.
But if so, why the need for (2) [nullify/invalidate]?
Also, please discuss the above approaches, and others, and their effectiveness for the various scenarios, while keeping in mind at least the following issues:
Whether implicit synchronization to object (ie. synchronizing our update with OpenGL's usage) occurs
Memory usage
Speed
I've read http://www.opengl.org/wiki/Buffer_Object_Streaming but it doesn't offer conclusive information.
Let me try to answer at least a few of the questions you raised.
The scenarios you talk about can have a great impact on the performance of the different approaches, especially when considering the first point about the dynamic size of the buffer. In your scenario of video streaming, the size will rarely change, so a more expensive "re-configuration" of the data structures you use might be possible. If the size changes every frame or every few frames, this is typically not feasible. However, if a reasonable maximum size limit can be enforced, just using buffers/textures with the maximum size might be a good strategy. With neither buffers nor textures do you have to use all the space there is (although there are some smaller issues when you do this with textures, like wrap modes).
3. Are we streaming a buffer object or texture object (any difference?)
Well, the only way to efficiently stream image data to or from the GL is to use pixel buffer objects (PBOs). So you always have to deal with buffer objects in the first place, no matter if vertex data, image data or whatever data is to be transferred. The buffer is just the source for some glTex*Image() call in the texture case, and of course you'll need a texture object for that.
Let's come to your approaches:
In approach (1), you use the "Sub" variant of the update commands. In that case, (parts of or the whole) storage of the existing object is updated. This is likely to trigger an implicit synchronization if old data is still in use. The GL has basically only two options: wait for all operations (potentially) depending on that data to complete, or make an intermediate copy of the new data and let the client go on. Both options are not good from a performance point of view.
In approach (2), there is a misconception. The "Sub" variants of the update commands will never invalidate/orphan your buffers. The "non-sub" glBufferData() will create a completely new storage for the object, and using it with NULL as data pointer will leave that storage uninitialized. Internally, the GL implementation might re-use some memory which was in use for earlier buffer storage. So if you follow this scheme, there is some probability that you effectively end up using a ring-buffer of the same memory areas if you always use the same buffer size.
The other methods for invalidation you mentioned also allow you to invalidate parts of the buffer and give more fine-grained control over what is happening.
Approach (3) is basically the same as (2) with the glBufferData() orphaning, but you just specify the new data directly at this stage.
Approach (4) is the one I actually would recommend, as it is the one which gives the application the most control over what is happening, without having to rely on the GL implementation's specific internal workings.
Without taking synchronization into account, the "sub" variant of the update commands is more efficient, even if the whole data storage is to be changed, not just some part. That is because the "non-sub" variants of the commands basically recreate the storage and introduce some overhead with this. With manually managing the ring buffers, you can avoid any of that overhead, and you don't have to rely on the GL to be clever, by just using the "sub" variants of the update functions. At the same time, you can avoid implicit synchronization by only updating buffers which aren't in use by the GL any more. This scheme can also nicely be extended into a multi-threaded scenario. You can have one (or several) extra threads with separate (but shared) GL contexts to fill the buffers for you, and just pass the buffer handles to the draw thread as soon as the update is complete. You can also just map the buffers in the draw thread and let them be filled by worker threads (without the need for additional GL contexts at all).
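The bookkeeping behind approach (4) can be sketched without any GL calls. The following is a toy model, assuming a ring of three buffers: StreamRing and the ring_* functions are hypothetical names, and the in_flight flags stand in for what a real application would track with sync objects (for example glFenceSync after the command that reads a buffer, and glClientWaitSync before overwriting it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { RING_SIZE = 3 }; /* assumed number of buffers in the ring */

typedef struct {
    bool in_flight[RING_SIZE]; /* true while the GL may still read the slot */
    size_t next;               /* slot to try for the next upload */
} StreamRing;

void ring_init(StreamRing *r) {
    for (size_t i = 0; i < RING_SIZE; ++i) {
        r->in_flight[i] = false;
    }
    r->next = 0;
}

/* Returns a slot that is safe to overwrite with a "sub" update, or -1 if the
 * oldest slot is still in use (a real application would wait or skip a frame). */
int ring_acquire(StreamRing *r) {
    size_t slot = r->next;
    if (r->in_flight[slot]) {
        return -1;
    }
    r->next = (slot + 1) % RING_SIZE;
    return (int)slot;
}

/* Call after issuing the GL command that reads the slot (fence goes here). */
void ring_submit(StreamRing *r, int slot) {
    assert(slot >= 0 && slot < RING_SIZE);
    r->in_flight[slot] = true;
}

/* Call once the fence for that slot has been signalled. */
void ring_retire(StreamRing *r, int slot) {
    assert(slot >= 0 && slot < RING_SIZE);
    r->in_flight[slot] = false;
}
```

Per frame the application would acquire a slot, fill that buffer, issue the glTexSubImage2D() or draw call that reads it, and submit it; slots are retired as their fences signal. A larger ring delays reuse further, which trades latency for fewer stalls.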
OpenGL 4.4 introduced GL_ARB_buffer_storage and with it came the GL_MAP_PERSISTENT_BIT for glMapBufferRange. That will allow you to keep all of the buffers mapped while they are used by the GL - so it allows you to avoid the overhead of mapping the buffers into the address space again and again. You then will have no implicit synchronization at all - but you have to synchronize the operations manually. OpenGL's synchronization objects (see GL_ARB_sync) might help you with that, but the main burden of synchronization is on your application's logic itself. When streaming videos to the GL, just avoid re-using the buffer which was the source for the glTexSubImage() call immediately and try to delay its re-use as long as possible. You are of course also trading throughput for latency. If you need to minimize latency, you might have to tweak this logic a bit.
Comparing the approaches for "memory usage" is really hard. There are a lot of implementation specific details to consider here. A GL implementation might keep some old buffer memory around for some time to fulfill recreation requests of the same size. Also, a GL implementation might make shadow copies of any data at any time. The approaches which don't orphan and recreate storage all the time in principle expose more control of the memory which is in use.
"Speed" itself is also not a very useful metric. You basically have to balance throughput and latency here, according to the requirements of your application.

What is the purpose of OpenGL texture buffer objects?

We use buffer objects for reducing copy operations from CPU-GPU and for texture buffer objects we can change target from vertex to texture in buffer objects. Is there any other advantage here of texture buffer objects? Also, it does not allow filtering, is there any disadvantage of this?
A buffer texture is similar to a 1D-texture but has a backing buffer store that's not part of the texture object (in contrast to any other texture object) but realized with an actual buffer object bound to TEXTURE_BUFFER. Using a buffer texture has several implications and, AFAIK, one use-case that can't be mapped to any other type of texture.
Note that a buffer texture is not a buffer object - a buffer texture is merely associated with a buffer object using glTexBuffer.
By comparison, buffer textures can be huge. Table 23.53 and following of the core OpenGL 4.4 spec defines a minimum maximum (i.e. the minimal value that implementations must provide) number of texels MAX_TEXTURE_BUFFER_SIZE. The potential number of texels being stored in your buffer object is computed as follows (as found in GL_ARB_texture_buffer_object):
floor(<buffer_size> / (<components> * sizeof(<base_type>)))
The resulting value clamped to MAX_TEXTURE_BUFFER_SIZE is the number of addressable texels.
Example:
You have a buffer object storing 4MiB of data. What you want is a buffer texture for addressing RGBA texels, so you choose an internal format RGBA8. The addressable number of texels is then
floor(4MiB / (4 * sizeof(UNSIGNED_BYTE))) == 1024^2 texels == 2^20 texels
If your implementation supports this number, you can address the full range of values in your buffer object. The above isn't too impressive and can simply be achieved with any other texture on current implementations. However, the machine on which I'm writing this answer supports 2^28 == 268435456 texels.
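The computation above can be written as a small helper. This is a sketch only: the function name and parameters are made up for illustration, and the limit passed as the last argument is whatever the implementation reports for MAX_TEXTURE_BUFFER_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Number of addressable texels in a buffer texture:
 * floor(buffer_size / (components * sizeof(base_type))),
 * clamped to the implementation's MAX_TEXTURE_BUFFER_SIZE. */
size_t buffer_texture_texels(size_t buffer_size,
                             size_t components,
                             size_t base_type_size,
                             size_t max_texture_buffer_size) {
    assert(components > 0 && base_type_size > 0);
    /* Unsigned integer division already rounds down, i.e. it is the floor. */
    size_t texels = buffer_size / (components * base_type_size);
    return texels < max_texture_buffer_size ? texels : max_texture_buffer_size;
}
```

For the 4MiB RGBA8 example, buffer_texture_texels((size_t)4 << 20, 4, 1, 268435456) gives 1048576 texels; with a limit of only 65536 texels the same call would be clamped to 65536.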
With OpenGL 4.4 (and 4.3 and possibly with earlier 4.x versions), the minimum required MAX_TEXTURE_SIZE is 2^14 == 16384 texels per 1D-texture, while the minimum required MAX_TEXTURE_BUFFER_SIZE is 2^16 == 65536 texels, so a buffer texture can still be 4 times as large. On my local machine I can allocate a 2GiB buffer texture (even larger actually), but only a 1GiB 1D-texture when using RGBA32F texels.
A use-case for buffer textures is random (and atomic, if desired) read-/write-access (the latter via image load/store) to a large data store inside a shader. Yes, you can do random read-access on arrays of uniforms inside one or multiple blocks but it gets very tedious if you have to process a lot of data and have to work with multiple blocks and even then, looking at the maximum combined size of all uniform components (where a single float component has a size of 4 bytes) in all uniform blocks for a single stage,
MAX_(stage)_UNIFORM_BLOCKS *
MAX_UNIFORM_BLOCK_SIZE +
MAX_(stage)_UNIFORM_COMPONENTS * 4
isn't really a lot of space to work with in a shader stage (depending on how large your implementation allows the above number to be).
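To get a feeling for the magnitude, the expression can be evaluated for concrete limits. The helper below is a sketch with a made-up name; the values in the usage note are hypothetical and not taken from any particular implementation, so query the real limits with glGetIntegerv before drawing conclusions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combined uniform space for one shader stage, in bytes:
 * MAX_(stage)_UNIFORM_BLOCKS * MAX_UNIFORM_BLOCK_SIZE
 *   + MAX_(stage)_UNIFORM_COMPONENTS * 4 */
size_t max_uniform_bytes_per_stage(size_t max_uniform_blocks,
                                   size_t max_uniform_block_size,
                                   size_t max_uniform_components) {
    /* Guard against overflow of the first product. */
    assert(max_uniform_block_size == 0 ||
           max_uniform_blocks <= SIZE_MAX / max_uniform_block_size);
    return max_uniform_blocks * max_uniform_block_size
         + max_uniform_components * 4;
}
```

With hypothetical limits of 14 blocks, 65536 bytes per block and 4096 default-block components, this comes to 933888 bytes, i.e. less than 1MiB per stage.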
An important difference between textures and buffer textures is that the data store, as a regular buffer object, can be used in operations where a texture simply does not work. The extension mentions:
The use of a buffer object to provide storage allows the texture data to
be specified in a number of different ways: via buffer object loads
(BufferData), direct CPU writes (MapBuffer), framebuffer readbacks
(EXT_pixel_buffer_object extension). A buffer object can also be loaded
by transform feedback (NV_transform_feedback extension), which captures
selected transformed attributes of vertices processed by the GL. Several
of these mechanisms do not require an extra data copy, which would be
required when using conventional TexImage-like entry points.
An implication of using buffer textures is that look-ups inside a shader can only be done via texelFetch. Buffer textures also aren't mip-mapped and, as you already mentioned, during fetches there is no filtering.
Addendum:
Since OpenGL 4.3, we have what is called a Shader Storage Buffer. These too provide random (atomic) read-/write-access to a large data store but don't need to be accessed with texelFetch() or image load/store functions as is the case for buffer textures. Using buffer textures also implies having to deal with gvec4 return values, both with texelFetch() and imageLoad() / imageStore(). This becomes very tedious as soon as you want to work with structures (or arrays thereof) and you don't want to think of some stupid packing scheme using multiple instances of vec4 or using multiple buffer textures to achieve something similar. With a buffer accessed as shader storage, you can simply index into the data store and pull one or more instances of some struct {} directly from the buffer.
Also, since they are very similar to uniform blocks, using them should be fairly straightforward - if you know how to use uniform buffers, you don't have a long way to go to learn how to use shader storage buffers.
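To show what such a packing scheme looks like, here is a CPU-side sketch in C. Everything in it is hypothetical: the Light struct, the choice of two RGBA32F texels per element, and the function names. The unpack function mirrors what a shader would have to do with two texelFetch() calls per element; with a shader storage buffer, the struct could be declared and indexed directly instead.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z, w; } Vec4; /* one RGBA32F texel */

/* Hypothetical per-element record we would like to read in a shader. */
typedef struct {
    float position[3];
    float radius;
    float color[4];
} Light;

enum { TEXELS_PER_LIGHT = 2 }; /* each Light is spread over two texels */

/* Packs one Light into two consecutive texels. */
void pack_light(const Light *l, Vec4 out[TEXELS_PER_LIGHT]) {
    out[0] = (Vec4){ l->position[0], l->position[1], l->position[2], l->radius };
    out[1] = (Vec4){ l->color[0], l->color[1], l->color[2], l->color[3] };
}

/* Reassembles Light number `index` from the texel array, the way a shader
 * would after fetching texels index * 2 and index * 2 + 1. */
Light unpack_light(const Vec4 *texels, size_t index) {
    const Vec4 *t = texels + index * TEXELS_PER_LIGHT;
    Light l = {
        { t[0].x, t[0].y, t[0].z },
        t[0].w,
        { t[1].x, t[1].y, t[1].z, t[1].w }
    };
    return l;
}
```

Every field added to the struct means revisiting both the packing and the unpacking code, which is the tedium that shader storage buffers avoid.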
It's also absolutely worth browsing the Issues section of the corresponding ARB extension.
Performance Implications
Daniel Rakos did some performance analysis years ago, both as a comparison of uniform buffers and buffer textures, and also on a little more general note based on information from AMD's OpenCL programming guide. There is now a very recent version, specifically targeting OpenCL optimization on AMD platforms.
There are many factors influencing performance:
access patterns and resulting caching behavior
cache line sizes and memory layout
what kind of memory is accessed (registers, local, global, L1/L2 etc.) and its respective memory bandwidth
how well memory fetching latency is hidden by doing something else in the meantime
what kind of hardware you're on, i.e. a dedicated graphics card with dedicated memory or some unified memory architecture
etc., etc.
As always when worrying about performance: implement something that works and see if that solution is fast enough for your needs. Otherwise, implement two or more approaches to solving the problem, profile them and compare.
Also, vendor specific guides can offer a great deal of insight. The above mentioned OpenCL user and optimization guides provide a high-level architectural perspective and specific hints on how to optimize your CL kernels - stuff that's also relevant when developing shaders.
One use case I have found was to store per-primitive attributes (accessed in the fragment shader with the help of gl_PrimitiveID) while still maintaining unique vertices in the indexed mesh.