Is there a Vulkan equivalent to OpenGL's pixel buffer object?

I did some Googling and it does not appear that Vulkan has a Pixel Buffer Object. Is there something analogous to it in the Vulkan API?

OpenGL doesn't "have a Pixel Buffer Object" either. What OpenGL has is memory, aka buffer objects. One of the uses of buffer objects is as the source/destination for pixel transfer operations; when a buffer object is used, those transfers can execute asynchronously. While this usage is commonly called a "pixel buffer object", it's not a special kind of object. It's just using OpenGL-allocated memory to perform an asynchronous copy of image data into/out of a buffer object.
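For context, a minimal sketch of that usage on the OpenGL side: an asynchronous readback through a buffer object bound to GL_PIXEL_PACK_BUFFER (width, height, and the point where the buffer is eventually mapped are assumptions for the example):

// Reserve a pack buffer once.
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, nullptr, GL_STREAM_READ);

// With a pack buffer bound, the last argument is an offset, not a pointer,
// and glReadPixels can return without waiting for the transfer to finish.
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);

// ... do other CPU work while the copy proceeds ...

// Mapping later is where any waiting happens.
const void* pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
// ... consume pixels ...
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);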
OpenGL needs a special system for that because it is inherently a synchronous API. By contrast, almost nothing in Vulkan is synchronous. So Vulkan doesn't need a special system for doing it.
vkCmdCopyImageToBuffer is a Vulkan command, as identified by its vkCmd prefix. As such, it is not executed immediately; such commands are stored in Vulkan command buffers, to be executed by the GPU asynchronously.
Vulkan doesn't have a special system for doing asynchronous pixel copies because Vulkan operations are by default asynchronous. And unlike OpenGL, it will not try to hide this from you.
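For illustration, a minimal sketch of recording such a copy (the command buffer, the image in VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, and the destination buffer are assumed to be set up already; all names are placeholders):

// Recorded into a command buffer; nothing executes until the buffer is
// submitted to a queue, and even then it runs asynchronously to the CPU.
VkBufferImageCopy region{};
region.bufferOffset = 0;
region.bufferRowLength = 0; // 0 = tightly packed
region.bufferImageHeight = 0;
region.imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
region.imageSubresource.mipLevel = 0;
region.imageSubresource.baseArrayLayer = 0;
region.imageSubresource.layerCount = 1;
region.imageOffset = {0, 0, 0};
region.imageExtent = {width, height, 1};

vkCmdCopyImageToBuffer(cmdBuf, image, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       dstBuffer, 1, &region);
// Later: vkQueueSubmit(...), then wait on a fence before reading dstBuffer.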

Related

GPU Memory Management in OpenGL and DirectX 12

I am currently improving my knowledge of OpenGL and DirectX 12 in order to create graphics applications with both APIs. I have studied several tutorials, but I still do not completely understand how memory is managed on the GPU side.
In OpenGL (my application runs an OpenGL 3.3 context), the frame buffers are created implicitly, so I assume that they are also freed implicitly by the API. In my example program, I created vertex and index buffers using glGenBuffers and uploaded them to the GPU using glBufferData. If I want to update my vertex buffer every frame, I can simply do this using glBufferSubData. Let's assume instead that I want to re-upload my vertex buffer every frame using glBufferData. According to the OpenGL documentation, this function creates and initializes the buffer's data store on the GPU. So I assume that the GPU memory mapped to this VBO is reused after another call to glBufferData in the next frame.
In DirectX 12, the frame buffers must be created by the graphics programmer. They are managed and reused by the swap chain during the lifetime of the program. In my DirectX 12 test program, I also create vertex and index buffers using upload heaps and the ID3D12Device::CreateCommittedResource function; for testing purposes, I do this every frame as well. The buffers are stored in Microsoft::WRL::ComPtr<ID3D12Resource> variables. At the end of the render method, the use count of those buffer pointers should hit 0, which frees the memory behind them on the CPU side. Nevertheless, I do not understand what happens to the data and the underlying heap on the GPU side. Are they released whenever the buffer pointer's use count hits 0, do they need to be freed manually, are they discarded by the GPU when reaching the fence, or none of the above?
I would really appreciate it if you could provide some clarification on this topic and my assumptions.
Could you also please explain whether and how GPU data needs to be freed by the graphics programmer?
Best regards.
For DirectX 12, it uses the same lifetime model as previous versions of Direct3D: The object is kept alive until the reference count hits 0. It's then eligible for destruction. The exact time of cleanup is up to the driver/runtime as it typically does 'delayed destruction' (it actually has both an 'internal' and 'external' reference count, and both have to be 0 before it's really eligible for destruction).
See Microsoft Docs.
You should watch this YouTube video if you want an overview of the underlying details.
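As a sketch of the practical consequence (commandQueue, fence, nextFenceValue, and upload are placeholder names assumed to exist, not part of the API's lifetime rules): since releasing the last reference only makes the object eligible for destruction, a common pattern is to hold the ComPtr until a fence confirms the GPU is done with it.

#include <d3d12.h>
#include <wrl/client.h>
#include <vector>

struct PendingUpload {
    Microsoft::WRL::ComPtr<ID3D12Resource> resource;
    UINT64 fenceValue; // queue fence value signaled after the resource's last use
};
std::vector<PendingUpload> pending;

// After submitting command lists that read from `upload`:
commandQueue->Signal(fence.Get(), ++nextFenceValue);
pending.push_back({ upload, nextFenceValue });

// Each frame: drop references whose fence value has been reached. Only then
// can the runtime/driver destroy the GPU allocation, at its leisure
// (the 'delayed destruction' mentioned above).
const UINT64 completed = fence->GetCompletedValue();
std::erase_if(pending, [&](const PendingUpload& p) { return p.fenceValue <= completed; });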

Does swap buffer with vsync guarantee synchronization?

I was wondering if I could assume that all buffer related GPU operations such as:
glDrawElements
glBufferData
glBufferSubData
glUnmapBuffer
are guaranteed to be completed after swap buffer is performed (i.e. frame is finished) assuming vsync is on.
I'm confused as I've come across implementations of vertex streaming techniques such as round robin VBOs, which imply that a VBO could still be in use during the next frame.
What I basically want to do is stream vertices through glMapBufferRange with GL_UNSYNCHRONIZED_BIT, managing the correct ranges myself so that writes and reads never overlap. This would work very well if I could just assume synchronization and reset the stream range index at the end of the frame.
In other words, does swap buffer with vsync guarantee synchronization?
glDrawElements, glBufferData, glBufferSubData, glUnmapBuffer are guaranteed to be completed after swap buffer is performed (i.e. frame is finished) assuming vsync is on.
No; that would be terrible for performance. That would basically impose a full GPU/CPU synchronization simply because someone wants to display a new image. Even though both the production of that image and its display are GPU processes (or at least, not necessarily synchronous with your CPU thread/process).
The point of vsync is to ensure that the new image doesn't get swapped in until the vertical synchronization period, to avoid visual tearing of the image, where half of the display comes from the old and half from the new. This is not about ensuring that anything has actually completed on the GPU relative to CPU execution.
If you are streaming data into buffer objects via persistent mapping (which should be preferred over older "unsynchronized" shenanigans), then you need to perform the synchronization yourself. Set a fence sync object after you have issued the rendering commands that will use data from the buffer region you wrote to. Then when it comes time to write to that buffer region again, check the fence sync and wait until it's available. This also gives you the freedom to expand the number of such buffer regions if rendering is consistently delayed.
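A minimal sketch of that scheme, assuming GL 4.4+ for glBufferStorage, three regions, and a placeholder REGION_SIZE:

// One-time setup: an immutable, persistently mapped buffer.
GLuint buf;
glGenBuffers(1, &buf);
glBindBuffer(GL_ARRAY_BUFFER, buf);
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_ARRAY_BUFFER, 3 * REGION_SIZE, nullptr, flags);
char* base = (char*)glMapBufferRange(GL_ARRAY_BUFFER, 0, 3 * REGION_SIZE, flags);
GLsync fences[3] = {};
int region = 0;

// Per frame: wait until the GPU is done with this region, write, draw, fence.
if (fences[region]) {
    while (glClientWaitSync(fences[region], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000)
           == GL_TIMEOUT_EXPIRED) { /* still in use; keep waiting */ }
    glDeleteSync(fences[region]);
    fences[region] = 0;
}
// ... write this frame's vertices to base + region * REGION_SIZE ...
// ... issue the draw calls that source from that region ...
fences[region] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
region = (region + 1) % 3;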

OpenGL: which OpenGL implementations are not pipelined?

The OpenGL wiki on Performance says:
"OpenGL implementations are almost always pipelined - that is to say, things are not necessarily drawn when you tell OpenGL to draw them - and the fact that an OpenGL call returned doesn't mean it finished rendering."
Since it says "almost", there must be some implementations that are not pipelined.
Here I find one:
OpenGL Pixel Buffer Object (PBO)
"Conventional glReadPixels() blocks the pipeline and waits until all
pixel data are transferred. Then, it returns control to the
application. On the contrary, glReadPixels() with PBO can schedule
asynchronous DMA transfer and returns immediately without stall.
Therefore, the application (CPU) can execute other process right away,
while transferring data with DMA by OpenGL (GPU)."
So this means conventional glReadPixels() (without a PBO) blocks the pipeline.
But I cannot actually tell this from the OpenGL reference for glReadPixels.
Then I am wondering:
which OpenGL implementations are not pipelined?
How about glDrawArrays?
The OpenGL specification itself does not use the term "pipeline" but rather "command stream". The runtime behavior of command stream execution is deliberately left open, to give implementors maximal flexibility.
The important term is "OpenGL synchronization point": https://www.opengl.org/wiki/Synchronization
Here I find one: (Link to songho article)
Note that this is not an official OpenGL specification resource. The wording "blocks the OpenGL pipeline" is a bit unfortunate, because it turns the actual blocking and the bottleneck "upside down". Essentially it means that glReadPixels can only return once all the commands leading up to the image it will fetch have been executed.
So this means conventional glReadPixels() (without a PBO) blocks the pipeline. But I cannot actually tell this from the OpenGL reference for glReadPixels.
Actually it's not the OpenGL pipeline that gets blocked, but the execution of the program on the CPU. It means that the GPU sees no further commands coming from the CPU. So the pipeline doesn't get "blocked" but in fact drained. When a pipeline drains, or needs to be restarted, one says the pipeline has stalled (i.e. the flow in the pipeline came to a halt).
From the GPU's point of view everything happens with maximum throughput: render the stuff up to the point where glReadPixels got called, do a DMA transfer; unfortunately, no further commands are available after initiating the transfer.
How about glDrawArrays?
glDrawArrays returns immediately after the data has been queued and the necessary preparations have been made.
Actually it means that this specific operation can't be pipelined, because all data needs to be transferred before the function returns; it doesn't mean other operations can't be.
Operations like that are said to stall the pipeline. One function that will always stall the pipeline is glFinish.
Usually, when a function returns a value produced by the GPU, such as the contents of a buffer, it will induce a stall.
Depending on the driver implementation, creating programs, buffers, and similar objects can be done without stalling.
Then I am wondering: which OpenGL implementations are not pipelined?
I could imagine that a pure software implementation might not be pipelined. There is not much reason to queue up work if you end up executing it on the same CPU, unless you want to take advantage of multi-threading.
But it's probably safe to say that any OpenGL implementation that uses dedicated hardware (commonly called GPU) will be pipelined. This allows the CPU and GPU to work in parallel, which is critical to get good system performance. Also, submitting work to the GPU incurs a certain amount of overhead, so it's beneficial to queue up work, and then submit it in larger batches.
But actually in OpenGL reference of glReadPixels I cannot tell the fact.
True. The man pages don't directly specify which calls cause a synchronization. In general, anything that returns values/data produced by the GPU causes synchronization. Examples that come to mind:
glFinish(). Explicitly requires a full synchronization, which is actually its only purpose.
glReadPixels(), in the non PBO case. The GPU has to finish rendering before you can read back the result.
glGetQueryObjectiv(id, GL_QUERY_RESULT, ...). Blocks until the GPU reaches the point where the query was submitted (a non-blocking alternative is sketched after this list).
glClientWaitSync(). Waits until the GPU reaches the point where the corresponding glFenceSync() was submitted.
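For the query case, the block can be avoided by polling availability first; a small sketch, assuming the query object id has already been created and its scope ended:

// GL_QUERY_RESULT_AVAILABLE never waits; only GL_QUERY_RESULT does.
GLint available = 0;
glGetQueryObjectiv(id, GL_QUERY_RESULT_AVAILABLE, &available);
if (available) {
    GLint result = 0;
    glGetQueryObjectiv(id, GL_QUERY_RESULT, &result); // safe now: no stall
    // ... use result ...
} else {
    // do other work and poll again later
}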
Note that there can be different types of synchronizations that are not directly tied to specific OpenGL calls. For example, in the case where the whole workload is GPU limited, the CPU would queue up an infinite amount of work unless there is some throttling. So the driver will block the CPU at more or less arbitrary points to let the GPU catch up to a certain point. This could happen at frame boundaries, but it does not have to. Similar synchronization can be necessary if memory runs low, or if internal driver resources are exhausted.

Efficient way of updating texture in OpenGL

I want to render multiple video streams using OpenGL. Currently I am doing this using glTexImage2D, provided by JOGL, and rendering in a Swing window.
To update the texture contents for each video frame I call glTexImage2D. I want to know whether there is any faster method to update the texture than calling glTexImage2D for every frame.
You will always be using glTexImage2D, but with the difference that the data comes from a buffer object rather than from a client memory pointer.
What's slow in updating a texture is not updating the texture, but synchronizing (blocking) with the current draw operation, and the PCIe transfer. When you call glTexImage, OpenGL must wait until it is done drawing the last frame during which it is still reading from the texture. During that time, your application is blocked and does nothing (this is necessary because otherwise you could modify or free the memory pointed to before OpenGL can copy it!). Then it must copy the data and transfer it to the graphics card, and only then does your application continue to run.
While one can't make that process much faster, one can make it run asynchronously, so this latency pretty much disappears.
The easiest way of doing this for video frames is to create a buffer name, bind it, and reserve-initialize it once.
Then, on each subsequent frame, discard-initialize it by calling glBufferData with a null data pointer, and fill it either with a non-reserving call, or by mapping the buffer's complete range.
The reason why you want to do this strange dance instead of simply overwriting the buffer is that this will not block. OpenGL will synchronize access to buffer objects so you do not overwrite data while it is still being read. glBufferData with a null data pointer is a way of telling OpenGL that you don't really care about the old contents and that you don't necessarily want the same buffer. So it will just allocate another one and give you that one, keep reading from the old one, and secretly swap them when it's done.
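A minimal sketch of that dance for a texture upload (width, height, videoFrame, and tex are placeholders; an RGBA frame is assumed):

// One-time setup: reserve the pixel unpack buffer.
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, nullptr, GL_STREAM_DRAW);

// Per frame: discard-initialize, fill, then update the texture from the buffer.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, nullptr, GL_STREAM_DRAW); // orphan
void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(dst, videoFrame, width * height * 4); // copy the new frame's pixels
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindTexture(GL_TEXTURE_2D, tex);
// With an unpack buffer bound, the data argument is an offset into it.
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);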
Since the word "synchronization" was used already, I shall explain my choice of glMapBufferRange in the link above, when in fact you want to map the whole buffer, not some range. Why would one want that?
Even if OpenGL can mostly avoid synchronizing when using the discard technique above, it may still have to, sometimes.
Also, it still has to run some kind of memory allocation algorithm to manage the buffers, which takes driver time. glMapBufferRange lets you specify additional flags, in particular (in later OpenGL versions) a flag that says "don't synchronize". This allows for a more complicated but still faster approach in which you create a single buffer twice the size you need once, and then keep mapping/writing either the lower or upper half, telling OpenGL not to synchronize at all. It is then your responsibility to know when it's safe (presumably by using a fence object), but you avoid as much overhead as possible.
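A sketch of that double-region approach (again, names and sizes are placeholders; a fence object guards each half):

// One buffer, twice the frame size; alternate halves, fencing each after use.
GLsizeiptr half = width * height * 4;
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 2 * half, nullptr, GL_STREAM_DRAW);
GLsync fences[2] = {};
int region = 0;

// Per frame: make sure this half is no longer in use, then write into it.
if (fences[region]) {
    while (glClientWaitSync(fences[region], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000)
           == GL_TIMEOUT_EXPIRED) { /* GPU not done yet; keep waiting */ }
    glDeleteSync(fences[region]);
    fences[region] = 0;
}
void* p = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, region * half, half,
                           GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(p, videoFrame, half);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE,
                (const void*)(region * half)); // offset into the PBO
fences[region] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
region = 1 - region;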
You can't update the texture without updating the texture.
Also I don't think that one call to glTexImage can be a real performance problem. If you are so concerned about it though, create two textures and map one of them for writing while using the other for drawing, then swap (just like double-buffering works).
If you could move the processing to the GPU you wouldn't have to call the function at all, which is about a 100% speedup.
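A brief sketch of the two-texture idea (names are placeholders; note it trades one frame of latency for the overlap):

// Upload into tex[write] while the GPU may still be reading tex[1 - write].
GLuint tex[2];
int write = 0;
glGenTextures(2, tex);
// ... allocate both once with glTexImage2D ...

// Per frame:
glBindTexture(GL_TEXTURE_2D, tex[write]);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, videoFrame);
glBindTexture(GL_TEXTURE_2D, tex[1 - write]); // draw with last frame's texture
// ... issue draw calls ...
write = 1 - write;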

OpenGL when can I start issuing commands again

The standards allude to rendering starting upon my first GL command and continuing in parallel with further commands. Certain functions, like glBufferSubData, indicate that loading can happen during rendering so long as the object is not currently in use. This introduces a logical concept of a "frame", though it is never explicitly mentioned in the standard.
So my question is: what defines this logical frame? That is, which calls demarcate the frame, such that I can start making GL calls again without interfering with the previous frame?
For example, using EGL you eventually call eglSwapBuffers (most implementations have some kind of swap command). Logically this is the boundary between one frame and the next. However, this call blocks to support v-sync, meaning you can't issue new commands until it returns. Yet the documentation implies you can start issuing new commands prior to its return in another thread (provided you don't touch any in-use buffers).
How can I start issuing commands to the next buffer even while the swap command is still blocking on the previous buffer? I would like to start streaming data for the next frame while the GPU is working on the old frame (in particular, I will have two vertex buffers which would be swapped each frame specifically for this purpose, as alluded to in the OpenGL documentation).
OpenGL has no concept of "frame", logical or otherwise.
OpenGL is really very simple: every command executes as if all prior commands had completed before hand.
Note the key phrase "as if". Let's say you render from a buffer object, then modify its data immediately afterwards. Like this:
glBindVertexArray(someVaoThatUsesBufferX);
glDrawArrays(...);                     // draws using BufferX
glBindBuffer(GL_ARRAY_BUFFER, BufferX);
glBufferSubData(GL_ARRAY_BUFFER, ...); // modifies BufferX immediately after
This is 100% legal in OpenGL. There are no caveats, questions, concerns, etc about exactly how this will function. That glBufferSubData call will execute as though the glDrawArrays command has finished.
The only thing you have to consider is the one thing the specification does not specify: performance.
An implementation is well within its rights to detect that you're modifying a buffer that may be in use, and therefore stall the CPU in glBufferSubData until the rendering from that buffer is complete. The OpenGL implementation is required to do either this or something else that prevents the actual source buffer from being modified while it is in use.
So OpenGL implementations execute commands asynchronously where possible, according to the specification. As long as the outside world cannot tell that glDrawArrays didn't finish drawing anything yet, the implementation can do whatever it wants. If you issue a glReadPixels right after the drawing command, the pipeline would have to stall. You can do it, but there is no guarantee of performance.
This is why OpenGL is defined as a closed box the way it is. This gives implementations lots of freedom to be asynchronous wherever possible. Every access of OpenGL data requires an OpenGL function call, which allows the implementation to check to see if that data is actually available yet. If not, it stalls.
Getting rid of stalls is one reason why buffer object invalidation is possible; it effectively tells OpenGL that you want to orphan the buffer's data storage. It's the reason why buffer objects can be used for pixel transfers; it allows the transfer to happen asynchronously. It's the reason why fence sync objects exist, so that you can tell whether a resource is still in use (perhaps for GL_UNSYNCHRONIZED_BIT buffer mapping). And so forth.
However, this call blocks to support v-sync, meaning you can't issue new commands until it returns.
Says who? The buffer swapping command may stall. It may not. It's implementation-defined, and it can be changed with certain commands. The documentation for eglSwapBuffers only says that it performs a flush, which could stall the CPU but does not have to.