Allocating PBOs on a separate thread? - c++

I've got a multithreaded OpenGL application using PBOs for data transfers between the CPU and GPU.
I have pooled the allocation of PBOs; however, when the pools are empty, my non-OpenGL threads have to block until the OpenGL thread reaches a point where it can allocate new buffers (i.e. has finished rendering the current frame). This waiting causes lag spikes in certain situations which I'd like to avoid.
Is it possible to allocate PBOs on another thread which are then used by the "main" OpenGL thread?

Yes, you can create objects on one thread which can be used on another. You'll need a new GL context to do this.
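For illustration, here is a minimal sketch of that setup using GLFW (the window sizes, names, and the choice of GLFW itself are assumptions; any windowing API with context sharing works the same way). The last argument to glfwCreateWindow puts both contexts into one share group, so buffer objects created on the loader thread are visible to the render thread:

    #include <GLFW/glfw3.h>

    // Main thread: create the render context.
    GLFWwindow* render_window = glfwCreateWindow(1280, 720, "app", nullptr, nullptr);

    // Hidden 1x1 window whose context shares objects with the render context.
    glfwWindowHint(GLFW_VISIBLE, GLFW_FALSE);
    GLFWwindow* loader_context = glfwCreateWindow(1, 1, "", nullptr, render_window);

    // Loader thread:
    //   glfwMakeContextCurrent(loader_context);
    //   glGenBuffers(1, &pbo);
    //   glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    //   glBufferData(GL_PIXEL_UNPACK_BUFFER, size, nullptr, GL_STREAM_DRAW);
    //   glFlush();  // flush so the new buffer is visible when the render thread binds it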
That being said, there are two concerns.
First, it depends on what you mean by "allocation of PBOs". You should never be allocating buffer objects in the middle of a frame. You should allocate all the buffers you need up front. When the time comes to use them, then you can simply use what you have.
By "allocate", I mean call glBufferData on a previously allocated buffer using a different size or driver hint than was used before. Or by using glGenBuffers and glDeleteBuffers in any way. Neither of these should happen within a frame.
Second, invalidating the buffer should never cause "lag spikes". By "invalidate", I mean reallocating the buffer with glBufferData using the same size and usage hint, or using glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT flag. You should look at this page on how to stream buffer object data for details. If you're getting issues, then you're probably on NVIDIA hardware and using the wrong buffer object hint (i.e. use STREAM).
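As a concrete illustration, here is a minimal sketch of that invalidate-and-refill pattern (the buffer size and the GL_PIXEL_UNPACK_BUFFER target are assumptions for a PBO upload case). The buffer's storage is created once at startup and only invalidated per frame:

    const GLsizeiptr kSize = 4 * 1024 * 1024;
    GLuint pbo;

    // Init time: allocate the PBO's storage once.
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, kSize, nullptr, GL_STREAM_DRAW);

    // Per frame: invalidate (same size, same hint) and refill.
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    void* ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, kSize,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    // ... write this frame's data through ptr ...
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

Calling glBufferData with a null pointer (same size and hint) before mapping is the older, equivalent way to express the same invalidation.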

Related

GPU Memory Management in OpenGL and DirectX 12

I am currently improving my knowledge of OpenGL and DirectX 12 in order to create graphics applications with both APIs. I have studied several tutorials, but I still do not completely understand how memory is managed on the GPU side.
In OpenGL (my application runs an OpenGL 3.3 context), the frame buffers are created implicitly, so I assume that they are also freed implicitly by the API. In my example program, I created vertex and index buffers using glGenBuffers and uploaded them to the GPU using glBufferData. If I want to update my vertex buffer every frame, I can simply do this using glBufferSubData. Suppose instead that I want to re-upload my vertex buffer every frame using glBufferData. According to the OpenGL documentation, this function creates and initializes the buffer's data store on the GPU. So I assume that the GPU memory backing this VBO is reused after another call to glBufferData in the next frame.
In DirectX 12, the frame buffers must be created by the graphics programmer. They are managed and reused by the swap chain during the lifetime of the program. In my DirectX 12 test program, I also create vertex and index buffers using upload heaps and the ID3D12Device::CreateCommittedResource function. I also do this every frame for testing purposes. The buffers are stored in Microsoft::WRL::ComPtr<ID3D12Resource> variables. At the end of the render method, the use count of those buffer pointers should hit 0, which frees the object on the CPU side. Nevertheless, I do not understand what happens to the data and the underlying heap on the GPU side. Are they released whenever the buffer pointer's use count hits 0, do they need to be freed manually, are they discarded by the GPU when it reaches the fence, or none of these?
I would really appreciate it if you could provide some clarification on this topic and my assumptions.
Could you also please explain if and how GPU data needs to be freed by the graphics programmer.
Best regards.
For DirectX 12, it uses the same lifetime model as previous versions of Direct3D: The object is kept alive until the reference count hits 0. It's then eligible for destruction. The exact time of cleanup is up to the driver/runtime as it typically does 'delayed destruction' (it actually has both an 'internal' and 'external' reference count, and both have to be 0 before it's really eligible for destruction).
See Microsoft Docs.
You should watch this YouTube video if you want an overview of the underlying details.
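On the application side, the practical consequence is that you must not let the last reference drop while the GPU may still be reading the resource. A common pattern is to park the ComPtr together with the fence value of the frame that last used it, and only drop the reference once the GPU has passed that fence. A hedged sketch of that bookkeeping (the function and variable names here are hypothetical; only ID3D12Fence::GetCompletedValue is D3D12 API):

    #include <wrl/client.h>
    #include <d3d12.h>
    #include <deque>
    #include <utility>

    using Microsoft::WRL::ComPtr;

    // Resources retired this frame, tagged with the fence value that must
    // complete before they may be destroyed.
    std::deque<std::pair<UINT64, ComPtr<ID3D12Resource>>> g_retired;

    void RetireResource(UINT64 fenceValue, ComPtr<ID3D12Resource> res) {
        g_retired.emplace_back(fenceValue, std::move(res));
    }

    void ReleaseCompleted(ID3D12Fence* fence) {
        const UINT64 done = fence->GetCompletedValue();  // last value the GPU signaled
        while (!g_retired.empty() && g_retired.front().first <= done)
            g_retired.pop_front();  // the ComPtr drops its reference here
    }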

Does swap buffer with vsync guarantee synchronization?

I was wondering if I could assume that all buffer-related GPU operations such as:
glDrawElements
glBufferData
glBufferSubData
glUnmapBuffer
are guaranteed to be completed after swap buffer is performed (i.e. frame is finished) assuming vsync is on.
I'm confused, as I've come across implementations of vertex streaming techniques such as round-robin VBOs which imply that a VBO could still be in use during the next frame.
What I basically want to do is stream vertices through glMapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT, managing the correct ranges myself so that writes and reads never overlap. This would work very well if I could just assume synchronization and reset the stream range index at the end of the frame.
In other words, does swap buffer with vsync guarantee synchronization?
"glDrawElements, glBufferData, glBufferSubData, glUnmapBuffer are guaranteed to be completed after swap buffer is performed (i.e. frame is finished) assuming vsync is on."
No; that would be terrible for performance. That would basically impose a full GPU/CPU synchronization simply because someone wants to display a new image. Even though both the production of that image and its display are GPU processes (or at least, not necessarily synchronous with your CPU thread/process).
The point of vsync is to ensure that the new image doesn't get swapped in until the vertical synchronization period, to avoid visual tearing of the image, where half of the display comes from the old and half from the new. This is not about ensuring that anything has actually completed on the GPU relative to CPU execution.
If you are streaming data into buffer objects via persistent mapping (which should be preferred over the older "unsynchronized" shenanigans), then you need to perform the synchronization yourself. Set a fence sync object after you have issued the rendering commands that will use data from the buffer region you wrote to. Then, when it comes time to write to that buffer region again, check the fence sync and wait until it has signaled. This also gives you the freedom to expand the number of such buffer regions if rendering is consistently delayed.
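A minimal sketch of that fence-per-region scheme (the region count and function names are assumptions, not anything from the answer above):

    const int kNumRegions = 3;               // e.g. triple-buffered regions
    GLsync g_fence[kNumRegions] = {};

    // Call right after issuing the draws that read from region `i`.
    void MarkRegionInFlight(int i) {
        g_fence[i] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    }

    // Call before writing to region `i` again.
    void WaitRegionFree(int i) {
        if (!g_fence[i]) return;             // region never used yet
        while (glClientWaitSync(g_fence[i], GL_SYNC_FLUSH_COMMANDS_BIT,
                                1000000) == GL_TIMEOUT_EXPIRED) {
            // GPU still reading this region; keep waiting, 1 ms at a time
        }
        glDeleteSync(g_fence[i]);
        g_fence[i] = nullptr;
    }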

When is data sent to the GPU with OpenGL

I've been looking into writing applications using OpenGL to render data on-screen, and there is one thing that constantly comes up: it is slow to copy data into the GPU.
I am currently switching between reading the OpenGL SuperBible (7th Edition) and various tutorials online, and I have not come across an explanation of when data is actually sent to the GPU; I only have guesses.
Is space allocated in the GPU's RAM when I make calls to glBufferStorage/glCreateVertexArrays? Or is this space allocated in my application's memory and then copied over at a later time?
Is the pointer returned from glMapBuffer* a pointer to GPU memory, or is it a pointer to space allocated in my application's memory that is then copied over at a later time?
Assuming that the data is stored in my application's memory and copied over to the GPU, when is the data actually copied? When I make a call to glDrawArrays?
1: glCreateVertexArrays doesn't have anything to do with buffer objects or GPU memory (of that kind), so it's kinda irrelevant.
As for the rest, when OpenGL decides to allocate actual GPU memory is up to the OpenGL implementation. It can defer the actual allocation as long as it wants.
If you're asking about when your data is uploaded to OpenGL, OpenGL will always be finished with any pointer you pass it when that function call returns. So the implementation will either copy the data to the GPU-accessible memory within the call itself, or it will allocate some CPU memory and copy your data into that, scheduling the transfer to the actual GPU storage for later.
As a matter of practicality, you should assume that copying to the buffer doesn't happen immediately. This is because DMAs usually require certain memory alignment, and the pointer you pass may not have that alignment.
But usually, you shouldn't care. Let the implementation do its job.
2: Like the above, the implementation can do whatever it wants when you map memory. It might give you a genuine pointer to GPU-accessible memory. Or it might just allocate a block of CPU memory and DMA it up when you unmap the memory.
The only exception to this is persistent mapping. That feature requires that OpenGL give you an actual pointer to the actual GPU-accessible memory that the buffer resides in. This is because you never actually tell the implementation when you're finished writing to/reading from the memory.
This is also (part of) why OpenGL requires you to allocate buffer storage immutably to be able to use persistent mapping.
3: It is copied whenever the implementation feels that it needs to be.
OpenGL implementations are a black box. What they do is more-or-less up to them. The only requirement the specification makes is that their behavior act "as if" it were doing things the way the specification says. As such, the data can be copied whenever the implementation feels like copying it, so long as everything still works "as if" it had copied it immediately.
Making a draw call does not require that any buffer DMAs that this draw command relies on have completed at that time. It merely requires that those DMAs will happen before the GPU actually executes that drawing command. The implementation could do that by blocking in the glDraw* call until the DMAs have completed. But it can also use internal GPU synchronization mechanisms to tie the drawing command being issued to the completion of the DMA operation(s).
The only thing that will guarantee that the upload has actually completed is to call a function that will cause the GPU to access the buffer, then synchronizing the CPU with that command. Synchronizing after only the upload doesn't guarantee anything. The upload itself is not observable behavior, so synchronizing there may not have an effect.
Then again, it might. That's the point; you cannot know.
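To make that concrete, here is a hedged sketch of the only reliable check described above (the size, data, and vertex_count names are hypothetical): make the GPU actually consume the buffer, then synchronize with that consuming command rather than with the upload:

    glBufferSubData(GL_ARRAY_BUFFER, 0, size, data);   // schedule the upload
    glDrawArrays(GL_TRIANGLES, 0, vertex_count);       // a GPU command that reads it
    GLsync done = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glClientWaitSync(done, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000); // wait up to 1 s
    glDeleteSync(done);
    // Only now is the draw, and therefore the upload it depends on, known complete.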

OpenGL multithreading/shared context and glGenBuffers

If I plan to use multithreading in OpenGL, should I have separate buffers (from glGenBuffers) for each context?
I do not know much about OpenGL multithreading yet (for now I work in a single thread). I need to know whether I can share buffers already pushed to video memory (with glBufferData/glBufferSubData), or whether I have to keep a copy of each buffer for the other thread.
You do not want to use several contexts with several threads. You really don't.
While this sounds like a good idea, in practice multi-context, multi-threaded OpenGL is complicated, troublesome, and badly supported on the driver side, and it only marginally improves (possibly even reduces!) performance.
What you really want is to have only one thread talk to OpenGL (with one context, obviously), map a buffer, and pass the memory pointer to another thread, preferably using 3 buffers (3 sub-buffers of a 3x-sized buffer) with immutable storage and persistent mapping, if available.
That, and doing indirect render calls, where a second thread feeds the buffers the indirect call reads from.
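A minimal sketch of that single-GL-thread setup (the region size and names are assumptions; requires OpenGL 4.4 or ARB_buffer_storage):

    const GLsizeiptr kRegion = 1 << 20;   // 1 MiB per sub-buffer, three regions
    GLuint buf;
    glGenBuffers(1, &buf);
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    glBufferStorage(GL_ARRAY_BUFFER, 3 * kRegion, nullptr,
                    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
    char* base = static_cast<char*>(glMapBufferRange(
        GL_ARRAY_BUFFER, 0, 3 * kRegion,
        GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT));

    // The worker thread never touches OpenGL; it just writes into plain memory:
    //   std::memcpy(base + (frame % 3) * kRegion, src, bytes);
    // The GL thread then draws from that region, guarded by a fence sync so the
    // worker never overwrites a region the GPU is still reading.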
Further info on the persistent mapping topic: See in particular slides 22-25 of this GDC2014 presentation, which is basically a remake of Cass Everitt's 2013 SIGGRAPH talk.
See also Everitt's original talk: Beyond porting.
VAOs aren't shared between contexts, so you'll need to generate a new VAO for each object per context, or else the behavior will become unpredictable and incorrect upon deletion/creation of a new one. This can be a major source of error. VBOs can be shared, so you only need one VBO per object.
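A short sketch of what that means in practice (the attribute layout is a hypothetical example): the VBO is created once and shared, while each context builds its own VAO over it:

    // Run this in *each* context; `vbo` may have been created in either one,
    // since buffer objects are shared across the share group.
    GLuint vao;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);            // the shared buffer object
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
    glEnableVertexAttribArray(0);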

Efficient way of updating texture in OpenGL

I want to render multiple video streams using OpenGL. Currently I am doing this using glTexImage2D (provided by JOGL) and rendering in a Swing window.
To update the texture contents for each video frame I call glTexImage2D. I want to know whether there is any faster method to update a texture than calling glTexImage2D for each frame.
You will always be using glTexImage2D, but with the difference that the data comes from a buffer object (a pixel buffer object) rather than from a pointer.
What's slow in updating a texture is not updating the texture, but synchronizing (blocking) with the current draw operation, and the PCIe transfer. When you call glTexImage, OpenGL must wait until it is done drawing the last frame during which it is still reading from the texture. During that time, your application is blocked and does nothing (this is necessary because otherwise you could modify or free the memory pointed to before OpenGL can copy it!). Then it must copy the data and transfer it to the graphics card, and only then your application continues to run.
While one can't make that process much faster, one can make it run asynchronously, so this latency pretty much disappears.
The easiest way of doing this for video frames is to create a buffer name, bind it, and reserve-initialize it once.
Then, on each subsequent frame, discard-initialize it by calling glBufferData with a null data pointer, and fill it either with a non-reserving call or by mapping the buffer's complete range.
The reason why you want to do this strange dance instead of simply overwriting the buffer is that this will not block. OpenGL will synchronize access to buffer objects so you do not overwrite data while it is still being read. glBufferData with a null data pointer is a way of telling OpenGL that you don't really care about the old contents and that you don't necessarily need the same buffer. So it will just allocate another one and give you that one, keep reading from the old one, and secretly swap them when it's done.
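A minimal sketch of that discard-and-refill upload (the texture size, format, and names are assumptions; the PBO and texture are created once at startup). glTexSubImage2D is used here since the texture storage already exists, and with a PBO bound its last argument is an offset into the buffer rather than a client pointer, so the call returns without copying from application memory:

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, frameBytes, nullptr, GL_STREAM_DRAW); // discard
    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    std::memcpy(dst, videoFrame, frameBytes);        // fill the fresh buffer
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, nullptr);  // offset 0 into the PBO
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);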
Since the word "synchronization" was used already, I shall explain my choice of glMapBufferRange in the link above, when in fact you want to map the whole buffer, not some range. Why would one want that?
Even if OpenGL can mostly avoid synchronizing when using the discard technique above, it may still have to, sometimes.
Also, it still has to run some kind of memory allocation algorithm to manage the buffers, which takes driver time. glMapBufferRange lets you specify additional flags, in particular (in later OpenGL versions) a flag that says "don't synchronize". This allows for a more complicated but still faster approach in which you create a single buffer twice the size you need once, and then keep mapping/writing either the lower or upper half, telling OpenGL not to synchronize at all. It is then your responsibility to know when it's safe (presumably by using a fence object), but you avoid all overhead as much as possible.
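Sketched out, that double-region scheme looks roughly like this (all names are assumptions; `fence` is a two-element GLsync array, and the PBO was created once with 2 * frameBytes of storage):

    int half = frameIndex & 1;
    GLintptr offset = half * frameBytes;

    if (fence[half]) {  // ensure the GPU is done with this half before writing
        glClientWaitSync(fence[half], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000);
        glDeleteSync(fence[half]);
    }
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    void* dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, offset, frameBytes,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    std::memcpy(dst, videoFrame, frameBytes);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA,
                    GL_UNSIGNED_BYTE, reinterpret_cast<void*>(offset));
    fence[half] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);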
You can't update the texture without updating the texture.
Also, I don't think that one call to glTexImage can be a real performance problem. If you are that concerned about it, though, create two textures and map one of them for writing while using the other for drawing, then swap (just like double-buffering works).
If you could move the processing to the GPU, you wouldn't have to call the function at all, which is about a 100% speedup.