When does glBufferSubData return? [duplicate] - opengl

I want to transfer contents of a very large memory chunk to a sufficiently large GPU buffer and then immediately alter the contents of memory on CPU. Something like this in pseudo-code:
glBindBuffer(/*target*/, /*very_large_buffer*/);
glBufferSubData(/*target*/, /*offset*/, /*size*/, /*very_large_memory_chunk*/);
memset(/*very_large_memory_chunk*/, 0, /*size*/);
In this code, what does glBufferSubData actually do? Does it transfer very_large_memory_chunk somewhere before it returns, or does it just schedule the transfer for possibly later execution? If I start altering the CPU buffer immediately, is it possible that partially altered memory will be transferred, yielding garbage in the GPU's very_large_buffer?
Note that I'm not asking about rendering calls. I know that if the buffer is used for rendering, transfer operations will wait until rendering is complete and vice versa. I want to know whether OpenGL behaves the same way for CPU-to-GPU transfer operations.

OpenGL doesn't define how glBufferSubData has to be implemented: It can either copy the data immediately to GPU memory or it may defer the copy operation to a later point.
What OpenGL does guarantee (OpenGL 4.5 Specification, Section 5.3) is that a call to glBufferSubData can be treated as complete when the function returns. An implementation that defers the CPU->GPU transfer therefore has to copy the source data out of your memory (for example, into a staging area of its own) before returning.
In conclusion: you can change the memory behind the pointer immediately after glBufferSubData returns without modifying or destroying the buffer's contents.
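To illustrate that contract, here is a toy model in plain C. It is only a sketch: ModelBuffer, model_buffer_sub_data and upload_then_clear are hypothetical names, not OpenGL API, and a real driver may stage or transfer the data quite differently. The only property modeled is that the source bytes are consumed before the call returns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a buffer object's data store; not an OpenGL type. */
typedef struct {
    unsigned char storage[64];
} ModelBuffer;

/* Models the contract of glBufferSubData: the source bytes are copied
   out of the caller's memory before the function returns. */
void model_buffer_sub_data(ModelBuffer *buf, size_t offset,
                           size_t size, const void *data)
{
    assert(offset <= sizeof buf->storage);
    assert(size <= sizeof buf->storage - offset);
    memcpy(buf->storage + offset, data, size);
}

/* Returns 1 if the buffer still holds the uploaded bytes after the
   caller has zeroed its own copy of the data. */
int upload_then_clear(void)
{
    ModelBuffer buf = {{0}};
    unsigned char chunk[16];

    memset(chunk, 0xAB, sizeof chunk);
    model_buffer_sub_data(&buf, 0, sizeof chunk, chunk);
    memset(chunk, 0, sizeof chunk);  /* safe: the upload call has returned */

    for (size_t i = 0; i < sizeof chunk; ++i) {
        if (buf.storage[i] != 0xAB) {
            return 0;
        }
    }
    return 1;
}
```

In the model, zeroing chunk after the call cannot affect what the buffer holds, which is the same guarantee the answer describes for the real function.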

Related

When is data sent to the GPU with OpenGL

I've been looking into writing applications using OpenGL to render data on-screen, and there is one thing that constantly comes up -- it is slow to copy data into the GPU.
I am currently switching between reading the OpenGL SuperBible 7th Edition and reading various tutorials online, and I have not come across when data is actually sent to the GPU, I only have guesses.
Is space allocated in the GPU's RAM when I make calls to glBufferStorage/glCreateVertexArrays? Or is this space allocated in my application's memory and then copied over at a later time?
Is the pointer returned from glMapBuffer* a pointer to GPU memory, or is it a pointer to space allocated in my application's memory that is then copied over at a later time?
Assuming that the data is stored in my application's memory and copied over to the GPU, when is the data actually copied? When I make a call to glDrawArrays?
1: glCreateVertexArrays doesn't have anything to do with buffer objects or GPU memory (of that kind), so it's kinda irrelevant.
As for the rest, when OpenGL decides to allocate actual GPU memory is up to the OpenGL implementation. It can defer the actual allocation as long as it wants.
If you're asking about when your data is uploaded to OpenGL, OpenGL will always be finished with any pointer you pass it when that function call returns. So the implementation will either copy the data to the GPU-accessible memory within the call itself, or it will allocate some CPU memory and copy your data into that, scheduling the transfer to the actual GPU storage for later.
As a matter of practicality, you should assume that copying to the buffer doesn't happen immediately. This is because DMAs usually require certain memory alignment, and the pointer you pass may not have that alignment.
But usually, you shouldn't care. Let the implementation do its job.
2: Like the above, the implementation can do whatever it wants when you map memory. It might give you a genuine pointer to GPU-accessible memory. Or it might just allocate a block of CPU memory and DMA it up when you unmap the memory.
The only exception to this is persistent mapping. That feature requires that OpenGL give you an actual pointer to the actual GPU-accessible memory that the buffer resides in. This is because you never actually tell the implementation when you're finished writing to/reading from the memory.
This is also (part of) why OpenGL requires you to allocate buffer storage immutably to be able to use persistent mapping.
3: It is copied whenever the implementation feels that it needs to be.
OpenGL implementations are a black box. What they do is more-or-less up to them. The only requirement the specification makes is that their behavior act "as if" it were doing things the way the specification says. As such, the data can be copied whenever the implementation feels like copying it, so long as everything still works "as if" it had copied it immediately.
Making a draw call does not require that any buffer DMAs that this draw command relies on have completed at that time. It merely requires that those DMAs will happen before the GPU actually executes that drawing command. The implementation could do that by blocking in the glDraw* call until the DMAs have completed. But it can also use internal GPU synchronization mechanisms to tie the drawing command being issued to the completion of the DMA operation(s).
The only thing that will guarantee that the upload has actually completed is to call a function that causes the GPU to access the buffer, then synchronize the CPU with that command. Synchronizing after only the upload doesn't guarantee anything. The upload itself is not observable behavior, so synchronizing there may not have an effect.
Then again, it might. That's the point; you cannot know.

Efficient way of updating texture in OpenGL

I want to render multiple video streams using OpenGL. Currently I am doing this with glTexImage2D, as provided by JOGL, and rendering to a Swing window.
To update the texture content for each video frame I call glTexImage2D. Is there a faster way to update the texture without calling glTexImage2D for every frame?
You will always be using glTexImage2D, but with the difference that the data comes from a buffer object (a pixel buffer object, PBO) rather than from a pointer.
What's slow in updating a texture is not updating the texture, but synchronizing (blocking) with the current draw operation, and the PCIe transfer. When you call glTexImage, OpenGL must wait until it is done drawing the last frame during which it is still reading from the texture. During that time, your application is blocked and does nothing (this is necessary because otherwise you could modify or free the memory pointed to before OpenGL can copy it!). Then it must copy the data and transfer it to the graphics card, and only then your application continues to run.
While one can't make that process much faster, one can make it run asynchronously, so this latency pretty much disappears.
The easiest way of doing this for video frames is to create a buffer name, bind it, and reserve-initialize it once.
Then, on each subsequent frame, discard-initialize it by calling glBufferData with a null data pointer, and fill it either with a non-reserving call, or by mapping the buffer's complete range.
The reason why you want to do this strange dance instead of simply overwriting the buffer is that this will not block. OpenGL will synchronize access to buffer objects so you do not overwrite data while it is still reading from it. glBufferData with a null data pointer is a way of telling OpenGL that you don't really care about the buffer's contents and that you don't necessarily want the same buffer. So it will just allocate another one and give you that one, keep reading from the old one, and secretly swap them when it's done.
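The swap can be pictured with a toy model in plain C. This is only a sketch of the idea: ModelBufferName, model_orphan and orphan_keeps_old_data are hypothetical names, not OpenGL API, and real drivers keep this bookkeeping internally and may implement it differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a buffer name and its current data store.
   Not an OpenGL type. */
typedef struct {
    unsigned char *store;
    size_t size;
} ModelBufferName;

/* Models glBufferData(target, size, NULL, usage) on an existing buffer:
   the name is pointed at a fresh store, and the old store is handed back
   so that a pending read (the "GPU") can finish with it. */
unsigned char *model_orphan(ModelBufferName *name)
{
    assert(name != NULL);
    unsigned char *old = name->store;
    name->store = malloc(name->size);
    return old;
}

/* Returns 1 if the in-flight reader still sees frame N's data after the
   application has written frame N+1 through the same buffer name. */
int orphan_keeps_old_data(void)
{
    ModelBufferName name = { malloc(4), 4 };
    if (name.store == NULL) return 0;
    memset(name.store, 1, name.size);      /* frame N */

    unsigned char *in_flight = model_orphan(&name);
    if (name.store == NULL) { free(in_flight); return 0; }
    memset(name.store, 2, name.size);      /* frame N+1, written without waiting */

    int ok = (in_flight[0] == 1) && (name.store[0] == 2);
    free(in_flight);                       /* the "GPU" is done with frame N */
    free(name.store);
    return ok;
}
```

The application only ever sees one buffer name, yet the old data stays intact for whoever is still reading it, which is why the write does not have to wait.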
Since the word "synchronization" has come up already, I should explain why I suggest glMapBufferRange when in fact you want to map the whole buffer, not some range. Why would one want that?
Even if OpenGL can mostly avoid synchronizing when using the discard technique above, it may still have to, sometimes.
Also, it still has to run some kind of memory allocation algorithm to manage the buffers, which takes driver time. glMapBufferRange lets you specify additional flags, in particular (in later OpenGL versions) a flag that says "don't synchronize". This allows for a more complicated but still faster approach in which you create a single buffer twice the size you need once, and then keep mapping/writing either the lower or upper half, telling OpenGL not to synchronize at all. It is then your responsibility to know when it's safe (presumably by using a fence object), but you avoid all overhead as much as possible.
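The lower/upper half bookkeeping is small enough to sketch in plain C. half_offset and frame_bytes are hypothetical names chosen for illustration, and the sketch assumes the buffer was created with twice the size of one frame's data.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of the half to map for a given frame number: even frames
   use the lower half, odd frames the upper half. frame_bytes is the size
   of one frame's data, i.e. half the size of the whole buffer. */
size_t half_offset(unsigned long frame, size_t frame_bytes)
{
    assert(frame_bytes > 0);
    return (size_t)(frame % 2u) * frame_bytes;
}
```

The result would be passed as the offset argument of glMapBufferRange, with frame_bytes as the length and GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT as the access flags. Since no synchronization happens for you in that case, a fence per half is still needed to know when the GPU has finished reading it.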
You can't update the texture without updating the texture.
Also, I don't think that one call to glTexImage can be a real performance problem. If you are that concerned about it, though, create two textures and update one of them while using the other for drawing, then swap (just like double-buffering works).
If you could move processing to GPU you wouldn't have to call the function at all, which is about 100% speedup.

OpenGL when can I start issuing commands again

The standards allude to rendering starting upon my first gl command and continuing in parallel to further commands. Certain functions, like glBufferSubData indicate loading can happen during rendering so long as the object is not currently in use. This introduces a logical concept of a "frame", though never explicitly mentioned in the standard.
So my question is: what defines this logical frame? That is, which calls demarcate the frame, such that I can start making gl calls again without interfering with the previous frame?
For example, using EGL you eventually call eglSwapBuffers (most implementations have some kind of swap command). Logically this is the boundary between one frame and the next. However, this call blocks to support v-sync, meaning you can't issue new commands until it returns. Yet the documentation implies you can start issuing new commands prior to its return in another thread (provided you don't touch any in-use buffers).
How can I start issuing commands to the next buffer even while the swap command is still blocking on the previous buffer? I would like to start streaming data for the next frame while the GPU is working on the old frame (in particular, I will have two vertex buffers which would be swapped each frame specifically for this purpose, and alluded to in the OpenGL documentation).
OpenGL has no concept of "frame", logical or otherwise.
OpenGL is really very simple: every command executes as if all prior commands had completed beforehand.
Note the key phrase "as if". Let's say you render from a buffer object, then modify its data immediately afterwards. Like this:
glBindVertexArray(someVaoThatUsesBufferX);
glDrawArrays(...);
glBindBuffer(GL_ARRAY_BUFFER, BufferX);
glBufferSubData(GL_ARRAY_BUFFER, ...);
This is 100% legal in OpenGL. There are no caveats, questions, concerns, etc about exactly how this will function. That glBufferSubData call will execute as though the glDrawArrays command has finished.
The only thing you have to consider is the one thing the specification does not specify: performance.
An implementation is well within its rights to detect that you're modifying a buffer that may be in use, and therefore stall the CPU in glBufferSubData until the rendering from that buffer is complete. The OpenGL implementation is required to do either this or something else that prevents the actual source buffer from being modified while it is in use.
So OpenGL implementations execute commands asynchronously where possible, according to the specification. As long as the outside world cannot tell that glDrawArrays didn't finish drawing anything yet, the implementation can do whatever it wants. If you issue a glReadPixels right after the drawing command, the pipeline would have to stall. You can do it, but there is no guarantee of performance.
This is why OpenGL is defined as a closed box the way it is. This gives implementations lots of freedom to be asynchronous wherever possible. Every access of OpenGL data requires an OpenGL function call, which allows the implementation to check to see if that data is actually available yet. If not, it stalls.
Getting rid of stalls is one reason why buffer object invalidation is possible; it effectively tells OpenGL that you want to orphan the buffer's data storage. It's the reason why buffer objects can be used for pixel transfers; it allows the transfer to happen asynchronously. It's the reason why fence sync objects exist, so that you can tell whether a resource is still in use (perhaps for GL_UNSYNCHRONIZED_BIT buffer mapping). And so forth.
"However, this call blocks to support v-sync, meaning you can't issue new commands until it returns."
Says who? The buffer swapping command may stall. It may not. It's implementation-defined, and it can be changed with certain commands. The documentation for eglSwapBuffers only says that it performs a flush, which could stall the CPU but does not have to.

Allocating PBOs on separate thread?

I've got a multithreaded OpenGL application using PBOs for data transfers between cpu and gpu.
I have pooled the allocation of PBOs, however, when the pools are empty my non-opengl threads have to block for a while until the OpenGL thread reaches a point where it can allocate the buffers (i.e. finished rendering the current frame). This waiting is causing some lag spikes in certain situations which I'd like to avoid.
Is it possible to allocate PBOs on another thread which are then used by the "main" OpenGL thread?
Yes, you can create objects on one thread which can be used on another. You'll need a new GL context to do this.
That being said, there are two concerns.
First, it depends on what you mean by "allocation of PBOs". You should never be allocating buffer objects in the middle of a frame. You should allocate all the buffers you need up front. When the time comes to use them, then you can simply use what you have.
By "allocate", I mean calling glBufferData on a previously allocated buffer with a different size or usage hint than was used before, or using glGenBuffers and glDeleteBuffers in any way. Neither of these should happen within a frame.
Second, invalidating the buffer should never cause "lag spikes". By "invalidate", I mean reallocating the buffer with glBufferData using the same size and usage hint, or using glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT flag. You should look at this page on how to stream buffer object data for details. If you're getting issues, then you're probably on NVIDIA hardware and using the wrong buffer object usage hint (i.e. use one of the STREAM hints).

OpenGL Unsynchronized/Non-blocking Map

I just found the following OpenGL specification for ARB_map_buffer_range.
I'm wondering if it is possible to do non-blocking map calls using this extension?
Currently in my application I'm rendering to an FBO which I then map to a host PBO buffer.
glMapBuffer(target_, GL_READ_ONLY);
However, the problem with this is that it blocks the rendering thread while transferring the data.
I could reduce this issue by pipelining the rendering, but latency is a big issue in my application.
My question is whether I can use map_buffer_range with MAP_UNSYNCHRONIZED_BIT and wait for the map operation to finish on another thread, or defer the map operation on the same thread, while the rendering thread renders the next frame.
e.g.
thread 1:
map();
render_next_frame();
thread 2:
wait_for_map
or
thread 1:
map();
while(!is_map_ready())
do_some_rendering_for_next_frame();
What I'm unsure of is how I know when the map operation is ready. The specification only mentions "other synchronization techniques to ensure correct operation".
Any ideas?
If you map a buffer with GL_MAP_UNSYNCHRONIZED_BIT, the driver will not wait until OpenGL is done with that memory before mapping it for you. So you will get more or less immediate access to it.
The problem is that this does not mean that you can just read/write that memory willy-nilly. If OpenGL is reading from or writing to that buffer and you change it... welcome to undefined behavior. Which can include crashing.
Therefore, in order to actually use unsynchronized mapping, you must synchronize your behavior to OpenGL's access of that buffer. This will involve the use of ARB_sync objects (or NV_fence if you're only on NVIDIA and haven't updated your drivers recently).
That being said, if you're using a fence object to synchronize access to the buffer, then you really don't need GL_MAP_UNSYNCHRONIZED_BIT at all. Once you finish the fence, or detect that it has completed, you can map the buffer normally and it should complete immediately (unless some other operation is reading/writing too).
In general, unsynchronized access is best used for when you need fine-grained write access to the buffer. In this case, good use of sync objects will get you what you really need (the ability to tell when the map operation is finished).
Addendum: The above is now outdated (depending on your hardware). Thanks to OpenGL 4.4/ARB_buffer_storage, you can now not only map unsynchronized, you can keep a buffer mapped indefinitely. Yes, you can have a buffer mapped while it is in use.
This is done by creating immutable storage and providing that storage with (among other things) the GL_MAP_PERSISTENT_BIT. Then you glMapBufferRange, also providing the same bit.
Now technically, that changes pretty much nothing. You still need to synchronize your actions with OpenGL. If you write stuff to a region of the buffer, you'll need to either issue a barrier or flush that region of the buffer explicitly. And if you're reading, you still need to use a fence sync object to make sure that the data is actually there before reading it (and unless you use GL_MAP_COHERENT_BIT too, you'll need to issue a barrier before reading).
In general, it is not possible to do a "nonblocking map", but you can map without blocking.
The reason why there can be no "nonblocking map" is that the moment the function call returns, you could access the data, so the driver must make sure it is there, positively. If the data has not been transferred, what else can the driver do but block?
Threads don't make this any better, and possibly make it worse (adding synchronisation and context sharing issues). Threads cannot magically remove the need to transfer data.
And this leads to how to not block on mapping: only map when you are sure that the transfer is finished. One safe way to do this is to map the buffer after flipping buffers, after glFinish, or after waiting on a query/fence object. Using a fence is the preferable way if you can't wait until buffers have been swapped. A fence won't stall the pipeline, but will tell you whether or not your transfer is done (glFinish may or may not, but will probably stall).
Reading after swapping buffers is also 100% safe, but may not be acceptable if you need the data within the same frame (works perfectly for screenshots or for calculating a histogram for tonemapping, though).
A less safe way is to insert "some other stuff" and hope that in the mean time the transfer has completed.
Regarding the comment below:
This answer is not incorrect. It isn't possible to do any better than access data after it's available (this should be obvious). Which means that you must sync/block, one way or the other, there is no choice.
Although, from a very pedantic point of view, you can of course use GL_MAP_UNSYNCHRONIZED_BIT to get a non-blocking map operation, this is entirely irrelevant, as it does not work unless you explicitly reproduce the implicit sync as described above. A mapping that you can't safely access is good for nothing.
Mapping and accessing a buffer that OpenGL is transferring data to without synchronizing/blocking (implicitly or explicitly) means "undefined behavior", which is only a nicer wording for "probably garbage results, maybe crash".
If, on the other hand, you explicitly synchronize (say, with a fence as described above), then it's irrelevant whether or not you use the unsynchronized flag, since no more implicit sync needs to happen anyway.