I want to render multiple video streams using OpenGL. Currently I am doing this with glTexImage2D provided by JOGL and rendering into a Swing window.
For updating the texture contents for each video frame I call glTexImage2D. I want to know whether there is a faster method to update the texture than calling glTexImage2D for every frame.
You will always be using glTexImage2D, but with the difference that the data comes from a buffer object (a pixel buffer object) rather than from a client memory pointer.
What's slow in updating a texture is not the update itself, but synchronizing (blocking) with the current draw operation, and the PCIe transfer. When you call glTexImage, OpenGL must wait until it has finished drawing the last frame that still reads from the texture. During that time, your application is blocked and does nothing (this is necessary because otherwise you could modify or free the memory pointed to before OpenGL has copied it!). Then it must copy the data and transfer it to the graphics card, and only then does your application continue to run.
While one can't make that process much faster, one can make it run asynchronously, so this latency pretty much disappears.
The easiest way of doing this for video frames is to create a buffer name, bind it, and reserve-initialize it once.
Then, on each subsequent frame, discard-initialize (orphan) it by calling glBufferData with a null data pointer, and fill it either with a non-reserving call (glBufferSubData) or by mapping the buffer's complete range.
The reason why you want to do this strange dance instead of simply overwriting the buffer is that this will not block. OpenGL will synchronize access to buffer objects so you do not overwrite data while it is still reading from it. glBufferData with a null data pointer is a way of telling OpenGL that you don't really care about the contents and that you don't necessarily want the same buffer. So it will just allocate another one and give you that one, keep reading from the old one, and secretly swap them when it's done.
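Put together, the per-frame update might look roughly like the following sketch (assuming an RGBA video frame; names such as pbo, tex, frameWidth, frameHeight and videoFrame are placeholders, and the texture has already been allocated once):

// Per frame: orphan the buffer, fill it, then source the texture update from it.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, frameWidth * frameHeight * 4, NULL, GL_STREAM_DRAW); // discard old storage
void *dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, frameWidth * frameHeight * 4, GL_MAP_WRITE_BIT);
memcpy(dst, videoFrame, frameWidth * frameHeight * 4);    // copy the decoded frame
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, frameWidth, frameHeight,
                GL_RGBA, GL_UNSIGNED_BYTE, (void *)0);    // data now comes from the bound PBO, offset 0
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

(glTexSubImage2D is used here because the texture storage already exists; glTexImage2D works the same way with a PBO bound.)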
Since the word "synchronization" was already used, I should explain why I suggest glMapBufferRange, when in fact you want to map the whole buffer, not some sub-range. Why would one want that?
Even if OpenGL can mostly avoid synchronizing when using the discard technique above, it may still have to, sometimes.
Also, it still has to run some kind of memory allocation algorithm to manage the buffers, which takes driver time. glMapBufferRange lets you specify additional flags, in particular (in later OpenGL versions) a flag that says "don't synchronize". This allows for a more complicated but still faster approach in which you create a single buffer twice the size you need once, and then keep mapping/writing either the lower or upper half, telling OpenGL not to synchronize at all. It is then your responsibility to know when it's safe (presumably by using a fence object), but you avoid all overhead as much as possible.
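A rough sketch of that double-sized, unsynchronized approach might look like this (the fence handling is simplified; names like pbo, frameSize, frameWidth, frameHeight and videoFrame are placeholders):

// One-time setup: reserve a buffer twice the frame size.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 2 * frameSize, NULL, GL_STREAM_DRAW);
GLsync fence[2] = { 0, 0 };
int half = 0;

// Per frame: write into one half while OpenGL may still be reading the other.
if (fence[half]) {                                          // make sure this half is free again
    glClientWaitSync(fence[half], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000); // wait up to 1 s
    glDeleteSync(fence[half]);
}
size_t offset = half * frameSize;
void *dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, offset, frameSize,
                             GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, videoFrame, frameSize);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, frameWidth, frameHeight,
                GL_RGBA, GL_UNSIGNED_BYTE, (const void *)offset);
fence[half] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0); // signaled once the GPU is done with this half
half = 1 - half;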
You can't update the texture without updating the texture.
Also I don't think that one call to glTexImage can be a real performance problem. If you are so concerned about it though, create two textures: write to one of them while drawing with the other, then swap them (just like double buffering works).
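A very small sketch of that ping-pong idea, assuming both textures were allocated once up front (textures can't be mapped directly, so the upload here goes through glTexSubImage2D; drawVideoQuad and the frame names are placeholders):

GLuint tex[2];                            // created and allocated once with glTexImage2D
int write = 0, draw = 1;

// Each frame: upload the new video data into one texture...
glBindTexture(GL_TEXTURE_2D, tex[write]);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, frameWidth, frameHeight,
                GL_RGBA, GL_UNSIGNED_BYTE, videoFrame);

// ...draw with the other, then swap the roles.
glBindTexture(GL_TEXTURE_2D, tex[draw]);
drawVideoQuad();                          // placeholder for your rendering code
write = 1 - write; draw = 1 - draw;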
If you could move the processing to the GPU you wouldn't have to call the function at all, which is about a 100% speedup.
Recently I looked into improving texture submissions for streaming and whatnot, and despite long searches I have not found any material presenting, or even mentioning, a way of using PBOs with DSA-only functions.
Am I not looking in the right places or is there really no way as of yet?
All of the pixel transfer functions can take either a buffer object+offset or a client CPU pointer (unlike VAO functions, for example, which can only work with buffers now). As such, allowing you to pass a buffer object+offset directly would require having a separate entrypoint for each of the two ways of doing pixel transfer. So they would need glNamedReadPixelsToBuffer and glNamedReadPixelsToClient.
So instead of further proliferating the number of functions (and instead of forbidding using client memory), they make the buffer part work the way it always did: through a binding point. So yes, you're still going to have to bind that buffer to the PACK/UNPACK binding.
Since pixel transfers are not exactly a common operation (relative to the number of other kinds of state changing and rendering commands), and since these particular binds are not directly tied to the GPU, it shouldn't affect your code that much. Plus, there's already a lot of context state tied to pixel transfer operations; what does one more matter?
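In practice that means the DSA calls and the classic binding point end up side by side. A minimal sketch, assuming GL 4.5 (tex, width and height are placeholders):

// Create and fill the PBO with DSA calls...
GLuint pbo;
glCreateBuffers(1, &pbo);
glNamedBufferData(pbo, width * height * 4, NULL, GL_STREAM_DRAW);
// ...fill it via glNamedBufferSubData or glMapNamedBuffer as needed...

// ...but the pixel transfer itself still goes through the UNPACK binding point:
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glTextureSubImage2D(tex, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, (void *)0); // reads from the bound PBO, offset 0
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);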
I whipped up a simple C program (on github) that uses OpenGL to draw a bunch of triangles from a buffer that was allocated with glBufferStorage like so:
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
GLbitfield bufferStorageFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_ARRAY_BUFFER, vboSize, 0, bufferStorageFlags);                    // immutable storage, no initial data
vert *triData = glMapBufferRange(GL_ARRAY_BUFFER, 0, vboSize, bufferStorageFlags);   // persistent, coherent mapping
I am aware that synchronization is my responsibility when using glBufferStorage with MAP_PERSISTENT_BIT, but I'm not sure exactly what I need to protect against.
The only time I touch triData is before calling glDrawArrays on it, and after calling SDL_GL_SwapWindow, so I know drawing of the last frame is done, and I haven't called for drawing of this frame to begin yet.
This appears to work perfectly, even with vsync disabled.
The wiki says:
Swapping the back and front buffers on the Default Framebuffer may cause some form of synchronization ... if there are still commands affecting the default framebuffer that have not yet completed. Swapping buffers only technically needs to sync to the last command that affects the default framebuffer, but it may perform a full glFinish.
But every article I've read on the subject makes extensive use of GLsync pointers, though maybe they were just assuming I might want to use the buffer in more complex ways?
For now, am I right to believe SDL_GL_SwapWindow is providing sufficient synchronization?
The previous answers are correct in saying that you do need synchronization even after you use a swap. But I wanted to make even clearer that this is more than just a theoretical concern.
Swap operations are typically not synchronous. It's very common to let the rendering get 1-2 frames ahead of the display. This is done to reduce "bubbles" where the GPU temporarily goes into an idle state. If your swap call were synchronous, the GPU would unavoidably be idle at the time it returns, since all previously submitted work would have completed. Even if you immediately started rendering again, it would take a little time for that work to actually reach the GPU for execution. So you have times where the GPU does nothing, which hurts performance at least as long as your rendering is entirely GPU limited.
Now, you obviously don't want the rendering to get too far ahead of the display. Undesired side effects of that would be increased latency in responding to user input (which is a big deal for games), and excessive memory usage for queued-up rendering commands. Therefore, there needs to be throttling before this happens. This throttling is often applied as part of swap operations, but it can potentially happen almost anywhere.
So if you measure the wall clock time a swap call takes to return, it's fairly common for it to be long enough to suggest that it's blocking. But this does not mean that the call itself is synchronous. It may just be blocking until a previous frame completes, to prevent the rendering from getting too far ahead of the display.
Here's my favorite advice about any multithreaded/asynchronous code:
If multithreaded code isn't immediately, obviously, provably correct then it is almost certainly wrong.
You cannot prove that OpenGL will not read from a value you are writing to. Therefore, it is wrong, even if no problems are apparent.
Yes, you need to do explicit synchronization. Even though you coherently mapped the buffer, you still cannot change the values in it while OpenGL might be reading from them. You must wait until after the last call that reads from that data before writing to it again. And the only ways OpenGL gives you to wait for that are glFinish and glClientWaitSync.
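For example, a minimal fence-based version of your frame loop might look like this (triData is from your code; window and the one-second timeout are placeholders):

// After the last draw call that reads from the mapped buffer:
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
SDL_GL_SwapWindow(window);

// Before writing to triData for the next frame:
GLenum r = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000); // wait up to 1 s
glDeleteSync(fence);
// Only touch triData once r is GL_ALREADY_SIGNALED or GL_CONDITION_SATISFIED.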
I am aware that synchronization is my responsibility when using glBufferStorage,
No, not necessarily. A buffer created with glBufferStorage is no different from a buffer created with glBufferData, except for the fact that you can't re-specify its storage.
You only need to do manual synchronization when mapping with GL_MAP_PERSISTENT_BIT (which was introduced by the same extension as glBufferStorage, ARB_buffer_storage).
I want a really fast way to capture the content of the OpenGL framebuffer for my application. Generally, glReadPixels() is used for reading the content of the framebuffer into a buffer, but this is slow.
I was trying to parallelise the process of reading the framebuffer content by creating 4 threads to read the framebuffer from 4 different regions using glReadPixels(). But the application exits due to a segmentation fault. If I remove the glReadPixels() call from the threads, the application runs properly.
Threads will not work; abstain from that approach.
Creating several threads fails, as you have noticed, because only one thread has a current OpenGL context. In principle, you could make the context current in each worker thread before calling glReadPixels, but this will require extra synchronization from your side (otherwise, a thread could be preempted in between making the context current and reading back!), and (wgl|glx)MakeCurrent is a terribly slow function that will seriously stall OpenGL. In the end, you'll be doing more work to get something much slower.
There is no way to make glReadPixels any faster [1], but you can decouple the time it takes (i.e. the readback runs asynchronously), so it does not block your application and effectively appears to run "faster".
You want to use a pixel buffer object (PBO) for that. Be sure to get the buffer usage flags correct.
Note that mapping the buffer to access its contents will still block if the complete contents haven't finished transferring, so it will still not be any faster. To account for that, you either have to read the previous frame, or use a fence object which you can query to be sure that it's done.
Or, simpler but less reliable, you can insert "some other work" in between glReadPixels and accessing the data. This will not guarantee that the transfer has finished by the time you access the data, so it may still block. However, it may just work, and it will likely block for a shorter time (thus run "faster").
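A minimal sketch of such an asynchronous readback (width, height and bufSize = width * height * 4 are placeholders for a GL_RGBA/GL_UNSIGNED_BYTE readback):

// Setup, once:
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, bufSize, NULL, GL_STREAM_READ);

// Kick off the readback; with a PACK buffer bound this returns without waiting for the data:
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, (void *)0);

// ...do other work, or come back one frame later...

// Map to access the pixels; this is where you may still block if the transfer isn't done:
void *pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
// use pixels here
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);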
[1] There are plenty of ways of making it slower, e.g. if you ask OpenGL to do some weird conversions or if you use the wrong buffer flags. However, generally, there's no way to make it faster since its speed depends on all previous draw commands having finished before the transfer can even start, and on the data being transferred over the PCIe bus (which has a fixed time overhead plus a finite bandwidth).
The only viable way of making readbacks "faster" is hiding this latency. It's of course still not faster, but you don't get to feel it.
The standards allude to rendering starting upon my first GL command and continuing in parallel with further commands. Certain functions, like glBufferSubData, indicate that loading can happen during rendering as long as the object is not currently in use. This introduces a logical concept of a "frame", though it is never explicitly mentioned in the standard.
So my question is: what defines this logical frame? That is, which calls demarcate the frame, such that I can start making GL calls again without interfering with the previous frame?
For example, using EGL you eventually call eglSwapBuffers (most implementations have some kind of swap command). Logically this is the boundary between one frame and the next. However, this call blocks to support v-sync, meaning you can't issue new commands until it returns. Yet the documentation implies you can start issuing new commands prior to its return in another thread (provided you don't touch any in-use buffers).
How can I start issuing commands to the next buffer even while the swap command is still blocking on the previous buffer? I would like to start streaming data for the next frame while the GPU is working on the old frame (in particular, I will have two vertex buffers which would be swapped each frame specifically for this purpose, as alluded to in the OpenGL documentation).
OpenGL has no concept of "frame", logical or otherwise.
OpenGL is really very simple: every command executes as if all prior commands had completed beforehand.
Note the key phrase "as if". Let's say you render from a buffer object, then modify its data immediately afterwards. Like this:
glBindVertexArray(someVaoThatUsesBufferX);
glDrawArrays(...);
glBindBuffer(GL_ARRAY_BUFFER, BufferX);
glBufferSubData(GL_ARRAY_BUFFER, ...);
This is 100% legal in OpenGL. There are no caveats, questions, concerns, etc. about exactly how this will function. That glBufferSubData call will execute as though the glDrawArrays command had finished.
The only thing you have to consider is the one thing the specification does not specify: performance.
An implementation is well within its rights to detect that you're modifying a buffer that may be in use, and therefore stall the CPU in glBufferSubData until the rendering from that buffer is complete. The OpenGL implementation is required to do either this or something else that prevents the actual source buffer from being modified while it is in use.
So OpenGL implementations execute commands asynchronously where possible, according to the specification. As long as the outside world cannot tell that glDrawArrays didn't finish drawing anything yet, the implementation can do whatever it wants. If you issue a glReadPixels right after the drawing command, the pipeline would have to stall. You can do it, but there is no guarantee of performance.
This is why OpenGL is defined as a closed box the way it is. This gives implementations lots of freedom to be asynchronous wherever possible. Every access of OpenGL data requires an OpenGL function call, which allows the implementation to check to see if that data is actually available yet. If not, it stalls.
Getting rid of stalls is one reason why buffer object invalidation is possible; it effectively tells OpenGL that you want to orphan the buffer's data storage. It's the reason why buffer objects can be used for pixel transfers; it allows the transfer to happen asynchronously. It's the reason why fence sync objects exist, so that you can tell whether a resource is still in use (perhaps for GL_UNSYNCHRONIZED_BIT buffer mapping). And so forth.
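As a small illustration of the fence case, you can poll a sync object without blocking (a sketch only; the resource and fence names are made up):

// Right after the last command that uses the resource:
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Later, check without stalling:
GLint status = GL_UNSIGNALED;
glGetSynciv(fence, GL_SYNC_STATUS, sizeof(status), NULL, &status);
if (status == GL_SIGNALED) {
    glDeleteSync(fence);
    // the GPU is done with the resource; it can now be modified without causing a stall
}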
However, this calls blocks to support v-sync, meaning you can't issue new commands until it returns.
Says who? The buffer swapping command may stall. It may not. It's implementation-defined, and it can be changed with certain commands. The documentation for eglSwapBuffers only says that it performs a flush, which could stall the CPU but does not have to.
So from looking around at examples and tutorials, it seems the most common method of placing buffers in the pipeline is: every model object gets its own vertex buffer, and then after the buffers are filled, they lock, set the buffers, unlock, set shaders, draw, and rinse/repeat for every model's individual buffer. It seems to me that all that locking and unlocking would slow things down a bit.
So I'm wondering if maybe the model objects could instead aggregate all their vertices into one big array and all the indices into another, create one large buffer, lock once, set the buffers once, unlock, and then switch out shaders and draw as many polygons as required with those shaders, working along the buffer and switching shaders as before, instead of having to lock and drop more vertices into the pipeline every time before you draw.
Would this be any more efficient, or do you think the overhead from all the bookkeeping involved (for example, "from index a to index b, use this shader") would make this more work than it's worth?
Also, if I have missed a concept of D3D here, please inform me. (I'm new.)
EDIT
Due to a massive misunderstanding: anywhere I referred to locking and unlocking was actually supposed to mean calling IASetVertexBuffer/IASetIndexBuffer. The "revised" question is more or less:
Does stuffing the vertices for all the models in the scene into one single buffer, and simply calling IASetVertexBuffer once, improve performance at all?
So from looking around at examples and tutorials
Stop. Most "examples and tutorials" for anything are not intended to show best performance practices, unless they are specifically about best performance practices. They're trying to show, in the clearest and cleanest way, how to perform task X. Optimization is an entirely different issue. Optimized code is a lot less clear and clean than unoptimized code; thus, many optimizations would get in the way of the tutorial's stated purpose.
So never assume that just because a tutorial does it some way, that's the fastest way to do something. It is simply one way to do it.
then after the buffers are filled, they lock, set the buffers, unlock, set shaders, draw, and rinse/repeat for every model's individual buffer.
Locking and unlocking is for modifying the buffer. If you're not modifying it... why are you locking it? And if you are modifying it, then you're doing some form of buffer streaming, which requires special handling in order to make it efficient.
If you're doing streaming, then that's a different question you should ask (ie: how to do high-performance vertex streaming).
That isn't to say that putting the data for multiple objects in one buffer isn't a good idea. But if it is, the reason for it has less to do with locking and unlocking and more to do with the possibility of drawing multiple objects with a single draw call.
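To make that alternative concrete, here is a rough D3D11-style sketch (all names, such as bigVB, bigIB and the Model fields, are made up; the per-model offsets are whatever you recorded while filling the shared buffers):

// Bind the big vertex/index buffers once...
UINT stride = sizeof(Vertex), offset = 0;
context->IASetVertexBuffers(0, 1, &bigVB, &stride, &offset);
context->IASetIndexBuffer(bigIB, DXGI_FORMAT_R32_UINT, 0);

// ...then walk the ranges, switching shaders only when needed:
for (const Model &m : models) {
    context->VSSetShader(m.vs, nullptr, 0);
    context->PSSetShader(m.ps, nullptr, 0);
    context->DrawIndexed(m.indexCount, m.startIndex, m.baseVertex);
}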
In general, the fewer locks the better; every lock has to be an in-sync transfer between system memory and graphics card memory that stalls your GPU. The more of these transfers you can batch together, the better.
An even better improvement however is to leave buffers that don't change alone. You won't always need to reload bench #1221 every. single. frame. It never changes (*). So load your static art at the beginning and just draw it as needed. And before you think of culling half the bench away in preprocessing, think twice about the cost of locking a buffer just to get rid of a few vertices when your GPU already knows how to do basic culling at lightning speeds.
(*) assuming it doesn't change of course :)
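For what it's worth, a static buffer like that bench can be created once with immutable usage and never touched again (a D3D11 sketch; benchVertices and device are placeholders):

// Created once at load time; you never lock or update it afterwards.
D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = sizeof(benchVertices);
desc.Usage = D3D11_USAGE_IMMUTABLE;                 // contents can never change after creation
desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

D3D11_SUBRESOURCE_DATA init = {};
init.pSysMem = benchVertices;

ID3D11Buffer *benchVB = nullptr;
device->CreateBuffer(&desc, &init, &benchVB);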