Efficiently update Uniform Buffer Objects with instancing and culling - opengl

I've successfully updated my rendering engine to use uniform buffer objects and instancing.
The problem is that, since I do a frustum-culling pass every frame to determine which objects I need to draw, I have to update the buffers every frame, because the set of objects I draw can change every time. This isn't the most efficient thing.
How could I make this more efficient?
The only thing I could think of is to skip the frustum culling, so all the buffers remain static and I don't need to update them all the time, but without frustum culling I'd end up drawing a lot of unnecessary objects.

Updating uniform buffers is fairly cheap, to be honest. You are quite limited in size and that prevents you from doing anything too crazy.
What you need to focus on to make this efficient is accommodating the commands that are still queued up. You are more likely to run into problems where the driver/GPU is forced to stall between frames due to poor data write patterns than you are to hit data transfer rate limitations. The problem is always going to be avoiding situations where you write to portions of a buffer that are still in use by the GPU (it is often working on data 1-2 frames behind the CPU).
You have multiple options depending on your target version, and the OpenGL Wiki has a general overview of buffer streaming approaches.
You will have to do some performance testing to say for sure, but I suspect that CPU-side frustum culling combined with buffer orphaning of your instance UBO will give good results. Rather than reusing any data from previous frames, you would just stream the entire instance UBO from CPU to GPU each frame and let the GPU discard the old UBO when it finishes each frame.
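A minimal sketch of that orphan-and-stream pattern in C, assuming a current GL context, a UBO already created with glGenTextures-style glGenBuffers, and a hypothetical InstanceData layout filled by the culling pass:

```c
#include <stddef.h>
#include <GL/glew.h>  /* any loader that provides the GL 3.1+ prototypes */

/* Hypothetical per-instance layout; must match the uniform block in the shader. */
typedef struct { float model[16]; float color[4]; } InstanceData;

/* Called once per frame, after the CPU-side frustum culling pass. */
static void upload_visible_instances(GLuint ubo, const InstanceData *visible,
                                     size_t visibleCount, size_t maxInstances)
{
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    /* Orphan: request fresh storage while the GPU may still be reading
       last frame's instances from the old storage. */
    glBufferData(GL_UNIFORM_BUFFER, maxInstances * sizeof(InstanceData),
                 NULL, GL_STREAM_DRAW);
    /* Stream this frame's culled instances into the new storage. */
    glBufferSubData(GL_UNIFORM_BUFFER, 0,
                    visibleCount * sizeof(InstanceData), visible);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);  /* uniform block binding 0 */
}
```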

Related

Cost of large buffer switch vs. small buffer switch

I'm creating a tile-based renderer where each tile has a vertex model. However, from each vertex model only a small portion is rendered in one frame. These subsets change every frame.
What would be the fastest way to render this? I can think of the following options:
Make one draw call for every model. Every model is stored in full on the GPU. For every draw call, the full VBO is bound each time, and indices are then used to pick out the appropriate small portion for the actual rendering.
Make one draw call with one VBO which gets assembled every frame by copying the necessary (small) subset of all the other VBOs (the data is copied within VRAM).
Make one draw call with one VBO, but the VBO is recreated every frame with the (small) subset from CPU data using glBufferData.
Which do you think is fastest, or can you think of something faster?
One deciding factor is obviously whether switching between larger VBOs is more expensive than switching between smaller VBOs.
It is a bad idea to make a lot of draw calls. In OpenGL, you will be CPU-bound by this method, so it is better to batch a lot of models.
Actually, I would go for that batched approach: all static geometry lives inside one and only one VBO and one VAO. It does not mean that you only have "one draw call", though; you should use glMultiDraw*Indirect.
The idea buried in there is that you can use compute shaders to perform the culling on the GPU, and use something like the ARB_indirect_parameters extension with your multi-indirect draw call.
Indirect Drawing
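As a rough sketch of that path (assumes GL 4.3+, a current context, and one shared VAO/VBO/index buffer already set up; the indirect buffer would normally have been filled by the GPU culling pass):

```c
#include <GL/glew.h>

/* Command layout mandated by OpenGL for glMultiDrawElementsIndirect. */
typedef struct {
    GLuint count;          /* indices in this sub-draw                    */
    GLuint instanceCount;  /* a GPU culling pass can zero this to skip it */
    GLuint firstIndex;     /* offset into the shared index buffer         */
    GLuint baseVertex;     /* offset into the shared vertex buffer        */
    GLuint baseInstance;
} DrawElementsIndirectCommand;

/* One API call issues drawCount sub-draws out of the indirect buffer. */
static void draw_batched(GLuint indirectBuffer, GLsizei drawCount)
{
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                (const void *)0, drawCount, 0 /* tightly packed */);
}
```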
For all dynamic geometry, you can use a persistent buffer.
To answer your question about changing VAOs/VBOs: changing the VAO, or using glBindVertexBuffer, should not add much overhead.
But you should profile it; it can depend on your driver / hardware :)

OpenGL what does glTexImage2D do?

What does gl.glTexImage2D do? The docs say it "uploads texture data". But does this mean the whole image is in GPU memory? I'd like to use one large image file for texture mapping. Further: can I simply use a VBO for uv and position coordinates to draw the texture?
Right, I am using words the wrong way here. What I meant was carrying a 2D array of UV coordinates and a 2D array of model vertices to subsample a larger PNG image (in texture memory) onto individual tile models. My confusion here lies in not knowing how fast these fetches can be. Let's say I have a 5000x5000-pixel image. I load it as a texture. Then I create my own algorithm for fetching portions of it to draw. Where do I save myself the bandwidth for drawing these tiles? If I implement an LOD algorithm to determine which tiles are close, which are far, and which are out of the camera frustum, how do I manage each of these tiles in memory? Loaded question, I know, but I am struggling to find the best implementation to get started. I am developing for mobile devices with OpenGL ES 2.0.
What exactly happens when you call glTexImage2D() is system dependent, and there's no way for you to know, unless you have developer tools that allow you to track GPU and memory usage.
The only thing guaranteed is that the data you pass to the call has been consumed by the time the call returns (since the API definition allows you to modify/free the data after the call), and that the data is accessible to the GPU when it's used for rendering. Between that, anything is fair game. Keep in mind that OpenGL is a very asynchronous API. When you make API calls, the corresponding work is mostly queued up for later execution by the GPU, and is generally not completed by the time the calls return. This can include calls for uploading data.
Also, not all GPUs have "GPU memory". In fact, if you look at them by quantity, very few of them do. Mobile GPUs have caches, but mostly not VRAM in the sense of traditional discrete GPUs. How VRAM and caches are managed is highly system dependent.
With all the caveats above, and picturing a GPU that has VRAM: While it's possible that they can load the data into VRAM in the glTexImage2D() call, I would be surprised if that was commonly done. It just wouldn't make much sense to me. When a texture is loaded, you have no idea how soon it will be used for rendering. Since you don't know if all textures will fit in VRAM (and they often will not), you might have to evict it from VRAM before it was ever used. Which would obviously be very wasteful. As a general strategy, I think it will be much more efficient to load the texture data into VRAM only when you have a draw call that uses it.
Things would be somewhat different if the driver could be very confident that all texture data will fit in VRAM. But with OpenGL, there's really no reasonable way to know this ahead of time. And things get even more complicated since at least on desktop computers, you can have multiple applications running at the same time, while VRAM is a shared resource.
You are correct.
glTexImage2D() is the function that actually moves the texture data across to the GPU.
You will need to create the texture object first using glGenTextures() and then bind it using glBindTexture().
There is a good example of this process in the OpenGL Red Book:
example
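A minimal sketch of that create/bind/upload sequence in C, assuming a current GL context and RGBA pixel data already decoded by your image loader (the loader itself is not shown):

```c
#include <GL/glew.h>

/* pixels: width*height RGBA bytes from your image loader (hypothetical). */
static GLuint create_texture(const unsigned char *pixels, int width, int height)
{
    GLuint tex;
    glGenTextures(1, &tex);             /* create the texture object       */
    glBindTexture(GL_TEXTURE_2D, tex);  /* bind it so later calls target it */
    /* Hand the data to the driver; you may free `pixels` once this returns. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    /* Non-mipmap filtering, so the texture is complete without mip levels. */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    return tex;
}
```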
You can then use this texture with a VBO. There are many ways to accomplish this, but interleaving your vertex coordinates, texture coordinates, and vertex normals, and then telling the GPU how to unpack them with several calls to glVertexAttribPointer, is your best bet as far as performance goes.
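For instance, such an interleaved layout could be described like this (a sketch, assuming the interleaved VBO and a VAO are already bound, and hypothetical attribute locations 0-2):

```c
#include <GL/glew.h>

/* Call while the VAO and the interleaved VBO are bound.
   Per-vertex layout: [x y z] [u v] [nx ny nz] = 8 floats. */
static void setup_interleaved_attribs(void)
{
    const GLsizei stride = 8 * sizeof(float);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride, (void *)0);                   /* position */
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, stride, (void *)(3 * sizeof(float))); /* texcoord */
    glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, stride, (void *)(5 * sizeof(float))); /* normal   */
    glEnableVertexAttribArray(0);
    glEnableVertexAttribArray(1);
    glEnableVertexAttribArray(2);
}
```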
You are on the right track with VBOs; the old fixed-function pipeline is deprecated, so you should just learn VBOs from the outset.
This book is not 100% up to date, but it is complete and free, and should serve as a great place to start learning VBOs: Open GL Book

OpenGL state redundancy elimination Tree, render state priorities

I am working on an automatic OpenGL batching method in my game engine, to reduce draw calls and redundant state changes.
My batch tree design begins with the most expensive states and adds leaves down the tree for each less expensive state.
Example:
Tree Root: Shaders / Programs
Siblings: Blend states ...
and so on.
So my question is what are most likely the most expensive calls, in this list:
binding program
binding textures
binding buffers
buffering texture, vertex data
binding render targets
glEnable / glDisable
blend state equation, color, functions, colorWriteMask
depth stencil state depthFunc, stencilOperations, stencilFunction, writeMasks
Also wondering which method will be faster:
- Collect all batchable draw commands into a single vertex buffer and issue only one draw call (this method would also force matrix transforms to be applied per vertex on the CPU side)
- Do not batch at all and render many small draw calls, batching only the particle systems ...
PS: Render targets will always be changed pre- or post-pass, depending on usage.
Progress so far:
Andon M. Coleman: Cheapest Uniform & Vertex Array Binding, Expensive FBO, Texture Bindings
datenwolf: Programs invalidate State Cache
1: Framebuffer states
2: Program
3: Texture Binding
...
N: Vertex Array binding, Uniform binding
Current execution Tree in WebGL:
Program
Attribute Pointers
Texture
Blend State
Depth State
Stencil Front / Back State
Rasterizer State
Sampler State
Bind Buffer
Draw Arrays
Each step is a sibling hash tree, to avoid checking against the state cache inside the main render queue.
Loading Textures / Programs / Shaders / Buffers happens before rendering in an extra queue, for future multi threading and also to be sure that the context is initialized before doing anything with it.
The biggest problem with self-rendering objects is that you cannot control when something happens. For example, if a developer calls these methods before GL is initialized, he wouldn't know why, but he would run into bugs or other problems...
The relative costs of such operations will of course depend on the usage pattern and your general scenario. But you might find Nvidia's "Beyond Porting" presentation slides a useful guide. Let me reproduce slide 48 in particular here:
Relative Cost of state changes
In decreasing cost...
Render Target ~60K/s
Program ~300K/s
ROP
Texture Bindings ~1.5M/s
Vertex Format
UBO Bindings
Uniform Updates ~10M/s
This does not directly match all of the bullet points in your list. E.g. glEnable/glDisable might affect anything. Also, GL's buffer bindings are not something the GPU directly sees; buffer bindings are mainly client-side state, depending on the target, of course. A change of blend state would be a ROP state change, and so on.
This tends to be highly platform/vendor dependent. Any numbers you may find apply to a specific GPU, platform and driver version. And there are a lot of myths floating around on the internet about this topic. If you really want to know, you need to write some benchmarks, and run them across a range of platforms.
With all these caveats:
Render target (FBO) switching tends to be quite expensive. Highly platform and architecture dependent, though. For example if you have some form of tile based architecture, pending rendering that would ideally be deferred until the end of the frame may have to be completed and flushed out. Or on more "classic" architectures, there might be compressed color buffers or buffers used for early depth testing that need consideration when render targets are switched.
Updating texture or buffer data is impossible to evaluate in general terms. It obviously depends heavily on how much data is being updated. Contrary to some claims on the internet, calls like glBufferSubData() and glTexSubImage2D() do not typically cause a synchronization. But they involve data copies.
Binding programs should not be terribly expensive, but is typically still more heavyweight than the state changes below.
Texture binding is mostly relatively cheap. But it really depends on the circumstances. For example, if you use a GPU that has VRAM, and the texture is not in VRAM at the moment, it might trigger a copy of the texture data from system memory to VRAM.
Uniform updates. This is supposedly very fast on some platforms. But it's actually moderately expensive on others. So there's a lot of variability here.
Vertex state setup (including VBO and VAO binding) is typically fast. It has to be, because it's done so frequently by most apps that it can very quickly become a bottleneck. But there are similar considerations as for textures, where buffer memory may have to be copied/mapped if it was not used recently.
General state updates, like blend states, stencil state, or write masks, are generally very fast. But there can be very substantial exceptions.
Just a typical example of why characteristics can be so different between architectures: If you change blend state, that might be sending a couple of command words on one architecture, with minimal overhead. On other architectures, blending is done as part of the fragment shader. So if you change blend state, the shader program has to be modified to patch in the code for the new blending calculation.
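Whatever the exact numbers on your platform, the practical consequence for a batching scheme is the same: group draws by the expensive states first. A lightweight alternative to a full tree is to pack the state handles into a single sort key and sort the draw list by it each frame. The sketch below is illustrative only; the bit widths and handle types are made up:

```c
#include <stdint.h>
#include <stdlib.h>

/* Most expensive state goes in the highest bits, so sorting groups draws
   by render target first, then program, then texture. Widths are examples. */
static uint64_t make_sort_key(uint16_t fbo, uint16_t program, uint16_t texture)
{
    return ((uint64_t)fbo     << 48) |
           ((uint64_t)program << 32) |
           ((uint64_t)texture << 16);
}

typedef struct { uint64_t key; int drawIndex; } DrawCmd;

static int cmp_draw(const void *a, const void *b)
{
    uint64_t ka = ((const DrawCmd *)a)->key;
    uint64_t kb = ((const DrawCmd *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* After sorting, walk the list and change a piece of state only when its
   bits differ from the previous command's key. */
static void sort_draws(DrawCmd *cmds, size_t n)
{
    qsort(cmds, n, sizeof(DrawCmd), cmp_draw);
}
```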

Simple 2D Culling in OpenGL using VBOs

I am looking into using a VBO instead of immediate mode for performance reasons. I am creating a 2D orthographic scene filled with sprites. I do not want to draw sprites that are off-screen. I do this by checking their position against the screen size and position of the camera.
In immediate mode this is simple; there is a draw method for each sprite. Using a VBO this seems non-trivial; I render an entire section of a VBO at one time. There would be no way for me (that I can think of) to opt out of rendering sprites that are off-screen.
I'll just assume that you do indeed animate the sprites on the CPU, because that's the only thing that makes sense in the light of your question (otherwise, how would you draw them in immediate mode initially, and how would you skip drawing some).
AGP/PCIe behaves much like a hard disk from a performance point of view. Bandwidth is huge, but access time is quite noticeable. In other words, doing a transfer at all is painful, but once you do it, a few kilobytes more don't really make any difference. Uploading 500 sprites and uploading 1000 sprites is the same thing.
Since you animate the sprites on the CPU, you already must do one transfer (glBufferSubData or glMapBuffer/glUnmapBuffer) every frame, there is no other way.
Be sure to use a "fresh" buffer, e.g. by applying the glBufferData(NULL) idiom. This avoids pipeline stalls by allowing OpenGL to continue using (drawing from) the buffer while giving you a different buffer (without you knowing) at the same time. Later, when it is done drawing, it just secretly flips buffers and throws the old one away. That way, you achieve good parallelism (which is key to performance and much more important than culling a few thousand vertices).
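On GL 3.0+ you can get the same "fresh buffer" behavior by mapping with an invalidate flag. A sketch, where SpriteVertex and the vertex data are hypothetical:

```c
#include <string.h>
#include <GL/glew.h>

typedef struct { float x, y, u, v; } SpriteVertex;  /* hypothetical layout */

static void upload_sprites(GLuint vbo, const SpriteVertex *verts, size_t count)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    /* GL_MAP_INVALIDATE_BUFFER_BIT is the mapping equivalent of the
       glBufferData(NULL) idiom: the driver may hand back fresh memory
       instead of stalling until the GPU has finished drawing. */
    void *dst = glMapBufferRange(GL_ARRAY_BUFFER, 0,
                                 count * sizeof(SpriteVertex),
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    if (dst) {
        memcpy(dst, verts, count * sizeof(SpriteVertex));
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```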
Also, graphics cards are reasonably good at culling geometry (this includes discarding entire triangles that are off-screen before fragments are generated). Hundreds? Thousands? Hundred thousands? No issue. Let the graphics card do it.
Unless you have a million sprites of which one half is visible at a time and the other half isn't, it is quite likely that writing the entire buffer continuously and without branches is not only just as fast, but even faster, due to cache and pipeline effects.

Better to create new VBOs or just swap the data? (OpenGL)

So in an OpenGL rendering application, is it usually better to create and maintain a vertex buffer throughout the life of the application and just swap out the data every frame with glBufferData, or is it better to delete the VBO and recreate it every frame?
Intuition tells me it's better to swap out the data, but a few sample programs I've seen do the latter, so I'm kind of confused.
I read Nvidia's whitepaper on VBOs, but as I'm a newbie to OpenGL, it didn't make a whole lot of sense.
Thanks in advance for any advice.
Since you're generating a whole new set of data each frame, the documentation seems to indicate that GL_STREAM_DRAW is the Right Way to go about things.
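To make the two options concrete, here is a sketch of both (function names and the float data are illustrative):

```c
#include <stddef.h>
#include <GL/glew.h>

/* Option 1 (preferred): create the buffer once, re-specify its data each
   frame. GL_STREAM_DRAW hints that the contents are rewritten every frame. */
static void update_vbo(GLuint vbo, const float *verts, size_t bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, verts, GL_STREAM_DRAW);
}

/* Option 2 (what some samples do): delete and recreate the buffer object.
   This adds per-frame name management for no obvious benefit. */
static GLuint recreate_vbo(GLuint oldVbo, const float *verts, size_t bytes)
{
    GLuint vbo;
    glDeleteBuffers(1, &oldVbo);
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, verts, GL_STREAM_DRAW);
    return vbo;
}
```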
The significant thing about VBOs is that they are render data buffers that are stored in graphics memory and not in the computer's main memory. That makes their usage very efficient when the data in them isn't being updated (too) frequently, because every time you do that, the computer will have to transfer (potentially huge amounts of) data from main to graphics memory - which is slow.
So the ideal case is to put all required render data into VBOs once and then only manipulate them via OpenGL functions like matrix transformation or via shaders.
So you would e.g. put each mesh's world space coordinates and texture coordinates into VBOs and never directly touch them again; you'd use the modelview matrix, lighting functions and shaders to render them.
You can do more to optimize VBO usage, but that's the basics as I have understood them.
Find some good hints and more details here: How do I use OpenGL 3.x VBOs to render a dynamic world?