I've worked on a variety of demo projects with OpenGL and C++, but they've all involved simply rendering a single cube (or similarly simple mesh) with some interesting effects. For a simple scene like this, the vertex data for the cube could be stored in an inelegant global array. I'm now looking into rendering more complex scenes, with multiple objects of different types.
I think it makes sense to have different classes for different types of objects (Rock, Tree, Character, etc), but I'm wondering how to cleanly break up the data and rendering functionality for objects in the scene. Each class will store its own array of vertex positions, texture coordinates, normals, etc. However I'm not sure where to put the OpenGL calls. I'm thinking that I will have a loop (in a World or Scene class) that iterates over all the objects in the scene and renders them.
Should rendering them involve calling a render method in each object (Rock::render(), Tree::render(),...) or a single render method that takes in an object as a parameter (render(Rock), render(Tree),...)? The latter seems cleaner, since I won't have duplicate code in each class (although that could be mitigated by inheriting from a single RenderableObject class), and it allows the render() method to be easily replaced if I want to later port to DirectX. On the other hand, I'm not sure if I can keep them separate, since I might need OpenGL specific types stored in the objects anyway (vertex buffers, for example). In addition, it seems a bit cumbersome to have the render functionality separate from the object, since it will have to call lots of Get() methods to get the data from the objects. Finally, I'm not sure how this system would handle objects that have to be drawn in different ways (different shaders, different variables to pass in to the shaders, etc).
Is one of these designs clearly better than the other? In what ways can I improve upon them to keep my code well-organised and efficient?
Firstly, don't even bother with platform independence right now. Wait until you have a much better idea of your architecture.
Doing a lot of draw calls/state changes is slow. In an engine, you will generally want a renderable class that can draw itself. This renderable will be associated with whatever buffers it needs (e.g. vertex buffers) and other information (such as vertex format, topology, index buffers, etc.). Shader input layouts can be associated with vertex formats.
You will want to have some primitive geo classes, but defer anything complex to some type of mesh class which handles indexed tris. For a performant app, you will want to batch up calls (and potentially data) for similar input types in your shading pipeline to minimise unnecessary state changes and pipeline flushes.
Shader parameters and textures are generally controlled via some material class that is associated with the renderable.
Each renderable in a scene itself is usually a component of a node in a hierarchical scene graph, where each node usually inherits the transform of its ancestors through some mechanism. You will probably want a scene culler that uses a spatial partitioning scheme to do fast visibility determination and avoid draw call overhead for things out of view.
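As a rough illustration of how a node can inherit the transform of its ancestors, here is a minimal sketch. The names (Vec3, SceneNode, updateWorld) are hypothetical, and the transform is reduced to a translation to keep it short; a real engine would typically compose full 4x4 matrices in the same top-down walk.

```cpp
#include <memory>
#include <vector>

// Hypothetical types; translation-only transforms keep the sketch short.
struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }

struct SceneNode {
    Vec3 localOffset{0.0f, 0.0f, 0.0f};  // transform relative to the parent
    Vec3 worldOffset{0.0f, 0.0f, 0.0f};  // cached result of the last update
    std::vector<std::unique_ptr<SceneNode>> children;

    SceneNode* addChild(Vec3 offset) {
        children.push_back(std::make_unique<SceneNode>());
        children.back()->localOffset = offset;
        return children.back().get();
    }

    // Walk the hierarchy top-down so each node inherits its ancestors' transform.
    void updateWorld(Vec3 parentWorld = Vec3{0.0f, 0.0f, 0.0f}) {
        worldOffset = parentWorld + localOffset;
        for (auto& child : children) child->updateWorld(worldOffset);
    }
};
```

Calling updateWorld() once on the root per frame refreshes every node, so a renderable attached to a node only needs to read that node's cached world transform.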
The scripting/behaviour part of most interactive 3d apps is tightly connected or hooked into its scene graph node framework and an event/messaging system.
This all fits together in a high level loop where you update each subsystem based on time and draw the scene at current frame.
Obviously there are tonnes of little details left out but it can become very complex depending on how generalised and performant you want to be and what kind of visual complexity you are aiming for.
Your question of draw(renderable), vs renderable.draw() is more or less irrelevant until you determine how all the parts fit together.
[Update] After working in this space a bit more, some added insight:
Having said that, in commercial engines, it's usually more like draw(renderBatch), where each render batch is an aggregation of objects that are homogeneous in some meaningful way to the GPU, since iterating over heterogeneous objects (in a "pure" OOP scene graph via polymorphism) and calling obj.draw() one-by-one has horrible cache locality and is generally an inefficient use of GPU resources. It is very useful to take a data-oriented approach to designing how an engine talks to its underlying graphics API(s) in the most efficient way possible, batching up things as much as possible without negatively affecting the code structure/readability.
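To make the draw(renderBatch) idea a little more concrete, here is a minimal sketch that groups draw items by shader and material. DrawItem, RenderBatch and buildBatches are hypothetical names, and the integer IDs stand in for whatever GPU resources a real engine would reference.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical IDs; in a real engine these would identify GPU resources.
struct DrawItem {
    std::uint32_t shaderId;
    std::uint32_t materialId;
    std::uint32_t meshId;
};

struct RenderBatch {
    std::uint32_t shaderId;
    std::uint32_t materialId;
    std::vector<std::uint32_t> meshIds;  // everything drawable with one state setup
};

// Group items that share shader and material so state is set once per batch.
std::vector<RenderBatch> buildBatches(const std::vector<DrawItem>& items) {
    std::map<std::pair<std::uint32_t, std::uint32_t>, std::vector<std::uint32_t>> groups;
    for (const DrawItem& item : items)
        groups[{item.shaderId, item.materialId}].push_back(item.meshId);

    std::vector<RenderBatch> batches;
    batches.reserve(groups.size());
    for (auto& entry : groups)
        batches.push_back({entry.first.first, entry.first.second, std::move(entry.second)});
    return batches;
}
```

The renderer would then bind shader and material once per batch and loop over meshIds, instead of rebinding state for every object.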
A practical suggestion is to write a first engine using a naive/"pure" approach to get really familiar with the domain space. Then on a second pass (or probably rewrite), focus on the hardware: things like memory representation, cache locality, pipeline state, bandwidth, batching, and parallelism. Once you really start considering these things, you will realise that most of your initial design goes out the window. Good fun.
I think OpenSceneGraph is kind of an answer. Take a look at it and its implementation. It should provide you with some interesting insights on how to use OpenGL, C++ and OOP.
Here is what I have implemented for a physical simulation and what worked pretty well and was on a good level of abstraction. First I'd separate the functionality into classes such as:
Object - container that holds all the necessary object information
AssetManager - loads the models and textures, owns them (unique_ptr), returns a raw pointer to the resources to the object
Renderer - handles all OpenGL calls etc., allocates the buffers on the GPU and returns render handles for the resources to the object (when I want the renderer to draw the object, I call the renderer with the model render handle, texture handle and model matrix); the renderer should aggregate this information to be able to draw objects in batches
Physics - calculations that use the object along with its resources (vertices especially)
Scene - connects all the above, can also hold some scene graph, depends on the nature of the application (can have multiple graphs, BVH for collisions, other representations for draw optimization etc.)
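The split between Object, AssetManager and Renderer described above could look roughly like this. It is only a sketch of the ownership and handle flow: all member names are hypothetical, the model matrix is omitted for brevity, and the places where OpenGL calls would go are marked with comments rather than implemented.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Model { std::vector<float> vertices; };  // xyz triples

// Owns loaded resources; hands out non-owning raw pointers.
class AssetManager {
public:
    const Model* addModel(const std::string& name, std::vector<float> vertices) {
        auto model = std::make_unique<Model>();
        model->vertices = std::move(vertices);
        const Model* raw = model.get();
        models_[name] = std::move(model);
        return raw;
    }
private:
    std::unordered_map<std::string, std::unique_ptr<Model>> models_;
};

using RenderHandle = std::uint32_t;

// The only class that would talk to the graphics API.
class Renderer {
    struct Submission { RenderHandle model; RenderHandle texture; };
public:
    RenderHandle upload(const Model& model) {
        // A real renderer would create and fill a GPU buffer here.
        uploadedVertexCounts_.push_back(model.vertices.size() / 3);
        return static_cast<RenderHandle>(uploadedVertexCounts_.size() - 1);
    }
    // Draws are only queued, so they can be grouped into batches later.
    void submit(RenderHandle model, RenderHandle texture) {
        queue_.push_back({model, texture});
    }
    std::size_t pendingDraws() const { return queue_.size(); }
private:
    std::vector<std::size_t> uploadedVertexCounts_;
    std::vector<Submission> queue_;
};

struct Object {
    const Model* model = nullptr;    // CPU-side data, e.g. for physics
    RenderHandle modelHandle = 0;    // GPU-side data, opaque to the object
    RenderHandle textureHandle = 0;
};
```

Because the object only stores opaque handles, nothing outside Renderer needs to include graphics API headers.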
The problem is that the GPU is now a GPGPU (general-purpose GPU), so OpenGL or Vulkan is no longer only a rendering framework. For example, physics calculations are now performed on the GPU. The renderer might therefore turn into something like a GPUManager with other abstractions above it. Also, the most efficient way to draw is in one call: in other words, one big buffer for the whole scene that can also be edited via compute shaders to avoid excessive CPU<->GPU communication.
Nowadays I'm hearing from different places about the so called GPU driven rendering which is a new paradigm of rendering which doesn't require draw calls at all, and that it is supported by the new versions of OpenGL and Vulkan APIs. Can someone explain how it actually works on conceptual level and what are the main differences with the traditional approach?
Overview
In order to render a scene, a number of things have to happen. You need to walk your scene graph to figure out which objects exist. For each object which exists, you now need to determine if it is visible. For each object which is visible, you need to figure out where its geometry is stored, which textures and buffers will be used to render that object, which shaders to use to render the object, and so forth. Then you render that object.
The "traditional" method is for the CPU to handle this process. The scene graph lives in CPU-accessible memory. The CPU does visibility culling on that scene graph. The CPU takes the visible objects and accesses some CPU data about the geometry (OpenGL buffer object and texture names, Vulkan descriptor sets and VkBuffers, etc.), shaders, etc., transferring this as state data to the GPU. Then the CPU issues a GPU command to render that object with that state.
Now, if we go back farther, the most "traditional" method doesn't involve a GPU at all. The CPU would just take this mesh and texture data, do vertex transformations, rasterization, and so forth, producing an image in CPU memory. However, we started off-loading some of this to a separate processor. We started with the rasterization stuff (the earliest graphics chips were just rasterizers; the CPU did all the vertex T&L). Then we incorporated the vertex transformations into the GPU. When we did that, we started having to store vertex data in GPU accessible memory so the GPU could read it on its own time.
We did all of that, off-loading these things to a separate processor for two reasons: the GPU was (much) faster at it, and the CPU can now spend its time doing something else.
GPU driven rendering is just the next stage in that process. We went from no GPU, to rasterization GPU, to vertex GPU, and now to scene-graph-level GPU. The "traditional" method offloads how to render to the GPU; GPU driven rendering offloads the decision of what to render.
Mechanism
Now, the reason we haven't been doing this all along is because the basic rendering commands all take data that comes from the CPU. glDrawArrays/Elements takes a number of parameters from the CPU. So even if we used the GPU to generate that data, we would need a full GPU/CPU synchronization so that the CPU could read the data... and give it right back to the GPU.
That's not helpful.
OpenGL 4 gave us indirect rendering of various forms. The basic idea is that, instead of taking those parameters from a function call, they're just data stored in GPU memory. The CPU still has to make a function call to start the rendering operation, but the actual parameters to that call are just data stored in GPU memory.
The other half of that requires the ability of the GPU to write data to GPU memory in a format that indirect rendering can read. Historically, data on GPUs goes in one direction: data gets read for the purpose of being converted into pixels in a render target. We need a way to generate semi-arbitrary data from other arbitrary data, all on the GPU.
The older mechanism for this was to (ab)use transform feedback for this purpose, but nowadays we just use SSBOs or failing that, image load/store. Compute shaders help here as well, since they are designed to be outside of the standard rendering pipeline and therefore are not bound to its limitations.
The ideal form of GPU-driven rendering makes the scene-graph part of the rendering operation. There are lesser forms, such as having the GPU do nothing more than per-object viewport culling. But let's look at the most ideal process. From the perspective of the CPU, this looks like:
Update the scene graph in GPU memory.
Issue one or more compute shaders that generate multi-draw indirect rendering commands.
Issue a single multi-draw indirect call that draws everything.
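To give a feel for the middle step, here is a CPU-side sketch of what the command-generating pass produces. In a GPU-driven renderer this loop would run in a compute shader and write into a GPU buffer; it is written as plain C++ here purely for illustration. The struct mirrors the five 32-bit fields that indexed indirect draws read, while SceneObject and the simple depth-range test are assumptions standing in for a real culling scheme.

```cpp
#include <cstdint>
#include <vector>

// Field order follows the command layout consumed by indexed indirect draws.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

// Hypothetical per-object record; a real scene would store proper bounds.
struct SceneObject {
    float centerZ;                // distance along the view axis
    float radius;
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// Stand-in for the compute pass: one command per object that survives culling.
std::vector<DrawElementsIndirectCommand> generateCommands(
        const std::vector<SceneObject>& objects, float nearZ, float farZ) {
    std::vector<DrawElementsIndirectCommand> commands;
    for (const SceneObject& obj : objects) {
        const bool visible = obj.centerZ + obj.radius >= nearZ &&
                             obj.centerZ - obj.radius <= farZ;
        if (!visible) continue;
        commands.push_back({obj.indexCount, 1u, obj.firstIndex, obj.baseVertex, 0u});
    }
    return commands;
}
```

The resulting array is exactly the kind of data a single multi-draw indirect call would consume, which is why the CPU never needs to know how many objects survived.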
Now of course, there's no such thing as a free lunch. Doing full scene graph processing on the GPU requires building your scene graph in a way that is efficient for GPU processing. Even more importantly, visibility culling mechanisms have to be engineered with efficient GPU processing in mind. That's complexity I'm not going to address here.
Implementation
Instead, let's look at the nuts-and-bolts of making the drawing part work. We have to sort out a lot of things here.
See, the indirect rendering command is still a regular old rendering command. While the multi-draw form draws multiple distinct "objects", it's still one CPU rendering command. This means that, for the duration of this command, all rendering state is fixed.
So everything under the purview of this multi-draw operation must use the same shader, bound buffers&textures, blending parameters, stencil state, and so forth. This makes implementing a GPU-driven rendering operation a bit complicated.
State and Shaders
If you need blending, or similar state-based differences in rendering operations, then you are going to have to issue another rendering command. So in the blending case, your scene-graph processing is going to have to compute multiple sets of rendering commands, with each set being for a specific set of blending modes. You may also need to have this system sort transparent objects (unless you're rendering them with an OIT mechanism). So instead of having just one rendering command, you have a small number of them.
But the point of this exercise isn't to have only one rendering command; the point is that the number of CPU rendering commands does not change with how much stuff you're rendering. It shouldn't matter how many objects are in the scene; the CPU will be issuing the same number of rendering commands.
When it comes to shaders, this technique requires some degree of "ubershader" style: where you have a small number of rather flexible shaders. You want to parameterize your shader rather than having dozens or hundreds of them.
However things were probably going to fall out that way anyway, particularly with regard to deferred rendering. The geometry pass of deferred renderers tends to use the same kind of processing, since they're just doing vertex transformation and extracting material parameters. The biggest difference usually is with regard to doing skinned vs. non-skinned rendering, but that's really only 2 shader variations. Which you can handle similarly to the blending case.
Speaking of deferred rendering, the GPU driven processes can also walk the graph of lights, thus generating the draw calls and rendering data for the lighting passes. So while the lighting pass will need a separate draw call, it will still only need a single multidraw call regardless of the number of lights.
Buffers
Here's where things start to get interesting. See, if the GPU is processing the scene graph, that means that the GPU needs to somehow associate a particular draw within the multi-draw command with the resources that particular draw needs. It may also need to put the data into those resources, like the matrix transforms for a given object and so forth.
Oh, and you also somehow need to tie the vertex input data to a particular sub-draw.
That last part is probably the most complicated. The buffers which OpenGL/Vulkan's standard vertex input method pull from are state data; they cannot change between sub-draws of a multi-draw operation.
Your best bet is to try to put every object's data in the same buffer object, using the same vertex format. Essentially, you have one gigantic array of vertex data. You can then use the drawing parameters for the sub-draw to select which parts of the buffer(s) to use.
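A minimal sketch of the "one gigantic array" idea follows. GeometryPool, MeshRange and the Vertex layout are hypothetical; the point is only that each mesh ends up described by three numbers (index count, first index, base vertex), which are the same values a sub-draw uses to select its slice of the shared buffers.

```cpp
#include <cstdint>
#include <vector>

struct Vertex { float px, py, pz, u, v; };  // one shared vertex format

struct MeshRange {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;   // offset into the shared index array
    std::int32_t  baseVertex;   // added to every index of this mesh
};

// Appends every mesh to the same vertex and index arrays.
class GeometryPool {
public:
    MeshRange add(const std::vector<Vertex>& vertices,
                  const std::vector<std::uint32_t>& indices) {
        MeshRange range{static_cast<std::uint32_t>(indices.size()),
                        static_cast<std::uint32_t>(indices_.size()),
                        static_cast<std::int32_t>(vertices_.size())};
        vertices_.insert(vertices_.end(), vertices.begin(), vertices.end());
        indices_.insert(indices_.end(), indices.begin(), indices.end());
        return range;
    }
    const std::vector<Vertex>& vertices() const { return vertices_; }
    const std::vector<std::uint32_t>& indices() const { return indices_; }
private:
    std::vector<Vertex> vertices_;
    std::vector<std::uint32_t> indices_;
};
```

Because baseVertex is applied at draw time, each mesh keeps its own zero-based indices and nothing has to be rewritten when it is appended.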
But what do we do about per-object data (matrices, etc), things you would typically use a UBO or global uniform for? How do you effectively change the buffer binding state within a CPU rendering command?
Well... you can't. So you cheat.
First, you realize that SSBOs can be arbitrarily large. So you don't really need to change buffer binding state. What you need is a single SSBO that contains everyone's per-object data. For each vertex, the VS simply needs to pick out the correct data for that sub-draw from the giant list of data.
This is done via a special vertex shader input: gl_DrawID. When you issue a multi-draw command, the VS gets an input value that represents the index of this sub-draw operation within the multidraw command. So you can use gl_DrawID to index into a table of per-object data to fetch the appropriate data for that particular object.
This also means that the compute shader which generates this sub-draw also needs to use the index of that sub-draw to define where in the array to put the per-object data for that sub-draw. So the CS that writes a sub-draw also needs to be responsible for setting up the per-object data that matches the sub-draw.
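The lockstep between sub-draws and per-object data can be modelled on the CPU like this. It is a sketch under assumptions: DrawList, emit and fetchForDraw are made-up names, the command struct is trimmed down, and in a real renderer both arrays would be GPU buffers written by the compute shader and read in the vertex shader via gl_DrawID.

```cpp
#include <cstdint>
#include <vector>

struct DrawCommand { std::uint32_t indexCount; std::uint32_t firstIndex; };

struct PerObjectData {
    float modelMatrix[16];
    std::uint32_t textureLayer;   // example of extra per-object state
};

struct DrawList {
    std::vector<DrawCommand> commands;      // consumed by the multi-draw call
    std::vector<PerObjectData> objectData;  // would live in one large SSBO

    // Writing both arrays at the same index keeps them in lockstep.
    std::uint32_t emit(const DrawCommand& cmd, const PerObjectData& data) {
        commands.push_back(cmd);
        objectData.push_back(data);
        return static_cast<std::uint32_t>(commands.size() - 1);
    }
};

// Conceptually what the vertex shader does with gl_DrawID.
const PerObjectData& fetchForDraw(const DrawList& list, std::uint32_t drawId) {
    return list.objectData[drawId];
}
```

The index returned by emit plays the role of gl_DrawID: whichever sub-draw is executing, the same number finds its matrices and other data.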
Textures
OpenGL and Vulkan have pretty strict limits on the number of textures that can be bound. Well actually those limits are quite large relative to traditional rendering, but in GPU driven rendering land, we need a single CPU rendering call to potentially access any texture. That's harder.
Now, we do have gl_DrawID; coupled with the table mentioned above, we can retrieve per-object data. So: how do we convert this to a texture?
There are multiple ways. We could put a bunch of our 2D textures into an array texture. We can then use gl_DrawID to fetch an array index from our SSBO of per-object data; that array index becomes the array layer we use to fetch "our" texture. Note that we don't use gl_DrawID directly because multiple different sub-draws could use the same texture, and because the GPU code that sets up the array of draw calls does not control the order in which textures appear in our array.
Array textures have obvious downsides, the most notable of which is that we must respect the limitations of an array texture. All elements in the array must use the same image format. They must all be of the same size. Also, there are limits on the number of array layers in an array texture, so you might encounter them.
The alternatives to array textures differ along API lines, though they basically boil down to the same thing: convert a number into a texture.
In OpenGL land, you can employ bindless texturing (for hardware that supports it). This system provides a mechanism that allows one to generate a 64-bit integer handle which represents a particular texture, pass this handle to the GPU (since it is just an integer, use whatever mechanism you want), and then convert this 64-bit handle into a sampler type. So you use gl_DrawID to fetch a 64-bit handle from the per-object data, then convert that into a sampler of the appropriate type and use it.
In Vulkan land, you can employ sampler arrays (for hardware that supports it). Note that these are not array textures; in GLSL, these are sampler types which are arrayed: uniform sampler2D my_2D_textures[6000];. In OpenGL, this would be a compile error because each array element represents a distinct bind point for a texture, and you cannot have 6000 distinct bind points. In Vulkan, an arrayed sampler only represents a single descriptor, no matter how many elements are in that array. Vulkan implementations have limits on how many elements there can be in such arrays, but hardware that supports the feature you need to employ this (shaderSampledImageArrayDynamicIndexing) will typically offer a generous limit.
So your shader uses gl_DrawID to get an index from the per-object data. The index is turned into a sampler by just fetching the value from the sampler array. The only limitation for textures in that arrayed descriptor is that they must all be of the same type and basic data format (floating-point 2D for sampler2D, unsigned integer cubemap for usamplerCube, etc). The specifics of formats, texture sizes, mipmap counts, and the like are all irrelevant.
And if you're concerned about the cost difference of Vulkan's array of samplers compared to OpenGL's bindless, don't be; implementations of bindless are just doing this behind your back anyway.
The OpenGL tradition is to let the user manipulate OpenGL objects using an unsigned int handle. Why not just give a pointer instead? What are the advantages of unique IDs over pointers?
TL;DR: OpenGL IDs don't map bijectively to memory locations. A single OpenGL ID may refer to multiple memory locations at the same time. Also OpenGL has been designed to work for distributed rendering architectures (like X11) as well, and given an indirect context programs running on different machines may use the same OpenGL context.
OpenGL has been designed as an architecture and display system agnostic API. When OpenGL was first developed this happened in light of client-server display architectures (like X11). If you look into the OpenGL specification, even of modern OpenGL-4 it refers to clients and servers.
However, in a client/server architecture pointers make no sense. For one, the address space of the server is not accessible to the clients without jumping through some hoops. And even if you set up a shared memory mapping, the addresses of objects are not the same for client and server. Add to this that on architectures like X11 a single indirect OpenGL context can be used by multiple clients, which may even run on different machines. Pointers simply don't work for that.
Last but not least, the OpenGL object model is highly abstract and the OpenGL drawing model is asynchronous. Say I do the following:
id = glGenTextures(1)
glBindTexture(id)
glTexStorage(…)
glTexSubImage(image_a)
draw_something()
glTexSubImage(image_b)
draw_something_b()
When the end of this little snippet has been reached, nothing at all may actually have been drawn yet, because no synchronization point has been reached (glFinish, glReadPixels, a buffer swap). Note the two calls to glTexSubImage, which operate on the same id. When the pixels are finally put to the framebuffer, there are two different images to be sourced from a single texture ID, because OpenGL guarantees that things will appear as if they were drawn synchronously. So at the end of a drawing batch a single object ID may refer to a whole collection of different data sets with different locations in memory.
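A toy model may help to see why one ID can stand for several pieces of memory at once. The class below is purely hypothetical and is not how any real driver is implemented; it only imitates the observable behaviour: uploads create new backing storage, draws are queued rather than executed, and each queued draw remembers which storage was current when it was issued.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using TextureId = std::uint32_t;

// Hypothetical stand-in for a context with deferred execution.
class DeferredContext {
public:
    TextureId genTexture() { return nextId_++; }

    // Each upload creates a new backing store; the ID now refers to the newest one.
    void texSubImage(TextureId id, const std::string& image) {
        storage_.push_back(image);
        current_[id] = storage_.size() - 1;
    }

    // Nothing is drawn yet; the draw just remembers which store was current.
    void draw(TextureId id) { pending_.push_back(current_.at(id)); }

    // Synchronization point: execute all queued draws in submission order.
    std::vector<std::string> finish() {
        std::vector<std::string> sampled;
        for (std::size_t slot : pending_) sampled.push_back(storage_[slot]);
        pending_.clear();
        return sampled;
    }
private:
    TextureId nextId_ = 1;
    std::vector<std::string> storage_;                    // several stores may belong to one ID
    std::unordered_map<TextureId, std::size_t> current_;  // ID -> newest store
    std::vector<std::size_t> pending_;                    // queued draws
};
```

Replaying the snippet above against this model (upload image_a, draw, upload image_b, draw, then finish) yields image_a for the first draw and image_b for the second, even though both used the same ID. A raw pointer could not express that.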
My first consideration - having pointers would make programmers wonder if they can operate with them in a pointer-arithmetic way, e.g. by pointing to a middle of a texture to update it or something like that. Maybe even more crazy things, such as patching shaders code on-the-fly. That all sounds like a whole new cool degree of freedom, unless you think of additional complications caused by tampering with highly efficient and optimized GPU "black-box" way of operation.
For example, consider the inner workings of GPU memory allocation. It is just like with an OS: the pointers you get from the OS are not the real "physical" ones, and the OS memory manager can move things around behind the scenes while keeping the pointers the same (e.g. swapping to HDD). IDs work the same way: the GPU can optimize and pack entities with even more freedom, while keeping the nice facade of them being available at 1-2-3.
Another example: OpenGL is not actually the same across manufacturers. In fact OpenGL is just a description of an API, where each vendor can make his own implementation the way it works best for him. For example, there's no rule on how to store texture mipmaps: aligned, interleaved or whatever. Having pointers to a texture would lure developers into tampering with mipmaps, which would cause a lot of trouble in supporting various implementations, or force all the implementations to become strictly unified, which again is a bad idea for performance.
The OpenGL device (GPU) may have its own memory with its own address space, independent of the host (CPU) memory system. (Think of a discrete video card with its own onboard RAM.) The host can't (directly) access that memory, so it's not possible to have a pointer to it.
It's best to think of the GPU as a whole separate computer; it's actually possible to do OpenGL over a network, with a program running on one computer rendering graphics on the video card in another. When you set up your textures and buffers, you're basically uploading data to the GL device for its own internal use.
Suppose I have many meshes I'd like to render. I have two choices:
Bake transforms and colors for each mesh into a VBO and render with a single draw call.
Use glUniform for transforms and colors and use many draw calls (but still a single VBO)
Assuming the scene changes very little between frames, which method tends to be better?
There are more than those two choices. At least one more comes to mind:
...
....
Use attributes for transforms and colors and use many draw calls.
Choice 3 is similar to choice 2, but setting attributes (using calls like glVertexAttrib4f) is mostly faster than setting uniforms. The efficiency of setting uniforms is highly platform dependent. But they're generally not intended to be modified very frequently. They are called uniform for a reason. :)
That being said, choice 1 might be the best for your use case where the transforms/colors change rarely. If you're not doing this yet, you could try keeping the attributes that are modified in a separate VBO (with usage GL_DYNAMIC_DRAW), and the attributes that remain constant in their own VBO (with usage GL_STATIC_DRAW). Then make the necessary updates to the dynamic buffer with glBufferSubData.
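One way to keep those glBufferSubData updates small is to track which part of the dynamic data actually changed. The sketch below is a CPU-side mirror only and makes no OpenGL calls; DynamicInstanceBuffer and InstanceData are hypothetical names, indices passed to set() are assumed to be in range, and the offset and size it reports are what one would pass as the offset and size arguments of glBufferSubData.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct InstanceData { float transform[16]; float color[4]; };

// CPU mirror of a dynamic buffer that remembers which elements changed.
class DynamicInstanceBuffer {
public:
    explicit DynamicInstanceBuffer(std::size_t count)
        : data_(count), dirtyBegin_(count), dirtyEnd_(0) {}

    void set(std::size_t index, const InstanceData& value) {
        data_[index] = value;
        dirtyBegin_ = std::min(dirtyBegin_, index);
        dirtyEnd_ = std::max(dirtyEnd_, index + 1);
    }

    bool dirty() const { return dirtyBegin_ < dirtyEnd_; }

    // Byte offset and size of the single range that needs re-uploading.
    std::size_t dirtyOffsetBytes() const {
        return dirty() ? dirtyBegin_ * sizeof(InstanceData) : 0;
    }
    std::size_t dirtySizeBytes() const {
        return dirty() ? (dirtyEnd_ - dirtyBegin_) * sizeof(InstanceData) : 0;
    }

    // Call after the range has been uploaded.
    void markUploaded() { dirtyBegin_ = data_.size(); dirtyEnd_ = 0; }
private:
    std::vector<InstanceData> data_;
    std::size_t dirtyBegin_;
    std::size_t dirtyEnd_;
};
```

Tracking one contiguous range is a deliberate simplification: untouched elements between two changed ones get uploaded as well, which is usually cheaper than issuing many tiny updates but is worth measuring.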
The reality is that there are no simple rules to predict what is going to perform best. It will depend on the size of your data and draw calls, how frequent and large the data changes are, and also very much on the platform you run on. If you want to be confident that you're using the most efficient solution, you need to implement all of them, and start benchmarking.
Generally, option 1 (minimize number of draw calls) is the best advice. There are a couple of caveats:
I have seen performance fall off a cliff when using very large VBOs on at least one mobile device (assuming relevant for opengl-es tag). The explanation (from the vendor) involved internal buffers exceeding a certain size.
If putting all the information which would otherwise be conveyed with uniforms into vertex attributes significantly increases the size of the vertex buffer, the price you pay (in perhaps costly memory reads) of reading redundant information (because it doesn't really vary per vertex) might negate the savings of using fewer draw calls.
As always the best (but tiresome) advice is to test (I know this is particularly hard developing for mobile where there are many potential implementations your code could be running on). Try to keep your pipeline/toolchain flexible enough that you can easily try out and compare different options.
I'm designing the sorting part of my rendering engine. I know that changing the render target, shader program, texture bindings, and more are expensive and therefore one should sort the draw order based on them to reduce state changes. However, what about sorting based on what index buffer is bound, and which vertex buffers are used for attributes?
I'm confused about these because VAOs are mandatory and they encapsulate all of that state. So should I peek behind the scenes of vertex array objects (VAOs), see what state they set and sort based on it? Or should I just not care in what order VAOs are called?
This is what confuses me so much about vertex array objects. It makes sense to me to not be switching which buffers are in use over and over and yet VAOs just seem to force one to not care about that.
Is there a general vague or not agreed on order on which to sort stuff for rendering/game engines?
I know that binding a buffer simply changes some global state but surely it must be beneficial to the hardware to draw from the same buffer multiple times, maybe some small cache coherency?
While VAOs are mandated in GL 3.1 without GL_ARB_compatibility or core 3.2+, you do not have to use them the way they are intended... that is to say, you can bind a single VAO for the duration of your application's lifetime and continue to bind and unbind VBOs, etc. the traditional way if this somehow makes your life easier. Valve is famous for advocating doing this in their presentation on porting the Source engine from D3D to GL... I tend to disagree with them on some points though. A lot of things that they mention in their presentation make me cringe as someone who has years of experience with both D3D and OpenGL; they are making suggestions on how to port something to an API they have a minimal working knowledge of.
Getting back to your performance concern though, there can be validation overhead for changing bound resources frequently, so it is actually more than just "simply changing a global state." All GL commands have to do validation in order to determine if they need to set an error state. They will validate your input parameters (which is pretty trivial), as well as the state of any resource the command needs to use (this can be complicated).
Other types of GL objects like FBOs, textures and GLSL programs have more rigorous validation and more complicated memory dependencies than buffer objects and vertex arrays do. Swapping a vertex pointer should be cheaper in the grand scheme of things than most other kinds of object bindings, especially since a lot of stuff can be deferred by an implementation until you actually issue a glDrawElements (...) command.
Nevertheless, the best way to tackle this problem is just to increase reuse of vertex buffers. Object reuse is pretty high to begin with for vertex buffers; if you have 200 instances of the same opaque model in a scene, you can potentially draw all 200 of them back-to-back and never have to change a vertex pointer. Materials tend to change far more frequently than actual vertex buffers, and so you would generally sort your drawing first and foremost by material (sub-sorted by associated states like opaque/translucent, texture(s), shader(s), etc.). You can add another level to batch sorting to draw all batches that share the same vertex data after they have been sorted by material. The ultimate goal is usually to minimize the number of draw commands necessary to complete your frame, and using priority/hierarchy-based sorting with emphasis on material often delivers the best results.
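A common way to implement this kind of priority sorting is to pack the states into one integer key, with the most expensive state in the most significant bits. The field widths, the particular priority order and all names below are assumptions for illustration; translucent draws would normally also need depth ordering, which this sketch ignores.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawCall {
    std::uint8_t  translucent;     // 0 = opaque, 1 = translucent
    std::uint16_t shaderId;
    std::uint16_t textureId;
    std::uint16_t vertexBufferId;  // lowest priority: cheapest to change
};

// Most significant bits hold the state that is most expensive to change.
std::uint64_t sortKey(const DrawCall& d) {
    return (static_cast<std::uint64_t>(d.translucent) << 48) |
           (static_cast<std::uint64_t>(d.shaderId)    << 32) |
           (static_cast<std::uint64_t>(d.textureId)   << 16) |
            static_cast<std::uint64_t>(d.vertexBufferId);
}

void sortDraws(std::vector<DrawCall>& draws) {
    std::sort(draws.begin(), draws.end(), [](const DrawCall& a, const DrawCall& b) {
        return sortKey(a) < sortKey(b);
    });
}

// Counts how often the vertex buffer binding would change in submission order.
int countVertexBufferSwitches(const std::vector<DrawCall>& draws) {
    int switches = 0;
    for (std::size_t i = 1; i < draws.size(); ++i)
        if (draws[i].vertexBufferId != draws[i - 1].vertexBufferId) ++switches;
    return switches;
}
```

Vertex buffer changes are only reduced within groups that already share shader and texture, which matches the advice above: treat vertex state as a secondary sort, not the primary one.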
Furthermore, if you can fit multiple LODs of your model into a single vertex buffer, instead of swapping between different vertex buffers sometimes you can just draw different sets of indices or even just a different range of indices from a single index buffer. In a very similar way, texture swapping pressure can be alleviated by using packed texture atlases / sprite sheets instead of a single texture object for each texture.
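The LOD idea boils down to picking a different index range rather than a different buffer. A minimal sketch, assuming hypothetical LodLevel and IndexRange types, a non-empty list of levels sorted by increasing distance, and a plain distance threshold as the selection rule:

```cpp
#include <cstdint>
#include <vector>

struct IndexRange { std::uint32_t firstIndex; std::uint32_t indexCount; };

struct LodLevel {
    float maxDistance;   // use this level up to this distance
    IndexRange range;    // slice of the shared index buffer
};

// Levels must be non-empty and sorted by increasing maxDistance;
// the last level doubles as the fallback for anything farther away.
IndexRange selectLod(const std::vector<LodLevel>& levels, float distance) {
    for (const LodLevel& level : levels)
        if (distance <= level.maxDistance) return level.range;
    return levels.back().range;
}
```

The selected firstIndex and indexCount feed straight into the draw call, so switching LOD changes two numbers and no bindings.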
You can definitely squeeze out some performance by reducing the number of changes to vertex array state, but the takeaway message here is that vertex array state is pretty cheap compared to a lot of other states that change frequently. If you can quickly implement a secondary sort to reduce vertex state changes then go for it, but I would not invest a lot of time in anything more sophisticated unless you know it is a bottleneck. Prioritize texture, shader and framebuffer state first as a general rule.
I need to know how to render many different 3D models whose geometry changes every frame (they are animated); no models or textures are repeated.
I load all the models and create a model-class "object" for each one.
What is the most optimal way to render them?
Use one VBO for each 3D model
Use a single VBO for all models (since they are all different, I don't see how this option is possible)
I work with OpenGL 3.x or higher, C++ on Windows.
TL; DR - there's no silver bullet when it comes to rendering performance
Why is that? It depends on the complicated process that takes your data, converts it, pushes it to the GPU and then makes pixels on the screen flicker. So, instead of "one best way", a few guidelines have emerged that usually improve performance.
Keep all the necessary data on the GPU (because the closer to the screen, the shorter way electrons have to go :))
Send as little data to GPU between frames as possible
Don't sync needlessly between CPU and GPU (that's like trying to run two high speed trains on parallel tracks, but insisting on slowing them down to the point where you can pass something through the window every once in a while)
Now, it's obvious that if you want to have a model that will change, you can't have your cake and eat it. You have to make tradeoffs. Simply put, dynamic objects will never render as fast as static ones. So, what should you do?
Hint to the GPU about the data usage (GL_STREAM_DRAW or GL_DYNAMIC_DRAW) - the hint is advisory, but it lets the driver choose a suitable memory arrangement.
Don't use interleaved buffers to mix static vertex attributes with dynamic ones - if you divide the memory, you can batch-update the geometry leaving texture coordinates intact, for example.
Try to do as much as you can purely on the GPU - with compute shaders and transform feedback, it might well be possible to store whole animation data as a buffer itself and calculate it on GPU, avoiding expensive syncs.
And last but not least, always carefully measure the impact of your change on performance. Going blindly won't help. Measure accurately and thoroughly (even stuff like shader compilation time might matter sometimes!). Then, even if you go by trial-and-error, there's a hope you'll get somewhere.
And to address one of your points in particular: whether it's one large VBO or a few smaller ones doesn't really matter, but a huge one might have problems fitting in memory. You can still update parts of it, and what matters most is the memory arrangement inside of it.