Vertex Buffers - indexed or direct, interlaced or separate - opengl

What are some common guidelines in choosing vertex buffer type? When should we use interlaced buffers for vertex data, and when separate ones? When should we use an index array and when direct vertex data?
I'm searching for some common quidelines - I some cases where one or the opposite fits better, but not all cases are easily solvable. What should one have in mind choosing the vertex buffer format when aiming for performance?
Links to web resources on the topic are also welcome.

First of all, you can find some useful information on the OpenGL wiki. Second of all, if in doubt, profile, there are some rules-of-thumb about this one but experience can vary based on the data set, hardware, drivers, ... .
Indexed versus direct rendering
I would almost always by default use the indexed method for vertex buffers. The main reason for this is the so called post-transform cache. It's a cache kept after the vertex processing stage of your graphics pipeline. Essentially it means that if you use a vertex multiple times you have a good chance of hitting this cache and being able to skip the vertex computation. There is one condition to even hit this cache and that is that you need to use indexed buffers, it won't work without them as the index is a part of this cache's key.
Also, you likely will save storage, an index can be as small as you want (1 byte, 2 byte) and you can reuse a full vertex specification. Suppose that a vertex and all attributes total to about 30 bytes of data and you share this vertex over let's say 2 polygons. With indexed rendering (2 byte indices) this will cost you 2*index_size+attribute_size = 34 byte. With non-indexed rendering this will cost you 60 bytes. Often your vertices will be shared more than twice.
Is index-based rendering always better? No, there might be scenarios where it's worse. For very simple applications it might not be worth the code overhead to set up an index-based data model. Also, when your attributes are not shared over polygons (e.g. normal per-polygon instead of per-vertex) there is likely no vertex-sharing at all and IBO's won't give a benefit, only overhead.
Next to that, while it enables the post-transform cache, it does make generic memory cache performance worse. Because you access the attributes relatively random, you might have quite some more cache misses and memory prefetching (if this would be done on the GPU) won't work decently. So it might be (but measure) that if you have enough memory and your vertex shader is extremely simple that the non-indexed version outperforms the indexed version.
Interleaving vs non-interleaving vs buffer per-attribute
This story is a bit more subtle and I think it comes down to weighing some properties of your attributes.
Interleaved might be better because all attributes will be close together and likely be in a few memory cachelines (maybe even a single one). Obviously, this can mean better peformance. However, combined with indexed-based rendering your memory access is quite random anyway and the benefit might be smaller than you'd expect.
Know which attributes are static and which are dynamic. If you have 5 attributes of which 2 are completely static, 1 changes every 15 minutes and 2 every 10 seconds, consider putting them in 2 or 3 separate buffers. You don't want to re-upload all 5 attributes every time those 2 most frequent change.
Consider that attributes should be aligned on 4 bytes. So you might want to take interleaving even one step further from time to time. Suppose you have a vec3 1-byte attribute and some scalar 1-byte attribute, naively this will need 8 bytes. You might gain a lot by putting them together in a single vec4, which should reduce usage to 4 bytes.
Play with buffer size, a too large buffer or too many small buffers may impact performance. But this is likely very dependent on the hardware, driver and OpenGL implementation.

Indexed vs Direct
Let's see what you get by indexing. Every repeating vertex, that is, a vertex with "smooth" break will cost you less. Every singular "edge" vertex will cost you more. For data that's based on real world and is relatively dense, one vertex will belong to many triangles, and thus indexes will speed it up. For procedurally generated arbitrary data, direct mode will usually be better.
Indexed buffers also add additional complications to the code.
Interleaved vs Separate
The main difference here is actually based on a question "will I want to update only one component?". If the answer is yes, then you shouldn't interleave, because any update will be extremely costly. If it's no, using interleaved buffers should improve locality of reference and generally be faster on most of the hardware.

Related

Modern Modelling Formats that Support Vertex Buffers

Are there any modeling formats that directly support Vertex Buffer Objects?
Currently my game engine has been using Wavefront Models, but I have always been using them with immediate mode and display lists. This works, but I wanted to upgrade my entire system to modern OpenGL, including Shaders. I know that I can use immediate mode and display lists with Shaders, but like most aspiring developers, I want my game to be the best it can be. After asking the question linked above, I quickly came to the realization that Wavefront Models simply don't support Vertex Buffers; this is mainly due to the fact of how the model is indexed. In order for a Vertex Buffer Object to be used, Vertices, Texture Coordinates, and the Normal arrays all need to be equal in length.
I can achieve this by writing my own converter, which I have done. Essentially I unroll the indexing and create the associated arrays. I don't even need to exactly use glDrawElements then, I can just use glDrawArrays, which I'm perfectly fine doing. The only problem is that I am actually duplicating data; the arrays become massive(especially with large models), and this just seems wrong to me. Certainly there has to be a modern way of initializing a model into a Vertex Buffer without completely unrolling the indexing. So I have two questions.
1. Are their any modern model formats/concepts that support direct Vertex Buffer Objects?
2. Is this already an industry standard? Do most game engines unroll the indexing(and inflate the arrays also called unpacking) at runtime to create the game world assets?
The primary concern with storage formats is space efficiency. Reading from storage media you're limited by I/O bandwidth by large. So any CPU cycles you can invest to reduce the total amount of data to be read from storage will hugely benefit asset loading times. Just to give you the general idea. Even the fastest SSDs you can currently buy at the time of writing this won't get over 5GiB/s (believe me, I tried sourcing something that can saturate 8 lanes of PCIe-3 for my work). Your typical CPU memory bandwidth is at least one order of magnitude above that. GPUs have even more memory bandwidth. Even faster are lower level caches.
So what I'm trying to tell you: That index unrolling overhead? It's mostly an inconvenience for you, the developer, but probably shaves off some time from loading the assets.
(suggested edit): Of course storing numbers in their text representation is not going to help with space efficiency; depending on the choice of base a single digit represents between 3 to 5 bits (lets say 4 bits). That same text character however consumes 8 bits, so you have about 100% overhead there. The lowest hanging fruit this is storing in a binary format.
But why stop there? How about applying compression on the data? There are a number of compressed asset formats. But one particularly well developed one is OpenCTM, although it would make some sense to add one of the recently developed compression algorithms to it. I'm thinking of Zstandard here, which compresses data ridiculously well and at the same time is obscenely fast at decompression.

Why does OpenGL not support multiple index buffering?

Why does OpenGL not support multiple index buffers for vertex attributes (yet)?
To me it seems very useful, since you could reuse attributes and you would have a lot more control over the rendering of your geometry.
Is there a reason why all attribute arrays have to take the same index or could this feature be available in the near future?
OpenGL (and D3D. And Metal. And Mantle. And Vulkan) doesn't support this because hardware doesn't support this. Hardware doesn't support this because, for the vast majority of mesh data, this would not help. This is primarily useful for meshes that are predominantly not smooth (vertices sharing positions but not normals and so forth). And most meshes are smooth.
Furthermore, it will frequently be a memory-vs-performance tradeoff. Accessing your vertex data will likely be slower. The GPU has to fetch from two distinct locations in memory, compared to the case of a single interleaved fetch. And while caching helps, the cache coherency of multi-indexed accesses is much harder to control than for single-indexed accesses.
Hardware is unlikely to support this for that reason. But it also is unlikely to support it because you can do it yourself. Whether through buffer textures, image load/store or SSBOs, you can get your vertex data however you want nowadays. And since you can, there's really no reason for hardware makers to develop special hardware to help you.
Also, there are questions as to whether you'd really be making your vertex data smaller at all. In multi-indexed rendering, each vertex is defined by a set of indices. Well, each index takes up space. If you have more than 64K of attributes in a model (hardly an unreasonable number in many cases), then you'll need 4 bytes per index.
A normal can be provided in 4 bytes, using GL_INT_2_10_10_10_REV and normalization. A 2D texture coordinate can be stored in 4 bytes too, as a pair of shorts. Colors can be stored in 4 bytes. So unless multiple attributes share the same index (normals and texture coordinate edges happen at the same place, as might happen on a cube), you will actually make your data bigger by doing this in many cases.

Vertex batches (geometry groups) and maximum VBO (vertex buffer) size

I did a lot of researches concerning the way to gather vertex data into groups commonly called batches.
Here's for me the 2 main interesting articles on the subject:
https://www.opengl.org/wiki/Vertex_Specification_Best_Practices
http://www.gamedev.net/page/resources/_/technical/opengl/opengl-batch-rendering-r3900
The first article explains what are the best practices on how to manipulate VBOs (max size, format etc).
The second presents a simple example on how to manage vertex memory using batches. According to the author each batch HAS TO contains an instance of a VBO (plus a VAO) and he insists strongly on the fact that the maximimum size of a VBO is ranged between 1Mo (1000000 bytes) to 4Mo (4000000 bytes). The first article advice the same thing. I quote "1MB to 4MB is a nice size according to one nVidia document. The driver can do memory management more easily. It should be the same case for all other implementations as well like ATI/AMD, Intel, SiS."
I have several questions:
1) Does the maximum byte size mentionned above is an absolute rule ? Is it so bad to allocate VBO with a byte size more important than 4Mo (for example 10 Mo) ?
2) How can we do concerning meshes with a total vertex byte size larger than 4Mo? Do I need to split the geometry into several batches?
3) Does a batch inevitably store as attribute a unique VBO or several batches can be store in a single VBO ? (It's two different ways but the first one seems to be the right choice). Are you agree ?
According to the author each batch handle a unique VBO with a maximum size between 1 and 4 Mo and the whole VBO HAS TO contain only vertex data sharing the same material and transformation information). So if I have to batch an other mesh with a different material (so the vertices can't be merged with existing bathes) I have to create a new batch with a NEW vbo instanciated.
So according to the author my second method is not correct : it's not adviced to store several batches into a single VBO.
Does the maximum byte size mentionned above is an absolute rule ? Is it so bad to allocate VBO with a byte size more important than 4Mo (for example 10 Mo) ?
No.
That was a (very) old piece of info that is not necessarily valid on modern hardware.
The issue that led to the 4MB suggestion was about the driver being able to manage memory. If you allocated more memory than the GPU had, it would need to page some in and out. If you use smaller chunks for your buffer objects, the driver is more easily able to pick whole buffers to page out (because they're not in use at present).
However, this does not matter so much. The best thing you can do for performance is to avoid exceeding memory limits entirely. Paging things in and out hurts performance.
So don't worry about it. Note that I have removed this "advice" from the Wiki.
So according to the author my second method is not correct : it's not adviced to store several batches into a single VBO.
I think you're confusing the word "batch" with "mesh". But that's perfectly understandable; the author of that document you read doesn't seem to recognize the difference either.
For the purposes of this discussion, a "mesh" is a thing that is rendered with a single rendering command, which is conceptually separate from other things you would render. Meshes get rendered with certain state.
A "batch" refers to one or more meshes that could have been rendered with separate rendering commands. However, in order to improve performance, you use techniques to allow them all to be rendered with the same rendering command. That's all a batch is.
"Batching" is the process of taking a sequence of meshes and making it possible to render them as a batch. Instanced rendering is one form of batching; each instance is a separate "mesh", but you are rendering lots of them with one rendering call. They use their instance count to fetch their per-instance state data.
Batching takes many forms beyond instanced rendering. Batching often happens at the level of the artist. While the modeller/texture artist may want to split a character into separate pieces, each with their own textures and materials, the graphics programmer tells them to keep them as a single mesh that can be rendered with the same textures/materials.
With better hardware, the rules for batching can be reduced. With array textures, you can give each mesh a particular ID, which it uses to pick which array layer it uses when fetching textures. This allows the artists to give such characters more texture variety without breaking the batch into multiple rendering calls. Ubershaders are another form, where the shader uses that ID to decide how to do lighting rather than (or in addition to) texture fetching.
The kind of batching that the person you're citing is talking about is... well, very confused.
What do you think about that?
Well, quite frankly I think the person from your second link should be ignored. The very first line of his code: class Batch sealed is not valid C++. It's some C++/CX Microsoft invention, which is fine in that context. But he's trying to pass this off as pure C++; that's not fine.
I'm also not particularly impressed by his code quality. He contradicts himself a lot. For example, he talks about the importance of being able to allocate reasonable-sized chunks of memory, so that the driver can more freely move things around. But his GuiVertex class is horribly bloated. It uses a full 16 bytes, four floats, just for colors. 4 bytes (as normalized unsigned integers) would have been sufficient. Similarly, his texture coordinates are floats, when shorts (as unsigned normalized integers) would have been fine for his use case. That would cut his per-vertex cost down from 32 bytes to 10; that's more than a 3:1 reduction.
4MB goes a lot longer when you use reasonably sized vertex data. And the best part? The OpenGL Wiki page he linked to tells you to do exactly this. But he doesn't do it.
Not to mention, he has apparently written this batch manager for a GUI (as alluded to by his GuiVertex type). Yet GUIs are probably the least batch-friendly rendering scenario in game development. You're frequently having to change state like bound textures, the current program (which reads from the texture or not), blending modes, the scissor box, etc.
Now with modern GPUs, there are certainly ways to make GUI renderers a lot more batch-friendly. But he never talks about them. He doesn't mention techniques to use gl_ClipDistance as a way to do scissor boxes with per-vertex data. He doesn't talk about ubershader usage, nor does his vertex format provide an ID that would allow such a thing.
As previously stated, batching is all about not having state changes between objects. But he focuses entirely on vertex state. He doesn't talk about textures, programs, etc. He doesn't talk about techniques to allow multiple objects to be part of the same batch while still having separate transforms.
His class can't really be used for batching of anything that couldn't have just been a single mesh.

Cache Friendly Vertex Definition

I am writing an opengl application and for vertices, normals, and colors, I am using separate buffers as follows:
GLuint vertex_buffer, normal_buffer, color_buffer;
My supervisor tells me that if I define an struct like:
struct vertex {
glm::vec3 pos;
glm::vec3 normal;
glm::vec3 color;
};
GLuint vertex_buffer;
and then define a buffer of these vertices, my application will gets so much faster because when the position is cached the normals and colors will be in cache line.
What I think is that defining such struct is not having that much affect on the performance because defining the vertex like the struct will cause less vertices in the cacheline while defining them as separate buffers, will cause to have 3 different cache lines for positions, normals and colors in the cache. So, nothing has been changed. Is that true?
First of all, using separate buffers for different vertex attributes may not be a good technique.
Very important factor here is GPU architecture. Most (especially modern) GPUs have multiple cache lines (data for Input Assembler stage, uniforms, textures), but fetching input attributes from multiple VBOs can be inefficient anyway (always profile!). Defining them in interleaved format can help improve performance:
And that's what you would get, if you used such struct.
However, that's not always true (again, always profile!) - although interleaved data is more GPU-friendly, it needs to be properly aligned and can take significantly more space in memory.
But, in general:
Interleaved data formats:
Cause less GPU cache pressure, because the vertex coordinate and attributes of a single vertex aren't scattered all over in memory.
They fit consecutively into few cache lines, whereas scattered
attributes could cause more cache updates and therefore evictions. The
worst case scenario could be one (attribute) element per cache line at
a time because of distant memory locations, while vertices get pulled
in a non-deterministic/non-contiguous manner, where possibly no
prediction and prefetching kicks in. GPUs are very similar to CPUs in
this matter.
Are also very useful for various external formats, which satisfy the deprecated interleaved formats, where datasets of compatible data
sources can be read straight into mapped GPU memory. I ended up
re-implementing these interleaved formats with the current API for
exactly those reasons.
Should be layouted alignment friendly just like simple arrays. Mixing various data types with different size/alignment requirements
may need padding to be GPU and CPU friendly. This is the only downside
I know of, appart from the more difficult implementation.
Do not prevent you from pointing to single attrib arrays in them for sharing.
Source
Further reads:
Best Practices for Working with Vertex Data
Vertex Specification Best Practices
Depends on the GPU architecture.
Most GPUs will have multiple cache lines (some for uniforms, others for vertex attributes, others for texture sampling)
Also when the vertex shader is nearly done the GPU can pre-fetch the next set of attributes into the cache. So that by the time the vertex shader is done the next attributes are right there ready to be loaded into the registers.
tl;dr don't bother with these "rule of thumbs" unless you actually profile it or know the actual architecture of the GPU.
Tell your supervisor "premature optimization is the root of all evil" – Donald E. Knuth. But don't forget the next sentence "but that doesn't mean we shouldn't optimize hot spots".
So did you actually profile the differences?
Anyway, the layout of your vertex data is not critical for caching efficiency on modern GPUs. It used to be on old GPUs (ca. 2000), which is why there were functions for interleaving vertex data. But these days it's pretty much a non-issue.
That has to do with the way modern GPUs access memory and in fact modern GPUs' cache lines are not index by memory address, but by access pattern (i.e. the first distinct memory access in a shader gets the first cache line, the second one the second cache line, and so on).

glUniform vs. single draw call performance

Suppose I have many meshes I'd like to render. I have two choices:
Bake transforms and colors for each mesh into a VBO and render with a single draw call.
Use glUniform for transforms and colors and use many draw calls (but still a single VBO)
Assuming the scene changes very little between frames, which method tends to be better?
There are more than those two choices. At least one more comes to mind:
...
....
Use attributes for transforms and colors and use many draw calls.
Choice 3 is similar to choice 2, but setting attributes (using calls like glVertexAttrib4f) is mostly faster than setting uniforms. The efficiency of setting uniforms is highly platform dependent. But they're generally not intended to be modified very frequently. They are called uniform for a reason. :)
That being said, choice 1 might be the best for your use case where the transforms/colors change rarely. If you're not doing this yet, you could try keeping the attributes that are modified in a separate VBO (with usage GL_DYNAMIC_DRAW), and the attributes that remain constant in their own VBO (with usage GL_STATIC_DRAW). Then make the necessary updates to the dynamic buffer with glBufferSubData.
The reality is that there are no simple rules to predict what is going to perform best. It will depend on the size of your data and draw calls, how frequent and large the data changes are, and also very much on the platform you run on. If you want to be confident that you're using the most efficient solution, you need to implement all of them, and start benchmarking.
Generally, option 1 (minimize number of draw calls) is the best advice. There are a couple of caveats:
I have seen performance fall off a cliff when using very large VBOs on at least one mobile device (assuming relevant for opengl-es tag). The explanation (from the vendor) involved internal buffers exceeding a certain size.
If putting all the information which would otherwise be conveyed with uniforms into vertex attributes significantly increases the size of the vertex buffer, the price you pay (in perhaps costly memory reads) of reading redundant information (because it doesn't really vary per vertex) might negate the savings of using fewer draw calls.
As always the best (but tiresome) advice is to test (I know this is particularly hard developing for mobile where there are many potential implementations your code could be running on). Try to keep your pipeline/toolchain flexible enough that you can easily try out and compare different options.