Are there any modeling formats that directly support Vertex Buffer Objects?
Currently my game engine has been using Wavefront models, but I have always been using them with immediate mode and display lists. This works, but I wanted to upgrade my entire system to modern OpenGL, including shaders. I know that I can use immediate mode and display lists with shaders, but like most aspiring developers, I want my game to be the best it can be. After asking the question linked above, I quickly came to the realization that Wavefront models simply don't support vertex buffers; this is mainly due to how the model is indexed. For a Vertex Buffer Object to be used, the vertex, texture-coordinate, and normal arrays all need to be equal in length.
I can achieve this by writing my own converter, which I have done. Essentially I unroll the indexing and create the associated arrays. I don't even need to use glDrawElements then; I can just use glDrawArrays, which I'm perfectly fine doing. The only problem is that I am actually duplicating data; the arrays become massive (especially with large models), and this just seems wrong to me. Certainly there has to be a modern way of initializing a model into a vertex buffer without completely unrolling the indexing. So I have two questions.
1. Are there any modern model formats/concepts that directly support Vertex Buffer Objects?
2. Is this already an industry standard? Do most game engines unroll the indexing (inflating the arrays, also called unpacking) at runtime to create the game world assets?
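For concreteness, the unrolling described in the question can be sketched like this. The `Corner`/`Face` types and array names are made up for illustration; a real OBJ loader would fill them in while parsing the file:

```cpp
#include <array>
#include <cassert>
#include <vector>

// One corner of a triangle, OBJ-style: separate indices into the
// position, texcoord, and normal pools (0-based here for simplicity).
struct Corner { int v, vt, vn; };
using Face = std::array<Corner, 3>;

struct FlatMesh {
    std::vector<float> positions; // 3 floats per output vertex
    std::vector<float> texcoords; // 2 floats per output vertex
    std::vector<float> normals;   // 3 floats per output vertex
};

// Duplicate attribute data so all three arrays line up 1:1,
// ready for glDrawArrays (no index buffer needed).
FlatMesh unroll(const std::vector<float>& v,   // packed xyz
                const std::vector<float>& vt,  // packed uv
                const std::vector<float>& vn,  // packed xyz
                const std::vector<Face>& faces)
{
    FlatMesh out;
    for (const Face& f : faces) {
        for (const Corner& c : f) {
            out.positions.insert(out.positions.end(),
                                 v.begin() + 3 * c.v, v.begin() + 3 * c.v + 3);
            out.texcoords.insert(out.texcoords.end(),
                                 vt.begin() + 2 * c.vt, vt.begin() + 2 * c.vt + 2);
            out.normals.insert(out.normals.end(),
                               vn.begin() + 3 * c.vn, vn.begin() + 3 * c.vn + 3);
        }
    }
    return out;
}
```

Note how a normal referenced by all three corners gets copied three times — that is exactly the duplication complained about above.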
The primary concern with storage formats is space efficiency. Reading from storage media, you're limited by and large by I/O bandwidth. So any CPU cycles you can invest to reduce the total amount of data read from storage will hugely benefit asset loading times. Just to give you the general idea: even the fastest SSDs you can buy at the time of writing won't get over 5 GiB/s (believe me, I tried sourcing something that can saturate 8 lanes of PCIe 3 for my work). Your typical CPU memory bandwidth is at least one order of magnitude above that. GPUs have even more memory bandwidth, and lower-level caches are faster still.
So what I'm trying to tell you is: that index-unrolling overhead is mostly an inconvenience for you, the developer. Keeping the compact indexed format on disk and unrolling after loading probably even shaves some time off loading the assets.
Of course, storing numbers in their text representation is not going to help with space efficiency; depending on the choice of base, a single digit represents between 3 and 5 bits (let's say 4 bits). That same text character, however, consumes 8 bits, so you have about 100% overhead there. The lowest-hanging fruit here is storing the data in a binary format.
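A quick way to see the overhead (the sample value is arbitrary): a float printed as decimal text routinely takes more than twice the space of its fixed 4-byte binary encoding:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to store a float as decimal text vs. as raw binary.
std::size_t text_size(float f) {
    return std::to_string(f).size(); // "%f"-style, 6 decimal places
}

std::size_t binary_size() {
    return sizeof(float); // always 4 bytes
}
```

For example, `text_size(-0.739154f)` is 9 bytes against a constant 4 for binary, before you even consider parsing cost.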
But why stop there? How about applying compression to the data? There are a number of compressed asset formats, but one particularly well-developed one is OpenCTM, although it would make some sense to add one of the recently developed compression algorithms to it. I'm thinking of Zstandard here, which compresses data ridiculously well and at the same time is obscenely fast at decompression.
Related
I'm drawing simple 3D shapes and I was wondering in the long run is it better to only use 1 buffer to store all the data of your vertices?
Right now I have arrays of vertex data (positions and colors, per vertex) and I am pushing them to their own separate buffers.
But if I use stride and offset, I could join them into one array but that would become messier and harder to manage.
What is the "traditional" way of doing this?
It feels much cleaner and organized to have separate buffers for each piece of data, but I would imagine it's less efficient.
Is the efficiency increase worth putting it all into a single buffer?
The answer to this is highly usage-dependent.
If all your vertex attributes are highly volatile or highly static, you would probably benefit from interleaving and keeping them all together, as mentioned in the comments.
However, separating the data can yield better performance if one attribute is far more volatile than others. For example, if you have a mesh where you're often changing the vertex positions, but never the texture coordinates, you might benefit from keeping them separate: you would only need to re-upload the positions to the video card, instead of the whole set of attributes. An example of this might be a CPU-driven cloth simulation.
It is also hardware and implementation dependent. Interleaving isn't helpful everywhere, but I've never heard of it having a negative impact. If you can use it, you probably should.
However, since you can't properly interleave if you split the attributes, you're essentially comparing the performance impacts of two unknowns. Will interleaving help on your target hardware/drivers? Will your data benefit from being split? The first there's no real answer to. The second is between you and your data.
Personally, I would suggest just using interleaved single blocks of vertex attributes unless you have a highly specialized need. It cuts the complexity, as opposed to needing to have potentially different systems mixed together in the same back end.
On the other hand, setting up interleaving is a rather complex task as far as memory addressing goes in C++. If you're not designing an entire graphics engine from scratch, I really doubt it's worth the effort for you. But again, that's up to you and your application.
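For what it's worth, the usual C++ approach is a plain struct with the stride taken from sizeof and the attribute offsets from offsetof, which keeps the memory addressing manageable. The attribute layout here is just an example, not a prescribed format:

```cpp
#include <cassert>
#include <cstddef>

// One interleaved vertex: all attributes packed together per vertex.
struct Vertex {
    float position[3]; // byte offset 0
    float normal[3];   // byte offset 12
    float uv[2];       // byte offset 24
};

// The stride you would hand to glVertexAttribPointer for every attribute.
constexpr std::size_t kStride = sizeof(Vertex); // 32 bytes, no padding here

// Compile-time layout checks; these are the offsets each attribute uses.
static_assert(offsetof(Vertex, position) == 0,  "position offset");
static_assert(offsetof(Vertex, normal)   == 12, "normal offset");
static_assert(offsetof(Vertex, uv)       == 24, "uv offset");
```

Each attribute pointer is then set with the same stride and its own offsetof value, so adding or reordering fields only touches the struct definition.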
In theory, though, merely grouping together the data you were going to upload to the video card regardless should have little impact. It might be slightly more efficient to group all the attributes together due to reducing the number of calls, but that's again going to be highly driver-dependent.
Unfortunately, I think the simple answer to your question boils down to: "it depends" and "no one really knows".
I did a lot of research on how to gather vertex data into groups, commonly called batches.
Here are the two articles on the subject that I found most interesting:
https://www.opengl.org/wiki/Vertex_Specification_Best_Practices
http://www.gamedev.net/page/resources/_/technical/opengl/opengl-batch-rendering-r3900
The first article explains the best practices for manipulating VBOs (maximum size, format, etc.).
The second presents a simple example of how to manage vertex memory using batches. According to the author, each batch HAS TO contain an instance of a VBO (plus a VAO), and he insists strongly that the maximum size of a VBO should range between 1 MB (1,000,000 bytes) and 4 MB (4,000,000 bytes). The first article advises the same thing. I quote: "1MB to 4MB is a nice size according to one nVidia document. The driver can do memory management more easily. It should be the same case for all other implementations as well like ATI/AMD, Intel, SiS."
I have several questions:
1) Is the maximum byte size mentioned above an absolute rule? Is it really bad to allocate a VBO larger than 4 MB (for example, 10 MB)?
2) What should we do about meshes whose total vertex data is larger than 4 MB? Do I need to split the geometry into several batches?
3) Does each batch necessarily own a unique VBO, or can several batches be stored in a single VBO? (These are two different approaches, but the first one seems to be the right choice.) Do you agree?
According to the author, each batch handles a unique VBO with a maximum size between 1 and 4 MB, and the whole VBO HAS TO contain only vertex data sharing the same material and transformation information. So if I have to batch another mesh with a different material (so its vertices can't be merged with existing batches), I have to create a new batch with a NEW VBO instantiated.
So according to the author, my second method is not correct: it is not advisable to store several batches in a single VBO.
Is the maximum byte size mentioned above an absolute rule? Is it really bad to allocate a VBO larger than 4 MB (for example, 10 MB)?
No.
That was a (very) old piece of info that is not necessarily valid on modern hardware.
The issue that led to the 4MB suggestion was about the driver being able to manage memory. If you allocated more memory than the GPU had, it would need to page some in and out. If you use smaller chunks for your buffer objects, the driver is more easily able to pick whole buffers to page out (because they're not in use at present).
However, this does not matter so much. The best thing you can do for performance is to avoid exceeding memory limits entirely. Paging things in and out hurts performance.
So don't worry about it. Note that I have removed this "advice" from the Wiki.
So according to the author, my second method is not correct: it is not advisable to store several batches in a single VBO.
I think you're confusing the word "batch" with "mesh". But that's perfectly understandable; the author of that document you read doesn't seem to recognize the difference either.
For the purposes of this discussion, a "mesh" is a thing that is rendered with a single rendering command, which is conceptually separate from other things you would render. Meshes get rendered with certain state.
A "batch" refers to one or more meshes that could have been rendered with separate rendering commands. However, in order to improve performance, you use techniques to allow them all to be rendered with the same rendering command. That's all a batch is.
"Batching" is the process of taking a sequence of meshes and making it possible to render them as a batch. Instanced rendering is one form of batching; each instance is a separate "mesh", but you are rendering lots of them with one rendering call. They use their instance count to fetch their per-instance state data.
Batching takes many forms beyond instanced rendering. Batching often happens at the level of the artist. While the modeller/texture artist may want to split a character into separate pieces, each with their own textures and materials, the graphics programmer tells them to keep them as a single mesh that can be rendered with the same textures/materials.
With better hardware, the rules for batching can be reduced. With array textures, you can give each mesh a particular ID, which it uses to pick which array layer it uses when fetching textures. This allows the artists to give such characters more texture variety without breaking the batch into multiple rendering calls. Ubershaders are another form, where the shader uses that ID to decide how to do lighting rather than (or in addition to) texture fetching.
The kind of batching that the person you're citing is talking about is... well, very confused.
What do you think about that?
Well, quite frankly I think the person from your second link should be ignored. The very first line of his code: class Batch sealed is not valid C++. It's some C++/CX Microsoft invention, which is fine in that context. But he's trying to pass this off as pure C++; that's not fine.
I'm also not particularly impressed by his code quality. He contradicts himself a lot. For example, he talks about the importance of being able to allocate reasonable-sized chunks of memory, so that the driver can more freely move things around. But his GuiVertex class is horribly bloated. It uses a full 16 bytes, four floats, just for colors. 4 bytes (as normalized unsigned integers) would have been sufficient. Similarly, his texture coordinates are floats, when shorts (as unsigned normalized integers) would have been fine for his use case. That would cut his per-vertex cost down from 32 bytes to 10; that's more than a 3:1 reduction.
4MB goes a lot longer when you use reasonably sized vertex data. And the best part? The OpenGL Wiki page he linked to tells you to do exactly this. But he doesn't do it.
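To make the size comparison concrete, here is one hypothetical packing along those lines. These field choices are my guess for illustration, not the cited author's actual code, and the exact savings depend on what you do with the position:

```cpp
#include <cassert>
#include <cstdint>

// Bloated: everything stored as 32-bit floats.
struct FatGuiVertex {
    float position[2]; //  8 bytes
    float uv[2];       //  8 bytes
    float color[4];    // 16 bytes -- full float precision wasted on color
};                     // 32 bytes total

// Packed: normalized integers where float precision buys nothing.
// Declared with glVertexAttribPointer's normalized flag set to GL_TRUE,
// the GPU converts these back to [0,1] floats during vertex fetch.
struct PackedGuiVertex {
    float         position[2]; // 8 bytes (keep precision here)
    std::uint16_t uv[2];       // 4 bytes, GL_UNSIGNED_SHORT normalized
    std::uint8_t  color[4];    // 4 bytes, GL_UNSIGNED_BYTE normalized
};                             // 16 bytes total

static_assert(sizeof(FatGuiVertex)    == 32, "bloated layout");
static_assert(sizeof(PackedGuiVertex) == 16, "packed layout");
```

Even this conservative packing halves the per-vertex cost; shrinking the position as well gets you close to the 10 bytes mentioned above.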
Not to mention, he has apparently written this batch manager for a GUI (as alluded to by his GuiVertex type). Yet GUIs are probably the least batch-friendly rendering scenario in game development. You're frequently having to change state like bound textures, the current program (which reads from the texture or not), blending modes, the scissor box, etc.
Now with modern GPUs, there are certainly ways to make GUI renderers a lot more batch-friendly. But he never talks about them. He doesn't mention techniques to use gl_ClipDistance as a way to do scissor boxes with per-vertex data. He doesn't talk about ubershader usage, nor does his vertex format provide an ID that would allow such a thing.
As previously stated, batching is all about not having state changes between objects. But he focuses entirely on vertex state. He doesn't talk about textures, programs, etc. He doesn't talk about techniques to allow multiple objects to be part of the same batch while still having separate transforms.
His class can't really be used for batching of anything that couldn't have just been a single mesh.
Doing some maintenance on an old project and was asked by the client to see if it was possible to improve performance. I've done the parts I know and can easily test but then I tested
glColorPointer(4,GL_UNSIGNED_BYTE,...,...)
vs
glColorPointer(4,GL_FLOAT,...,...)
I could see literally no difference on the handful of machines I could test it on. Obviously that means it's not a bottleneck, but since this is the first time I've had access to both color formats, it's also the first time I can wonder whether there's a speed difference between the two.
I expect that internally OpenGL adapters use float colors, so it would be preferable to use float when available, but does anyone have a more definitive answer than that?
edit: the client has a few dozen machines that are about 10 years old, and the project runs on those machines, if that makes a difference
There's really no generally valid answer. You did the right thing by testing.
At least on desktop GPUs, it's fairly safe to assume that they will internally operate with 32-bit floats. On mobile GPUs, lower precision formats are more common, and you have some control over it using precision qualifiers in the shader code.
Assuming that 32-bit floats are used internally, there are two competing considerations:
If you specify the colors in a different format, like GL_UNSIGNED_BYTE, a conversion is needed while fetching the vertex data.
If you specify the colors in a more compact format, the vertex data uses less memory. This also has the effect that less memory bandwidth is consumed for fetching the data, with fewer cache misses, and potentially less cache pollution.
Which of these is more relevant really depends on the exact hardware, and the overall workload. The format conversion for item 1 can potentially be almost free if the hardware supports the byte format as part of fixed function vertex fetching hardware. Otherwise, it can add a little overhead.
Saving memory bandwidth is always a good thing. So by default, I would think that using the most compact representation is more likely to be beneficial. But testing and measuring is the only conclusive way to decide.
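The conversion in question is just the normalization applied to a GL_UNSIGNED_BYTE attribute declared as normalized; in plain C++ terms it amounts to:

```cpp
#include <cassert>
#include <cstdint>

// What vertex fetch does with a normalized GL_UNSIGNED_BYTE color
// channel: map 0..255 onto 0.0..1.0.
float normalize_ubyte(std::uint8_t c) {
    return c / 255.0f;
}

// The inverse you would use when packing float colors for upload,
// with clamping and round-to-nearest.
std::uint8_t pack_unorm(float f) {
    if (f < 0.0f) f = 0.0f;
    if (f > 1.0f) f = 1.0f;
    return static_cast<std::uint8_t>(f * 255.0f + 0.5f);
}
```

On hardware with native support for normalized byte attributes this mapping happens inside the vertex fetch unit, which is why it can be essentially free.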
In reality, it's fairly rare that fetching vertex data is a major bottleneck in the pipeline. It does happen, but it's just not very common. So it's not surprising that you couldn't measure a difference.
For example, in a lot of use cases, texture data is much bigger overall than vertex data. If that is the case, the bandwidth consumed by texture sampling is often much more significant than that used by vertex fetching. Also, related to this, there are usually many more fragments than vertices, so anything related to fragment processing is much more performance-critical than vertex processing.
On top of this, many applications make too many OpenGL API calls, or use the API in inefficient ways, and end up being limited by CPU overhead, particularly on very high performance GPUs. If you're optimizing performance for an existing app, that is pretty much the first thing you should check: Find out if you're CPU or GPU limited.
I need to know how to render many different 3D models whose geometry changes every frame (animated models), with no repeated models or textures.
I load all the models, and for each one I create a model class "object".
What is the optimal way to render them?
To use 1 VBO for each 3D model
To use a single VBO for all models (since they are all different, I don't see how this option is possible)
I work with OpenGL 3.x or higher, C++ on Windows.
TL;DR - there's no silver bullet when it comes to rendering performance
Why is that? It depends on the complicated process that takes your data, converts it, pushes it to the GPU, and then makes pixels on the screen flicker. So, instead of "one best way", a few guidelines have emerged that usually improve performance.
Keep all the necessary data on the GPU (because the closer to the screen, the shorter the way the electrons have to travel :))
Send as little data to GPU between frames as possible
Don't sync needlessly between CPU and GPU (that's like trying to run two high speed trains on parallel tracks, but insisting on slowing them down to the point where you can pass something through the window every once in a while),
Now, it's obvious that if you want to have a model that will change, you can't have your cake and eat it too. You have to make tradeoffs. Simply put, dynamic objects will never render as fast as static ones. So, what should you do?
Hint the GPU about the data usage (GL_STREAM_DRAW or GL_DYNAMIC_DRAW) - these hints help the driver choose a suitable memory arrangement.
Don't use interleaved buffers to mix static vertex attributes with dynamic ones - if you divide the memory, you can batch-update the geometry leaving texture coordinates intact, for example.
Try to do as much as you can purely on the GPU - with compute shaders and transform feedback, it might well be possible to store whole animation data as a buffer itself and calculate it on GPU, avoiding expensive syncs.
And last but not least, always carefully measure the impact of your change on performance. Going in blindly won't help. Measure accurately and thoroughly (even things like shader compilation time might matter sometimes!). Then, even if you go by trial and error, there's hope you'll get somewhere.
And to address one of your points in particular; whether it's one large VBO and a few smaller ones doesn't really matter, but a huge one might have problems in fitting in memory. You can still update parts of it, and what matters most is the memory arrangement inside of it.
What are some common guidelines in choosing vertex buffer type? When should we use interlaced buffers for vertex data, and when separate ones? When should we use an index array and when direct vertex data?
I'm searching for some common guidelines - I know of some cases where one or the other fits better, but not all cases are easily decided. What should one keep in mind when choosing the vertex buffer format while aiming for performance?
Links to web resources on the topic are also welcome.
First of all, you can find some useful information on the OpenGL wiki. Second of all, if in doubt, profile; there are some rules of thumb about this, but mileage can vary based on the data set, hardware, drivers, and so on.
Indexed versus direct rendering
I would almost always use the indexed method for vertex buffers by default. The main reason for this is the so-called post-transform cache. It's a cache kept after the vertex processing stage of your graphics pipeline. Essentially it means that if you use a vertex multiple times, you have a good chance of hitting this cache and being able to skip the vertex computation. There is one condition for hitting this cache at all: you need to use indexed buffers. It won't work without them, as the index is part of the cache's key.
Also, you will likely save storage: an index can be as small as you want (1 byte, 2 bytes), and you can reuse a full vertex specification. Suppose that a vertex and all its attributes total about 30 bytes of data, and you share this vertex over, let's say, 2 polygons. With indexed rendering (2-byte indices) this will cost you 2*index_size + attribute_size = 34 bytes. With non-indexed rendering it will cost you 60 bytes. Often your vertices will be shared more than twice.
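The arithmetic above generalizes straightforwardly; this little sketch (vertex and index sizes are parameters, the 30-byte vertex is just the example from the text) compares both storage costs:

```cpp
#include <cassert>
#include <cstddef>

// Storage cost when `tris` triangles reference `verts` unique
// vertices of `vertex_bytes` each through an index buffer.
std::size_t indexed_bytes(std::size_t verts, std::size_t tris,
                          std::size_t vertex_bytes, std::size_t index_bytes) {
    return verts * vertex_bytes + tris * 3 * index_bytes;
}

// Storage cost when every triangle corner stores a full vertex.
std::size_t unindexed_bytes(std::size_t tris, std::size_t vertex_bytes) {
    return tris * 3 * vertex_bytes;
}
```

For a quad drawn as 2 triangles over 4 shared 30-byte vertices with 2-byte indices, indexed storage comes to 132 bytes versus 180 bytes unindexed, and the gap widens as sharing increases.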
Is indexed rendering always better? No, there might be scenarios where it's worse. For very simple applications it might not be worth the code overhead to set up an index-based data model. Also, when your attributes are not shared across polygons (e.g., a normal per polygon instead of per vertex), there is likely no vertex sharing at all, and index buffers won't give any benefit, only overhead.
Next to that, while it enables the post-transform cache, it makes generic memory cache performance worse. Because you access the attributes relatively randomly, you might have quite a few more cache misses, and memory prefetching (if the GPU does this) won't work as well. So it might be (but measure!) that if you have enough memory and your vertex shader is extremely simple, the non-indexed version outperforms the indexed version.
Interleaving vs non-interleaving vs buffer per-attribute
This story is a bit more subtle and I think it comes down to weighing some properties of your attributes.
Interleaved might be better because all attributes of a vertex will be close together and likely sit in a few memory cache lines (maybe even a single one). Obviously, this can mean better performance. However, combined with indexed rendering, your memory access is quite random anyway, and the benefit might be smaller than you'd expect.
Know which attributes are static and which are dynamic. If you have 5 attributes of which 2 are completely static, 1 changes every 15 minutes and 2 every 10 seconds, consider putting them in 2 or 3 separate buffers. You don't want to re-upload all 5 attributes every time those 2 most frequent change.
Consider that attributes should be aligned on 4 bytes, so you might want to take interleaving even one step further from time to time. Suppose you have a three-component attribute of 1-byte values and some scalar 1-byte attribute; naively these will need 8 bytes. You might gain a lot by putting them together in a single vec4, which should reduce usage to 4 bytes.
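The alignment effect can be modeled directly in C++; here the explicit padding fields stand in for the 4-byte per-attribute alignment described above (a hypothetical illustration, since C alone would not pad 1-byte fields this way):

```cpp
#include <cassert>
#include <cstdint>

// Naive: two separate attributes, each padded to a 4-byte boundary
// as per the alignment rule -> 8 bytes per vertex.
struct Naive {
    std::uint8_t vec[3];  // the 3-component 1-byte attribute
    std::uint8_t pad;     // padding to the 4-byte boundary
    std::uint8_t scalar;  // the scalar 1-byte attribute
    std::uint8_t pad2[3]; // padding to the next 4-byte boundary
};

// Packed: smuggle the scalar into the vector's fourth component and
// upload the whole thing as one 4-byte vec4 attribute.
struct Packed {
    std::uint8_t vec_and_scalar[4]; // xyz = vector, w = scalar
};

static_assert(sizeof(Naive)  == 8, "two padded attributes");
static_assert(sizeof(Packed) == 4, "one vec4 attribute");
```

The shader then reads the scalar out of the vec4's w component, at the cost of a slightly less obvious vertex format.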
Play with buffer sizes: a buffer that is too large, or too many small buffers, may impact performance. But this is likely very dependent on the hardware, driver, and OpenGL implementation.
Indexed vs Direct
Let's see what you get by indexing. Every repeated vertex, that is, a vertex shared across a "smooth" part of the surface, will cost you less. Every singular "edge" vertex will cost you more. For data that's based on the real world and is relatively dense, one vertex will belong to many triangles, and thus indices will speed things up. For procedurally generated arbitrary data, direct mode will usually be better.
Indexed buffers also add additional complications to the code.
Interleaved vs Separate
The main difference here actually comes down to the question "will I want to update only one component?". If the answer is yes, then you shouldn't interleave, because any update will be extremely costly. If not, using interleaved buffers should improve locality of reference and generally be faster on most hardware.