This is my first question, so if you have any suggestions on how to improve it, feel free to tell me :)
So my problem is this: I have an object that changes each frame, and this results in a varying number of vertices that I plan to send through my pipeline. Now, I obviously can't change the size of my vertex buffer on the fly, so what is the best way to approach this?
Here are some of the ideas I had:
Make a vertex buffer of size n, and simply pass the vertices in bunches of n. Downsides: some vertices go through the VS multiple times, so a small decrease in performance (probably not noticeable), having to call Map/Unmap several times per frame, which could be problematic.
Make a huge vertex buffer that will hold enough vertices that my object will never exceed the size of the buffer. Downsides: since I don't know in advance how many vertices my object will have, it might be hard to predict the size of this buffer. In addition, the buffer might take up too much space as a result.
Each frame, create a new vertex buffer of the correct size, and call IASetVertexBuffers. Downsides: This will probably have a huge performance cost...
I'd appreciate any ideas you guys have, or suggestions on which one of these three to use! :)
1/ This is not ideal: as you said, you need to Map/Unmap several times and issue draw calls in between. You trade performance for lower memory usage, and on a modern card (with gigabytes of memory) memory is unlikely to be the issue (as mentioned in the comments).
2/ The opposite of 1: you need a single Map/Unmap and a single draw (you can also specify the vertex count in DeviceContext->Draw so that you only draw the relevant part of your buffer). This will be your best choice performance-wise, and it should not be too hard to define some maximum (even 1 million polygons is not that much memory, and you'll have a hard time getting your CPU to feed that amount of data every frame).
3/ I don't see any real benefit for your use case (recreating resources is common when you load immutable resources asynchronously, so it doesn't really apply here).
So go for 2. If memory becomes an issue one day, it's quite easy to move back to 1, but I doubt this will ever happen.
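To make the relationship between options 1 and 2 concrete, here is a minimal sketch, assuming the object is drawn as a triangle list. `Batch` and `planBatches` are hypothetical names, not part of Direct3D. Each returned batch would correspond to one Map/Unmap followed by one Draw with that vertex count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: one entry per Map/Unmap + Draw pair.
struct Batch {
    std::size_t firstVertex;  // offset into this frame's CPU-side vertex array
    std::size_t vertexCount;  // value to pass as the vertex count of the draw
};

// Splits `vertexCount` vertices of a triangle list into batches that fit a
// vertex buffer holding `capacity` vertices. If everything fits (option 2),
// exactly one batch comes back; otherwise this degrades to option 1.
std::vector<Batch> planBatches(std::size_t vertexCount, std::size_t capacity) {
    assert(capacity >= 3 && "buffer must hold at least one triangle");
    // Keep whole triangles together so no triangle straddles two batches.
    const std::size_t perBatch = capacity - capacity % 3;
    std::vector<Batch> batches;
    for (std::size_t first = 0; first < vertexCount; first += perBatch) {
        batches.push_back({first, std::min(perBatch, vertexCount - first)});
    }
    return batches;
}
```

With a generous capacity the loop runs once per frame, which is exactly option 2; the multi-batch path only kicks in if the estimate turns out too small.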
Let's imagine that I have two squares. First I generate the VAO and VBO, then bind them and so on... My goal is to check for a collision between the two objects every frame. For that I have to know the exact vertices on both the CPU and the GPU side, so I store every single vertex twice. If I work with a large amount of data, this mirroring doesn't seem efficient, not to mention the logistics of keeping the data consistent. Is there a better way to do this? Or is it totally OK to keep the vertices in an array after the glBufferData call?
There needs to be more information on your part. Is this something you plan on instancing? Are you sure there's a bottleneck on your bandwidth?
If this is something that you can simulate on the GPU, then just do it all on the GPU so you keep the memory on that side and not incur the transfer penalty from CPU to GPU.
If you need it to be on the CPU side for your collision detection, then you have a few options:
Update only the ones that change. If all of them are changing you should ignore this option, but otherwise you could map the buffer, update it, and flush only the ranges you actually modified.
Send a displacement. If you end up having a ton of data, you may be able to get away with just sending a rotation and central position to cut down on updating "every vertex", and might be able to exploit a Geometry Shader... however I've read that these can be problematic for performance so you should consider it but be ready to profile.
You can possibly stream data if you have to update all of them, see this wonderful resource.
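For the "update only what changed" option, a rough sketch of the CPU-side bookkeeping might look like this. `Range` and `DirtyRanges` are made-up names for illustration. Each merged range would then be uploaded with one glBufferSubData call (or flushed with glFlushMappedBufferRange if the buffer is mapped with explicit flushing).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of modified vertices. Hypothetical type.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Collects modified ranges during a frame and merges them before upload.
class DirtyRanges {
public:
    void mark(std::size_t begin, std::size_t end) {
        assert(begin < end);
        ranges_.push_back({begin, end});
    }

    // Returns sorted, non-overlapping ranges (touching or overlapping ranges
    // are merged) and resets the tracker for the next frame.
    std::vector<Range> flush() {
        std::sort(ranges_.begin(), ranges_.end(),
                  [](const Range& a, const Range& b) { return a.begin < b.begin; });
        std::vector<Range> merged;
        for (const Range& r : ranges_) {
            if (!merged.empty() && r.begin <= merged.back().end) {
                merged.back().end = std::max(merged.back().end, r.end);
            } else {
                merged.push_back(r);
            }
        }
        ranges_.clear();
        return merged;
    }

private:
    std::vector<Range> ranges_;
};
```

Whether many small uploads beat one big upload depends on the driver, so this is something to profile rather than assume.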
You need to define your problem domain a bit more because I'm not sure exactly what the bounds on the problem are. The above are some ways of tackling these problems, but the best solution can only be given to you if you are more specific with what you want when you talk about the large case.
You also have to understand that asking for massive data manipulation and fast transfer tend to be topics that fight each other, and you'll have to be smarter about what you plan on doing depending on exactly how much data you are talking about here.
I'd like to answer with something more concrete but I'm just shooting in the dark because I don't know exactly what the limit of your data is and what hardware you're working with.
I need to know how I can render many different 3D models that change their geometry every frame (they are animated models) and do not share models or textures.
I load all the models and create an object of a model class for each one.
What is the most optimal way to render them?
To use 1 VBO for each 3D model
To use a single VBO for all models (since they are all different, I don't see how this option is possible)
I work with OpenGL 3.x or higher, C++ on Windows.
TL; DR - there's no silver bullet when it comes to rendering performance
Why is that? It depends on the complicated process that gets your data, converts it, pushes it to the GPU and then makes pixels on the screen flicker. So, instead of "one best way", a few guidelines have emerged that usually improve performance.
Keep all the necessary data on the GPU (because the closer to the screen, the shorter way electrons have to go :))
Send as little data to GPU between frames as possible
Don't sync needlessly between CPU and GPU (that's like trying to run two high speed trains on parallel tracks, but insisting on slowing them down to the point where you can pass something through the window every once in a while),
Now, it's obvious that if you want to have a model that changes, you can't have your cake and eat it too. You have to make tradeoffs. Simply put, dynamic objects will never render as fast as static ones. So, what should you do?
Hint the GPU about the data usage (GL_STREAM_DRAW or GL_DYNAMIC_DRAW). It is only a hint, but it lets the driver choose a suitable memory arrangement.
Don't use interleaved buffers to mix static vertex attributes with dynamic ones - if you divide the memory, you can batch-update the geometry leaving texture coordinates intact, for example.
Try to do as much as you can purely on the GPU - with compute shaders and transform feedback, it might well be possible to store whole animation data as a buffer itself and calculate it on GPU, avoiding expensive syncs.
And last but not least, always carefully measure the impact of your change on performance. Going blindly won't help. Measure accurately and thoroughly (even stuff like shader compilation time might matter sometimes!). Then, even if you go by trial-and-error, there's a hope you'll get somewhere.
And to address one of your points in particular: whether it's one large VBO or a few smaller ones doesn't really matter, but a huge one might have problems fitting in memory. You can still update parts of it, and what matters most is the memory arrangement inside of it.
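As a small illustration of the advice about not interleaving static and dynamic attributes, here is a sketch that compares the per-frame upload size of the two layouts. The struct layouts and the name `uploadBytes` are assumptions made up for this example (animated positions and normals, static texture coordinates).

```cpp
#include <cassert>
#include <cstddef>

// Everything in one interleaved buffer: an animation update rewrites it all.
struct InterleavedVertex {
    float position[3];
    float normal[3];
    float uv[2];
};

// Split layout: only DynamicPart lives in the buffer that changes per frame.
struct DynamicPart {
    float position[3];
    float normal[3];
};
struct StaticPart {
    float uv[2];
};

static_assert(sizeof(InterleavedVertex) == 32, "8 floats, no padding expected");
static_assert(sizeof(DynamicPart) == 24, "6 floats, no padding expected");

// Bytes that have to cross the bus each frame for `vertexCount` vertices
// when the buffer being rewritten has the given stride.
std::size_t uploadBytes(std::size_t vertexCount, std::size_t strideBytes) {
    assert(strideBytes > 0);
    return vertexCount * strideBytes;
}
```

For 1000 vertices that is 32000 bytes per frame interleaved versus 24000 bytes with the split layout; the gap widens as more static attributes (colors, extra UV sets) are added.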
This question comes in two (mostly) independent parts
My current setup is that I have a lot of Objects in gamespace. Each has a VBO assigned to it, which holds the vertex attribute data for each vertex. If the Object wants to change its vertex data (position etc.), it does so in an internal array and then calls glBufferSubDataARB to update the copy on the GPU.
Now I understand that this is a horrible thing to do and so I am looking for alternatives. One that presents itself is to have some managing thing that has a large VBO in the beginning and Objects can request space from it, and edit points in it. This drops the overhead of loading VBOs but comes with a large energy/time expenditure in creating and debugging such a beast (basically an entire memory management system).
My question (part (a)) is if this is the "best" method for doing this, or if there is something better that I have not thought of.
Such a system should allow easy addition/removal of vertices and editing them, as fast as possible.
Part (b) is about some simple actions taken on every object, ie those of rotation and translation. At the moment I am moving each vertex (ouch), but this must have a better option. I am considering uploading rotation and translation matrices to my shader to do there. This seems fine, but I am slightly worried about the overhead of changing uniform variables. Would it ultimately be to my advantage to do this? How fast is changing uniform variables?
Last time I checked the preferred way to do buffer updates was orphaning.
Basically, whenever you want to update your buffered data, you call glBufferData on your buffer with a NULL data pointer (and the same size and usage), which invalidates the current content of the buffer, and then you write your new data with glMapBuffer / glBufferSubData. Binding the buffer alone does not invalidate anything.
Using a single big VBO for your static data is indeed a good idea. You must take care of the maximum allowed VBO size, and split your static data into multiple VBOs if necessary. But this is probably an over-optimization in most cases (i.e. "I wouldn't bother").
Data which is updated frequently should be grouped in the same VBO (with usage = GL_STREAM_DRAW), and you shall use orphaning to update that.
Unfortunately, the actual performance of this stuff varies on different implementations. This guy made some tests on an actual game, it may be worth reading.
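One common variation on orphaning is to append updates into the stream buffer and only orphan when it is full. The following is a sketch of just the offset bookkeeping, with no GL calls; `StreamSlot` and `StreamCursor` are hypothetical names. When `orphanFirst` is true, the caller would re-specify the storage (glBufferData with the full capacity, a null pointer and GL_STREAM_DRAW) before writing at the returned offset.

```cpp
#include <cassert>
#include <cstddef>

// Where to write the next chunk, and whether to orphan the buffer first.
struct StreamSlot {
    std::size_t offset;  // byte offset to write at
    bool orphanFirst;    // true: re-specify the buffer storage before writing
};

// Tracks the write position inside a fixed-size streaming buffer.
class StreamCursor {
public:
    explicit StreamCursor(std::size_t capacityBytes) : capacity_(capacityBytes) {}

    StreamSlot reserve(std::size_t bytes) {
        assert(bytes > 0 && bytes <= capacity_);
        if (head_ + bytes > capacity_) {
            // Out of room: start over at the beginning of a fresh store.
            head_ = bytes;
            return {0, true};
        }
        StreamSlot slot{head_, false};
        head_ += bytes;
        return slot;
    }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;
};
```

The simpler approach of orphaning on every update is this same scheme with a capacity equal to one update; which one is faster is, again, implementation dependent.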
For the second part of your question, obviously using uniforms is the way to do it. Yes there is some (little) overhead, but it's sure 1000 times better than streaming all your data at every frame.
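To show what "use uniforms" amounts to on the CPU side, here is a minimal sketch that builds one model matrix (rotation about the Z axis plus a translation, chosen only to keep the example short). The names `Mat4`, `makeModelMatrix` and `transformPoint` are assumptions for illustration. The 16 floats would be uploaded once per object, e.g. with glUniformMatrix4fv(location, 1, GL_FALSE, m.data()), and the vertex shader would do the multiplication that `transformPoint` mimics here.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// 4x4 matrix in column-major order, the layout OpenGL expects when the
// transpose argument of glUniformMatrix4fv is GL_FALSE.
using Mat4 = std::array<float, 16>;

// Rotation by `angleRadians` about the Z axis followed by a translation.
Mat4 makeModelMatrix(float angleRadians, float tx, float ty, float tz) {
    assert(std::isfinite(angleRadians));
    const float c = std::cos(angleRadians);
    const float s = std::sin(angleRadians);
    return Mat4{c,    s,    0.0f, 0.0f,   // column 0
                -s,   c,    0.0f, 0.0f,   // column 1
                0.0f, 0.0f, 1.0f, 0.0f,   // column 2
                tx,   ty,   tz,   1.0f};  // column 3
}

// CPU reference for what the vertex shader computes: m * vec4(p, 1).
std::array<float, 3> transformPoint(const Mat4& m, const std::array<float, 3>& p) {
    return {m[0] * p[0] + m[4] * p[1] + m[8] * p[2] + m[12],
            m[1] * p[0] + m[5] * p[1] + m[9] * p[2] + m[13],
            m[2] * p[0] + m[6] * p[1] + m[10] * p[2] + m[14]};
}
```

That is 64 bytes per object per frame, independent of how many vertices the object has, and the vertex data in the VBO never has to be touched for a rigid motion.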
What are some common guidelines in choosing vertex buffer type? When should we use interlaced buffers for vertex data, and when separate ones? When should we use an index array and when direct vertex data?
I'm searching for some common guidelines. I know some cases where one or the other fits better, but not all cases are easy to decide. What should one keep in mind when choosing the vertex buffer format and aiming for performance?
Links to web resources on the topic are also welcome.
First of all, you can find some useful information on the OpenGL wiki. Second of all, if in doubt, profile, there are some rules-of-thumb about this one but experience can vary based on the data set, hardware, drivers, ... .
Indexed versus direct rendering
I would almost always by default use the indexed method for vertex buffers. The main reason for this is the so called post-transform cache. It's a cache kept after the vertex processing stage of your graphics pipeline. Essentially it means that if you use a vertex multiple times you have a good chance of hitting this cache and being able to skip the vertex computation. There is one condition to even hit this cache and that is that you need to use indexed buffers, it won't work without them as the index is a part of this cache's key.
Also, you likely will save storage, an index can be as small as you want (1 byte, 2 byte) and you can reuse a full vertex specification. Suppose that a vertex and all attributes total to about 30 bytes of data and you share this vertex over let's say 2 polygons. With indexed rendering (2 byte indices) this will cost you 2*index_size+attribute_size = 34 byte. With non-indexed rendering this will cost you 60 bytes. Often your vertices will be shared more than twice.
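The arithmetic above generalizes easily. As a quick sketch (the function names are made up for this example):

```cpp
#include <cassert>
#include <cstddef>

// Storage for one vertex that is referenced `uses` times without indices:
// the full vertex is repeated every time.
std::size_t directBytes(std::size_t vertexBytes, std::size_t uses) {
    assert(uses > 0);
    return vertexBytes * uses;
}

// Storage for the same vertex with indexing: one copy of the vertex plus
// one index per use.
std::size_t indexedBytes(std::size_t vertexBytes, std::size_t indexBytes,
                         std::size_t uses) {
    assert(uses > 0);
    return vertexBytes + indexBytes * uses;
}
```

With the numbers from the text, `indexedBytes(30, 2, 2)` gives 34 and `directBytes(30, 2)` gives 60. Note that for a vertex used only once, indexing costs 32 bytes against 30, which is the overhead case described in the next paragraph.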
Is index-based rendering always better? No, there might be scenarios where it's worse. For very simple applications it might not be worth the code overhead to set up an index-based data model. Also, when your attributes are not shared over polygons (e.g. normal per-polygon instead of per-vertex) there is likely no vertex-sharing at all and IBO's won't give a benefit, only overhead.
Next to that, while it enables the post-transform cache, it does make generic memory cache performance worse. Because you access the attributes relatively random, you might have quite some more cache misses and memory prefetching (if this would be done on the GPU) won't work decently. So it might be (but measure) that if you have enough memory and your vertex shader is extremely simple that the non-indexed version outperforms the indexed version.
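If your data starts out as a plain triangle list, turning it into an indexed mesh is mostly a de-duplication pass. Below is a minimal sketch, assuming position-only vertices, exact floating-point equality (no NaNs, no welding of nearly equal vertices) and 16-bit indices. `Vertex`, `IndexedMesh` and `buildIndexed` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct Vertex {
    float x, y, z;
    // Ordering so Vertex can be used as a std::map key (exact comparison).
    bool operator<(const Vertex& o) const {
        return std::tie(x, y, z) < std::tie(o.x, o.y, o.z);
    }
};

struct IndexedMesh {
    std::vector<Vertex> vertices;        // each unique vertex stored once
    std::vector<std::uint16_t> indices;  // three indices per triangle
};

// Converts a non-indexed triangle list into unique vertices plus indices.
IndexedMesh buildIndexed(const std::vector<Vertex>& triangleList) {
    assert(triangleList.size() % 3 == 0);
    IndexedMesh mesh;
    std::map<Vertex, std::uint16_t> seen;
    for (const Vertex& v : triangleList) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            assert(mesh.vertices.size() < 65536 && "16-bit indices exhausted");
            it = seen.emplace(v, static_cast<std::uint16_t>(mesh.vertices.size())).first;
            mesh.vertices.push_back(v);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

For a quad made of two triangles this stores 4 vertices and 6 indices instead of 6 full vertices. With per-polygon normals in the vertex, the two shared corners would no longer compare equal and nothing would be saved, which matches the caveat above.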
Interleaving vs non-interleaving vs buffer per-attribute
This story is a bit more subtle and I think it comes down to weighing some properties of your attributes.
Interleaved might be better because all attributes will be close together and will likely sit in a few memory cache lines (maybe even a single one). Obviously, this can mean better performance. However, combined with index-based rendering your memory access is quite random anyway and the benefit might be smaller than you'd expect.
Know which attributes are static and which are dynamic. If you have 5 attributes of which 2 are completely static, 1 changes every 15 minutes and 2 every 10 seconds, consider putting them in 2 or 3 separate buffers. You don't want to re-upload all 5 attributes every time those 2 most frequent change.
Consider that attributes should be aligned on 4 bytes. So you might want to take interleaving even one step further from time to time. Suppose you have a vec3 1-byte attribute and some scalar 1-byte attribute, naively this will need 8 bytes. You might gain a lot by putting them together in a single vec4, which should reduce usage to 4 bytes.
Play with buffer size, a too large buffer or too many small buffers may impact performance. But this is likely very dependent on the hardware, driver and OpenGL implementation.
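The alignment point can be checked with plain structs. This is only a sketch: `alignas(4)` stands in for the 4-byte alignment each separate attribute would get, and the attribute names (a 3-component byte color and a byte material id) are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Two separate attributes, each padded to a 4-byte boundary.
struct NaiveAttributes {
    alignas(4) std::uint8_t color[3];  // 3 bytes used, 1 byte of padding
    alignas(4) std::uint8_t material;  // 1 byte used, 3 bytes of padding
};

// The same data packed into one 4-component byte attribute.
struct PackedAttributes {
    std::uint8_t colorAndMaterial[4];  // rgb in [0..2], material in [3]
};

static_assert(sizeof(NaiveAttributes) == 8, "two padded 4-byte slots");
static_assert(sizeof(PackedAttributes) == 4, "one full 4-byte slot");

// Total buffer size for `vertexCount` vertices with the given stride.
std::size_t bufferBytes(std::size_t vertexCount, std::size_t strideBytes) {
    assert(strideBytes > 0);
    return vertexCount * strideBytes;
}
```

The packed version halves the storage for these two attributes; the shader then reads the material id from the fourth component.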
Indexed vs Direct
Let's see what you get by indexing. Every repeating vertex, that is, a vertex with "smooth" break will cost you less. Every singular "edge" vertex will cost you more. For data that's based on real world and is relatively dense, one vertex will belong to many triangles, and thus indexes will speed it up. For procedurally generated arbitrary data, direct mode will usually be better.
Indexed buffers also add additional complications to the code.
Interleaved vs Separate
The main difference here is actually based on a question "will I want to update only one component?". If the answer is yes, then you shouldn't interleave, because any update will be extremely costly. If it's no, using interleaved buffers should improve locality of reference and generally be faster on most of the hardware.
The title says everything, but just to be clear I'll add some extra words.
In this case, resize means:
getting more storage space at the end of the old vbo
saving the old data at the front
(hopefully not copying, but at least not on CPU side, meaning the driver should handle this)
As to explain some more details and justify my question:
I will store data whose size is unknown in advance in the VBO. I only know an upper limit, and it is a very rough estimate (10 - 100x too large, or even more in unusual conditions).
Of course I know how much data I stored, when I am done with it, so it would be nice to store data until I find my VBO too small and resize it and then go on storing.
Here is why I don't want to copy(especially not on CPU side):
I am doing all this on the GPU to get interactive frame rates. When I have to copy, it is very slow or even impossible, because there is not enough space. Worst of all is copying the data via the CPU: passing everything over the bus into a new memory region of sufficient size, and then calling glBufferData on the VBO with the new size and the new memory region as source. That would be the performance killer.
I circumvented the problem with an exact estimation of the needed space. But I will let this question be unanswered for a week to see if someone has another hint on this as I am not very happy with the solution.
I think without doing a copy you won't get around this, because the only way to resize a buffer is to call glBufferData and there is IMO no way to tell the driver to keep the old data.
What you probably can do to at least not copy it to the CPU and back again, is creating some kind of auxiliary VBO for these purposes and copy directly from the VBO into the auxiliary VBO (using the ARB_copy_buffer extension), resize the VBO and copy its contents back.
But I think the best way is just to allocate a larger buffer beforehand, so the resize is not necessary. Of course, in this case you need to know approximately how much extra storage you need.
Revisiting this question after some years, the landscape has changed a bit with newer versions and extensions.
GPU Side Copying
The extension mentioned in Christian Rau's answer is core since 3.1 which allows us to copy contents (via glCopyBufferSubData) from one VBO to another. Hopefully, the driver does this on the GPU side!
Using this function we could create a larger buffer and copy the leading data over. This has the disadvantage of doubling the memory requirements because we need both buffers.
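A sketch of the sizing logic for such a grow-and-copy step is below. The doubling policy and the names `GrowthPlan` and `planGrowth` are my own assumptions, not something the extension prescribes. The GL side would roughly be: allocate the new buffer with glBufferData and a null pointer, bind the old buffer to GL_COPY_READ_BUFFER and the new one to GL_COPY_WRITE_BUFFER, call glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, bytesToCopy), then delete the old buffer.

```cpp
#include <cassert>
#include <cstddef>

struct GrowthPlan {
    std::size_t newCapacity;  // size of the replacement buffer in bytes
    std::size_t bytesToCopy;  // leading bytes to carry over from the old buffer
    std::size_t peakBytes;    // memory in use while both buffers are alive
};

// Doubles the capacity until `required` bytes fit (assumed growth policy).
GrowthPlan planGrowth(std::size_t capacity, std::size_t used, std::size_t required) {
    assert(capacity > 0 && used <= capacity && required > capacity);
    std::size_t newCapacity = capacity;
    while (newCapacity < required) {
        newCapacity *= 2;
    }
    return {newCapacity, used, capacity + newCapacity};
}
```

For example, growing a 1024-byte buffer holding 1000 bytes so that 3000 bytes fit yields a 4096-byte buffer and a transient peak of 5120 bytes, which is the "both buffers at once" cost mentioned above.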
True resizing
The good news is: With sparse buffers an even better solution is on the horizon.
Given this extension we can allocate a virtual buffer with more than enough space for our data without ever paying for the unneeded space. We only have to "commit" the regions of memory we physically want to store data in. This means we can "grow" the VBO by committing new pages at the end of it.
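Commitment happens in whole pages, so "growing" comes down to rounding the needed size up to a page boundary and committing the difference. Here is a sketch of that calculation; `CommitRange` and `pagesToCommit` are hypothetical names, and the 65536-byte page size used in the usage note is just an example value. In a real program the page size would be queried via GL_SPARSE_BUFFER_PAGE_SIZE_ARB and the returned range passed to glBufferPageCommitmentARB(target, offset, size, GL_TRUE).

```cpp
#include <cassert>
#include <cstddef>

// Byte range to commit; both fields are multiples of the page size.
struct CommitRange {
    std::size_t offset;
    std::size_t size;  // 0 means nothing new has to be committed
};

// Given how many bytes are already committed at the front of the sparse
// buffer, returns the extra range needed so that the first `neededBytes`
// bytes are backed by physical memory.
CommitRange pagesToCommit(std::size_t committedBytes, std::size_t neededBytes,
                          std::size_t pageSize) {
    assert(pageSize > 0 && committedBytes % pageSize == 0);
    // Round the requirement up to the next page boundary.
    const std::size_t target = (neededBytes + pageSize - 1) / pageSize * pageSize;
    if (target <= committedBytes) {
        return {committedBytes, 0};
    }
    return {committedBytes, target - committedBytes};
}
```

With a 65536-byte page, needing 70000 bytes when one page is already committed asks for exactly one more page starting at offset 65536.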
The bad news is: As of the current OpenGL version (4.5) this is still an extension and not yet core, so adopting this might not be possible. You should also note that there are some details in the specification that are not yet worked out. For example, mapping of sparse buffers is disallowed in the current extension but support might be added in future versions.
I would be keen to hear about the availability of this extension if you have any data on that.
Assuming you have support for a recent OpenGL standard, an alternative to VBOs might be to store your data in textures ( again, assuming you have enough memory on your card ). Copying data between old and new textures would take place on the card and not affect the data transfer.
Exactly how you achieve this depends on exactly what your code is doing. But in principle, you use texture data to overwrite dummy vertex data in your drawing calls, or maybe use instancing. It would require a lot of thought and rework.