Imagine a typical game where objects in the simulated world are created and destroyed. When these objects are created, their vertex data is stored in a VBO. This VBO is rendered once per frame.
Is there a best practice for dealing with dead objects? I.e. when the object is destroyed and thus no longer needs to be rendered, what should happen to its corresponding VBO data?
It seems like you'd want to "free" that memory up for future use by other objects. Otherwise, your VBO would eventually be filled almost entirely with dead data.
I have one possible idea for implementing this: a map of VBO memory wherein individual bytes are marked as free or in use. (This map would live on the CPU as a normal array, not on the GPU.) When an object is created, we buffer its data to a free region as determined by the map. We mark that region as used on the map. Then when the object is destroyed, we mark that same region as free. I'm thinking you would store the map either as an array of booleans if you're lazy, or pack it in as one map bit per VBO byte if you want to do it right.
So far, does this sound like the best approach? Is there a more common approach that I'm not seeing?
I know a lot of these questions hinge on the characteristics of the scene you're rendering, so here's the context. My scene consists of several hundred objects. Each object has about eight vertices. Each vertex has a position and texture coordinate stored as floats. So, we're looking at approximately:
4 bytes per float * 6 floats per vert * 8 verts per object * 500 objects
= 96,000 bytes of vertex data
Sounds like you're thinking of using a pool allocator. There's a lot of existing work done on those, which should apply quite well to allocations inside a VBO also.
It will be pretty straightforward if all elements are the same size. Otherwise, you need to be concerned about fragmentation, but heap managers are quite well known.
The simplest improvement I would offer is to start your scan for a free slot from the last slot filled, instead of always from the beginning.
You can trade space for speed by using a deque-style data structure to store a list of free locations, which eliminates the need to scan for a free spot.
The size of the data stored in the VBO really has no impact on the manager. Only the number of slots which can be invididually repurposed.
Related
OpenGL has functions that can create and multiple objects at once
glCreateBuffers(1,&handle);
...
glDeleteBuffers(1,&handle);
I guess the intention is that it saves time to create/destroy all objects at once. But is that only at initialization, or does it affect the memory layout such as increased locality, resulting in shorter render time.
Before this question is marked as duplicate, of Is there an advantage to generating multiple indexes (names) at once in OpenGL?, this question is about object creation with ARB_direct_state_access, rather than name creation, which RetoKoradi comments is cheap.
The VRAM is not a disk, it's a Random Access Memory. OpenGL doesn't need to scan it through to reach something on it. There's usually a memory table stored somewhere, which keeps track of the locations of the buffers and vertex arrays. There's also a pointer that points to the current VAO, VBO, IBO, texture, framebuffer, etc.
When you use one of the glBindXYZ functions, you pass a handle to OpenGL. It then looks up the pointer to the bufffer/vertex array/texture/etc from the memory table, and sets the relevant pointer to the result.
The gen functions work like this, because most people know how many vbos or vaos they'll need, and this makes creating them easier. It's only a problem, when you have a variable amount of vbos.
Creating multiple buffers at once doesn't have an advantage over creating them one-by-one right after each other in terms of memory layout. Both
glCreateBuffers(1, &handle1);
// Stuff
glCreateBuffers(1, &handle2);
// Stuff
glCreateBuffers(1, &handle3);
// Stuff
and
glCreateBuffers(3, &handles);
// Stuff x 3
result in the same. In both cases the vbos are as close as possible (if they are already used up memory blocks where the next vbo should go, then it will be a bit further away from the others.
Are there any benefits of having separate vertex buffers for static and dynamic objects in a DirectX 11 application? My approach is to have the vertices of all objects in a scene stored in the same vertex buffer.
However, I will only have to re-map a small number of objects (1 to 5) of the whole collection (up to 200 objects). The majority of objects are static and will not be transformed in any way. What is the best approach for doing this?
Mapping a big vertex buffer with discard forces the driver to allocate new memory every frame. Up to ~4 frames can be in flight, so there can be 4 copies of that buffer. This can lead to memory overcommitment and stuttering. For example, ATI advises to discard vertex buffers up to 4 mb max (GCN Performance Tweets). Besides, every time you will have to needlessly copy static data to a new vertex buffer.
Mapping with no overwrite should work better. It would require to manually manage the memory, so you won't overwrite the data which is in flight. I'm not sure about the performance implications, but for sure this isn't a recommended path.
Best approach would be to simplify driver's work by providing as many hints as possible. Create static vertex buffers with immutable flag, long lived with default flag and dynamic with dynamic flag. See vendor guides like GCN Performance Tweets or Don’t Throw it all Away: Efficient Buffer Management for additional details.
I have a need to stream a texture (essentially a camera feed).
With object streaming, the following scenarios seem to be arise:
Is the new object's data store larger, smaller or same size as the old one?
Subset of or whole texture being updated?
Are we streaming a buffer object or texture object (any difference?)
Here are the following approaches I have come across:
Allocate object data store (either BufferData for buffers or TexImage2D for textures) and then each frame, update subset of data with BufferSubData or TexSubImage2D
Nullify/invalidate the object after the last call (eg. draw) that uses the object either with:
Nullify: glTexSubImage2D( ..., NULL), glBufferSubData( ..., NULL)
Invalidate: glBufferInvalidate(), glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT, glDeleteTextures ?
Simpliy reinvoke BufferData or TexImage2D with the new data
Manually implement object multi-buffering / buffer ping-ponging.
Most immediately, my problem scenario is: entire texture being replaced with new one of same size. How do I implement this? Will (1) implicitly synchronize ? Does (2) avoid the synchronization? Will (3) synchronize or will a new data store for the object be allocated, where our update can be uploaded without waiting for all drawing using the old object state to finish? This passage from the Red Book V4.3 makes be believe so:
Data can also be copied between buffer objects using the
glCopyBufferSubData() function. Rather than assembling chunks of data
in one large buffer object using glBufferSubData(), it is possible to
upload the data into separate buffers using glBufferData() and then
copy from those buffers into the larger buffer using
glCopyBufferSubData(). Depending on the OpenGL implementation, it may
be able to overlap these copies because each time you call
glBufferData() on a buffer object, it invalidates whatever contents
may have been there before. Therefore, OpenGL can sometimes just
allocate a whole new data store for your data, even though a copy
operation from the previous store has not completed yet. It will then
release the old storage at a later opportunity.
But if so, why the need for (2)[nullify/invalidates]?
Also, please discuss the above approaches, and others, and their effectiveness for the various scenarios, while keeping in mind atleast the following issues:
Whether implicit synchronization to object (ie. synchronizing our update with OpenGL's usage) occurs
Memory usage
Speed
I've read http://www.opengl.org/wiki/Buffer_Object_Streaming but it doesn't offer conclusive information.
Let me try to answer at least a few of the questions you raised.
The scenarios you talk about can have a great impact on the performance on the different approaches, especially when considering the first point about the dynamic size of the buffer. In your scenario of video streaming, the size will rarely change, so a more expensive "re-configuration" of the data structures you use might be possible. If the size changes every frame or every few frames, this is typically not feasable. However, if a resonable maximum size limit can be enforced, just using buffers/textures with the maximum size might be a good strategy. Neither with buffers nor with textures you have to use all the space there is (although there are some smaller issues when you do this with texures, like wrap modes).
3.Are we streaming a buffer object or texture object (any difference?)
Well, the only way to efficiently stream image data to or from the GL is to use pixel buffer objects (PBOs). So you always have to deal with buffer objects in the first place, no matter if vertex data, image data or whatever data is to be tranfered. The buffer is just the source for some glTex*Image() call in the texture case, and of course you'll need a texture object for that.
Let's come to your approaches:
In approach (1), you use the "Sub" variant of the update commands. In that case, (parts of or the whole) storage of the existing object is updated. This is likely to trigger an implicit synchronziation ifold data is still in use. The GL has basically only two options: wait for all operations (potentially) depending on that data to complete, or make an intermediate copy of the new data and let the client go on. Both options are not good from a performance point of view.
In approach (2), you have some misconception. The "Sub" variants of the update commands will never invalidate/orphan your buffers. The "non-sub" glBufferData() will create a completely new storage for the object, and using it with NULL as data pointer will leave that storage unintialized. Internally, the GL implementation might re-use some memory which was in use for earlier buffer storage. So if you do this scheme, there is some probablity that you effectively end up using a ring-buffer of the same memory areas if you always use the same buffer size.
The other methods for invalidation you mentiond allow you to also invalidate parts of the buffer and also a more fine-grained control of what is happening.
Approach (3) is basically the same as (2) with the glBufferData() oprhaning, but you just specify the new data directly at this stage.
Approach (4) is the one I actually would recommend, as it is the one which gives the application the most control over what is happening, without having to relies on the GL implementation's specific internal workings.
Without taking synchronization into account, the "sub" variant of the update commands is
more efficient, even if the whole data storage is to be changed, not just some part. That is because the "non-sub" variants of the commands basically recreate the storage and introduce some overhead with this. With manually managing the ring buffers, you can avoid any of that overhead, and you don't have to rely in the GL to be clever, by just using the "sub" variants of the updates functions. At the same time, you can avoid implicit synchroniztion by only updating buffers which aren't in use by th GL any more. This scheme can also nicely be extenden into a multi-threaded scenario. You can have one (or several) extra threads with separate (but shared) GL contexts to fill the buffers for you, and just passing the buffer handlings to the draw thread as soon as the update is complete. You can also just map the buffers in the draw thread and let the be filled by worker threads (wihtout the need for additional GL contexts at all).
OpenGL 4.4 introduced GL_ARB_buffer_storage and with it came the GL_MAP_PERSISTEN_BIT for glMapBufferRange. That will allow you to keep all of the buffers mapped while they are used by the GL - so it allows you to avoid the overhead of mapping the buffers into the address space again and again. You then will have no implicit synchronzation at all - but you have to synchronize the operations manually. OpenGL's synchronization objects (see GL_ARB_sync) might help you with that, but the main burden on synchronization is on your applications logic itself. When streaming videos to the GL, just avoid re-using the buffer which was the source for the glTexSubImage() call immediately and try to delay its re-use as long as possible. You are of course also trading throughput for latency. If you need to minimize latency, you might to have to tweak this logic a bit.
Comparing the approaches for "memory usage" is really hard. There are a lot of of implementation specific details to consider here. A GL implementation might keep some old buffer memories around for some time to fullfill recreation requests of the same size. Also, an GL implementation might make shadow copies of any data at any time. The approaches which don't orphan and recreate storages all the time in principle expose more control of the memory which is in use.
"Speed" itself is also not a very useful metric. You basically have to balance throughput and latency here, according to the requirements of your application.
I have a couple questions about how OpenGL handles these drawing operations.
So lets say I pass OpenGL the pointer to my vertex array. Then I can call glDrawElements with an array of indexes. It will draw the requested shapes using those indexes in the vertex array correct?
After that glDrawElements call could I then do another glDawElements call with another set of indexes? Would it then draw the new index array using the original vertex array?
Does OpenGL keep my vertex data around for the next frame when I redo all of these calls? So the the next vertex pointer call would be a lot quicker?
Assuming the answer to the last three questions is yes, What if I want to do this on multiple vertex arrays every frame? I'm assuming doing this on any more than 1 vertex array would cause OpenGL to drop the last used array from graphics memory and start using the new one. But in my case the vertex arrays are never going to change. So what I want to know is does opengl keep my vertex arrays around in-case next time I send it vertex data it will be the same data? If not is there a way I can optimize this to allow something like this? Basically I want to draw procedurally between the vertexes using indicies without updating the vertex data, in order to reduce overhead and speed up complicated rendering that requires constant procedurally changing shapes that will always use the vertexes from the original vertex array. Is this possible or am I just fantasizing?
If I'm just fantasizing about my fourth question what are some good fast ways of drawing a whole lot of polygons each frame where only a few will change? Do I always have to pass in a totally new set of vertex data for even small changes? Does it already do this anyways when the vertex data doesn't change because I notice I cant really get around the vertex pointer call each frame.
Feel free to totally slam any logic errors I've made in my assertions. I'm trying to learn everything I can about how opengl works and it's entirely possible my current assumptions on how it works are all wrong.
1.So lets say I pass OpenGL the pointer to my vertex array. Then I can call glDrawElements with an array of indexes. It will draw the
requested shapes using those indexes in the vertex array correct?
Yes.
2.After that glDrawElements call could I then do another glDawElements
call with another set of indexes? Would it then draw the new index
array using the original vertex array?
Yes.
3.Does OpenGL keep my vertex data around for the next frame when I redo
all of these calls? So the the next vertex pointer call would be a lot
quicker?
Answering that is a bit more tricky than you might. The way you ask these questions makes me to assume that uou use client-side vertex arrays, that is, you have some arrays in your system memory and let your vertes pointers point directly to those. In that case, the answer is no. The GL cannot "cache" that data in any useful way. After the draw call is finished, it must assume that you might change the data, and it would have to compare every single bit to make sure you have not changed anything.
However, client side VAs are not the only way to have VAs in the GL - actually, they are completely outdated, deprecated since GL3.0 and been removed from modern versions of OpenGL. The modern way of doing thins is using Vertex Buffer Objects, which basically are buffers which are managed by the GL, but manipulated by the user. Buffer objects are just a chunk of memory, but you will need special GL calls to create them, read or write or change data and so on. And the buffer object might very well not be stored in system memory, but directly in VRAM, which is very useful for static data which is used over and over again. Have a look at the GL_ARB_vertex_buffer_object extension spec, which orignially introduced that feature in 2003 and became core in GL 1.5.
4.Assuming the answer to the last three questions is yes, What if I want
to do this on multiple vertex arrays every frame? I'm assuming doing
this on any more than 1 vertex array would cause OpenGL to drop the
last used array from graphics memory and start using the new one. But
in my case the vertex arrays are never going to change. So what I want
to know is does opengl keep my vertex arrays around in-case next time
I send it vertex data it will be the same data? If not is there a way
I can optimize this to allow something like this? Basically I want to
draw procedurally between the vertexes using indicies without updating
the vertex data, in order to reduce overhead and speed up complicated
rendering that requires constant procedurally changing shapes that
will always use the vertexes from the original vertex array. Is this
possible or am I just fantasizing?
VBOs are exactly what you are looking for, here.
5.If I'm just fantasizing about my fourth question what are some good
fast ways of drawing a whole lot of polygons each frame where only a
few will change? Do I always have to pass in a totally new set of
vertex data for even small changes? Does it already do this anyways
when the vertex data doesn't change because I notice I cant really get
around the vertex pointer call each frame.
You can also update just parts of a VBO. However, it might become inefficient if you have many small parts which are randomliy distributed in your buffer, it will be more efficient to update continous (sub-)regions. But that is a topic on it's own.
Yes
Yes
No. As soon as you create a Vertex Buffer Object (VBO) it will stay in the GPU memory. Otherwise vector data needs to be re-transferred (an old method of avoiding this was Display Lists). In both cases the performance of subsequent frames should stay similar (but much better with the VBO method): you can do the VBO creation and download before rendering the first frame.
The VBO was introduced to provide you exactly with this functionality. Just create several VBOs. Things get messy when you need more GPU memory than available though.
VBO is still the answer, and see Modifying only a specific element type of VBO buffer data?
It sounds like you should try something called Vertex Buffer Objects. It offers the same benefits as Vertex Arrays, but you can create multiple vertex buffers and store them in "named slots". This method has much better performance as data is stored directly in Graphic Card memory.
Here is a good tutorial in C++ to start with.
In this question I'm interested in buffer-drawing in OpenGL, specifically in the tradeoff of using one buffer per data set vs one buffer for more than one data set.
Context:
Consider a data set of N vertices each represented by a set of attributes (e.g. color, texture, normals).
Each attribute is represented by a type (e.g. GLfloat, GLint) and a number of components (2, 3, 4). We want to draw this data. Schematically,
(non-interleaved representation)
data set
<-------------->
a_1 a_2 a_3
<---><---><---->
a_i = attribute; e.g. a2 = (3 GLfloats representing color, thus 3*N Glfloats)
We want to map this into the GL state, using glBufferSubData.
Problem
When mapping, we have to keep track of the data in our memory because glBufferSubData requires a start and size. This sounds to me like an allocation problem: we want to allocate memory and keep track of its position. Since we want fast access to it, we would like the data to be in the same memory position, e.g. with a std::vector<char>. Schematically,
data set 1 data set 2
<------------><-------------->
(both have same buffer id)
We commit to the gl state as:
// id is binded to one std::vector<char>, "data".
glBindBuffer(target, id);
// for each data_set (AFTER calling glBindBuffer).
// for each attribute
// "start": the start point of the attribute.
// "size": (sizeof*components of the attribute)*N.
glBufferSubData(target, start, size, &(data[0]))
(non non-interleaved for the sake of the code).
the problem arises when we want to add or remove vertices, e.g. when LOD changes. Because each data set must be a chunk, for instance to allow interleaved drawing (even in non-interleaved, each attribute is a chunk), we will end up with fragmentation in our std::vector<char>.
On the other hand, we can also set one chunk per buffer: instead of assigning chunks to the same buffer, we assign each chuck, now a std::vector<char>, to a different buffer. Schematically,
data set 1 (buffer id1)
<------------>
data set 2 (buffer id2)
<-------------->
We commit data to the gl state as:
// for each data_set (BEFORE calling glBindBuffer).
// "data" is the std::vector<char> of this data_set.
// id is now binded to the specific std::vector<char>
glBindBuffer(target, id);
// for each attribute
// "start": the start point of the attribute.
// "size": (sizeof*components of the attribute)*N.
glBufferSubData(target, start, size, &(data[0]))
Questions
I'm learning this, so, before any of the below: is this reasoning correct?
Assuming yes,
Is it a problem to have an arbitrary number of buffers?
Is "glBindBuffer" expected to scale with the number of buffers?
What are the major points to take into consideration in this decision?
It is not quite clear if you asking about performance trade-offs. But I will answer in this key.
Is it a problem to have an arbitrary number of buffers?
It is a problem came from a dark medieval times when pipelines was fixed and rest for now due to backward compatibility reasons. glBind* is considered as a (one of) performance bottleneck in modern OpenGL drivers, caused by bad locality of references and cache misses. Simply speaking, cache is cold and huge part of time CPU just waits in driver for data transferred from main memory. There is nothing drivers implementers can do with current API. Read Nvidia's short article about it and their bindless extensions proposals.
2. Is "glBindBuffer" expected to scale with the number of buffers?
Surely, the more objects (buffers in your case), more bind calls, more performance loss in driver. But merged, huge resource objects are less manageable.
3. What are the major points to take into consideration in this decision?
Only one. Profiling results ;)
"Premature optimization is the root of all evil", so try to stay as much objective as possible and believe only in numbers. When numbers will go bad, we can think of:
"Huge", "all in one" resources:
less bind calls
less context changes
harder to manage and debug, need some additional code infrastructure (to update resource data for example)
resizing (reallocation) very slow
Separate resources:
more bind calls, loosing time in driver
more context changes
easier to manage, less error-prone
easy to resize, allocate, reallocate
In the end, we can see have performance-complexity trade-off and different behavior when update data. To stick one approach or another, you must:
decide, would you like to keep things simple, manageable or add complexity and gain additional FPS (profile in graphics profilers to know how much. Does it worth it?)
know how often you resize/reallocate buffers (trace API calls in graphics debuggers).
Hope it helps somehow ;)
If you like theoretical assertions like this, probably you will be interested in another one, about interleaving (DirectX one)