Buffer drawing in OpenGL - opengl

In this question I'm interested in buffer-drawing in OpenGL, specifically in the tradeoff of using one buffer per data set vs one buffer for more than one data set.
Context:
Consider a data set of N vertices each represented by a set of attributes (e.g. color, texture, normals).
Each attribute is represented by a type (e.g. GLfloat, GLint) and a number of components (2, 3, 4). We want to draw this data. Schematically,
(non-interleaved representation)
data set
<-------------->
a_1 a_2 a_3
<---><---><---->
a_i = attribute; e.g. a_2 = (3 GLfloats representing color, thus 3*N GLfloats)
We want to map this into the GL state, using glBufferSubData.
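The "start" and "size" values for each attribute in the non-interleaved schematic above can be computed up front. The following is a minimal sketch, not part of the original question: the `Attribute` and `Range` types and the `layoutNonInterleaved` function are hypothetical names introduced for illustration, and no GL call is made.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one vertex attribute (illustrative names).
struct Attribute {
    std::size_t componentSize;   // e.g. sizeof(float)
    std::size_t componentCount;  // 2, 3 or 4
};

// Byte range of one attribute inside the buffer.
struct Range {
    std::size_t start;
    std::size_t size;
};

// Places the attributes one after another (non-interleaved) for
// `vertexCount` vertices, starting at byte offset `base`.
std::vector<Range> layoutNonInterleaved(const std::vector<Attribute>& attrs,
                                        std::size_t vertexCount,
                                        std::size_t base = 0) {
    assert(vertexCount > 0);
    std::vector<Range> ranges;
    std::size_t cursor = base;
    for (const Attribute& a : attrs) {
        const std::size_t size =
            a.componentSize * a.componentCount * vertexCount;
        ranges.push_back({cursor, size});
        cursor += size;
    }
    return ranges;
}
```

Each resulting `Range` would then supply the offset and size arguments of one glBufferSubData call.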
Problem
When mapping, we have to keep track of the data in our memory because glBufferSubData requires a start and size. This sounds to me like an allocation problem: we want to allocate memory and keep track of its position. Since we want fast access to it, we would like the data to be in the same memory position, e.g. with a std::vector<char>. Schematically,
data set 1 data set 2
<------------><-------------->
(both have same buffer id)
We commit to the gl state as:
// id is bound to one std::vector<char>, "data".
glBindBuffer(target, id);
// for each data_set (AFTER calling glBindBuffer).
// for each attribute
// "start": the start point of the attribute.
// "size": (sizeof*components of the attribute)*N.
// Assumes "data" mirrors the buffer layout, so the source offset equals "start".
glBufferSubData(target, start, size, &data[start]);
(Non-interleaved for the sake of the code.)
The problem arises when we want to add or remove vertices, e.g. when the LOD changes. Because each data set must be a contiguous chunk, for instance to allow interleaved drawing (even in the non-interleaved case, each attribute is a chunk), we will end up with fragmentation in our std::vector<char>.
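The fragmentation described here can be reproduced with a small first-fit range allocator. This is only a sketch under stated assumptions: `RangeAllocator` and `kNoSpace` are hypothetical names, the allocator tracks byte ranges only, and it does not touch OpenGL.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Sentinel returned when no free range is large enough (illustrative).
constexpr std::size_t kNoSpace = static_cast<std::size_t>(-1);

// Hypothetical first-fit allocator for byte ranges inside one shared buffer.
class RangeAllocator {
public:
    explicit RangeAllocator(std::size_t capacity) { free_[0] = capacity; }

    // Returns the start offset of a free range of `size` bytes, or kNoSpace.
    std::size_t allocate(std::size_t size) {
        assert(size > 0);
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second < size) continue;
            const std::size_t start = it->first;
            const std::size_t rest = it->second - size;
            free_.erase(it);
            if (rest > 0) free_[start + size] = rest;
            return start;
        }
        return kNoSpace;
    }

    // Returns a range to the free list and merges it with its neighbours.
    void release(std::size_t start, std::size_t size) {
        auto it = free_.emplace(start, size).first;
        auto next = std::next(it);
        if (next != free_.end() && it->first + it->second == next->first) {
            it->second += next->second;
            free_.erase(next);
        }
        if (it != free_.begin()) {
            auto prev = std::prev(it);
            if (prev->first + prev->second == it->first) {
                prev->second += it->second;
                free_.erase(it);
            }
        }
    }

    // Total number of free bytes, regardless of how they are split up.
    std::size_t freeBytes() const {
        std::size_t total = 0;
        for (const auto& block : free_) total += block.second;
        return total;
    }

private:
    std::map<std::size_t, std::size_t> free_;  // start offset -> size
};
```

With a 100-byte buffer holding two 40-byte data sets, releasing the first one leaves 60 free bytes, yet a 50-byte request fails because the free space is split into a 40-byte and a 20-byte hole.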
On the other hand, we can also use one chunk per buffer: instead of assigning chunks to the same buffer, we assign each chunk, now a std::vector<char>, to a different buffer. Schematically,
data set 1 (buffer id1)
<------------>
data set 2 (buffer id2)
<-------------->
We commit data to the gl state as:
// for each data_set (BEFORE calling glBindBuffer).
// "data" is the std::vector<char> of this data_set.
// id is now bound to the specific std::vector<char>
glBindBuffer(target, id);
// for each attribute
// "start": the start point of the attribute.
// "size": (sizeof*components of the attribute)*N.
// Assumes "data" mirrors the buffer layout, so the source offset equals "start".
glBufferSubData(target, start, size, &data[start]);
Questions
I'm learning this, so, before any of the below: is this reasoning correct?
Assuming yes,
Is it a problem to have an arbitrary number of buffers?
Is "glBindBuffer" expected to scale with the number of buffers?
What are the major points to take into consideration in this decision?

It is not quite clear whether you are asking about performance trade-offs, but I will answer in that vein.
1. Is it a problem to have an arbitrary number of buffers?
It is a problem inherited from the dark medieval times of the fixed-function pipeline, and it remains today for backward-compatibility reasons. glBind* calls are considered one of the performance bottlenecks in modern OpenGL drivers, caused by poor locality of reference and cache misses. Simply put, the cache is cold, and for much of the time the CPU just waits inside the driver for data to arrive from main memory. There is nothing driver implementers can do about it with the current API. Read Nvidia's short article about it and their bindless extension proposals.
2. Is "glBindBuffer" expected to scale with the number of buffers?
Certainly: the more objects (buffers in your case), the more bind calls and the more time lost in the driver. But huge, merged resource objects are less manageable.
3. What are the major points to take into consideration in this decision?
Only one: profiling results ;)
"Premature optimization is the root of all evil", so try to stay as objective as possible and believe only in numbers. When the numbers go bad, we can think about:
"Huge", "all in one" resources:
fewer bind calls
fewer context changes
harder to manage and debug, needs some additional code infrastructure (to update resource data, for example)
resizing (reallocation) is very slow
Separate resources:
more bind calls, losing time in the driver
more context changes
easier to manage, less error-prone
easy to resize, allocate, reallocate
In the end, we have a performance-complexity trade-off and different behavior when updating data. To choose one approach or the other, you must:
decide whether you would rather keep things simple and manageable, or add complexity and gain additional FPS (profile in a graphics profiler to know how much; is it worth it?)
know how often you resize/reallocate buffers (trace API calls in a graphics debugger).
Hope it helps somehow ;)
If you like theoretical discussions like this, you will probably be interested in another one, about interleaving (a DirectX one).

Related

Reuse vertex attribute buffer as index buffer?

Can I use a VBO which I initialise like this:
GLuint bufferID;
glGenBuffers(1,&bufferID);
glBindBuffer(GL_ARRAY_BUFFER,bufferID);
glBufferData(GL_ARRAY_BUFFER,nBytes,indexData,GL_DYNAMIC_DRAW);
as an index buffer, like this:
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER,bufferID);
/* ... set up vertex attributes, NOT using bufferID in the process ... */
glDrawElements(...);
I would like to use the buffer mostly as an attribute buffer and occasionally as an index buffer (but never at the same time).
There is nothing in the GL which prevents you from doing such things, your code above is legal GL. You can bind every buffer to every buffer binding target (you can even bind the same buffer to different targets at the same time, so it is even OK if attributes and index data come from the same buffer). However, the GL implementation might do some optimizations based on the observed behavior of the application, so you might end up with sub-optimal performance if you suddenly change the usage of an existing buffer object with such an approach, or use it for two things at once.
Update
The ARB_vertex_buffer_object extension spec, which introduced the concept of buffer objects to OpenGL, mentions this topic in the "Issues" section:
Should this extension include support for allowing vertex indices to be stored in buffer objects?
RESOLVED: YES. It is easily and cleanly added with just the
addition of a binding point for the index buffer object. Since
our approach of overloading pointers works for any pointer in GL,
no additional APIs need be defined, unlike in the various
*_element_array extensions.
Note that it is expected that implementations may have different
memory type requirements for efficient storage of indices and
vertices. For example, some systems may prefer indices in AGP
memory and vertices in video memory, or vice versa; or, on
systems where DMA of index data is not supported, index data must
be stored in (cacheable) system memory for acceptable
performance. As a result, applications are strongly urged to
put their models' vertex and index data in separate buffers, to
assist drivers in choosing the most efficient locations.
The reasoning that some implementations might prefer to keep index buffers in system RAM seems quite outdated, though.
While completely legal, it's sometimes discouraged to have attribute data and index data in the same buffer. I suspect that this is mostly based on a paragraph in the spec document (e.g. page 49 of the OpenGL 3.3 spec, at the end of the section "2.9.7 Array Indices in Buffer Objects"):
In some cases performance will be optimized by storing indices and array data in separate buffer objects, and by creating those buffer objects with the corresponding binding points.
While it seems plausible that it could be harmful to performance, I would be very interested to see benchmark results on actual platforms showing it. Attribute data and index data are used at the same time, and with the same access operations (CPU write, or blit from temporary storage, for filling the buffer with data, GPU read during rendering). So I can't think of a very good reason why they would need to be treated differently.
The only difference I can think of is that the index data is always read sequentially, while the attribute data is read out of order during indexed rendering. So it might be possible to apply different caching attributes for performance tuning the access in both cases.

Using separate vertex buffer for dynamic and static objects in DirectX11

Are there any benefits of having separate vertex buffers for static and dynamic objects in a DirectX 11 application? My approach is to have the vertices of all objects in a scene stored in the same vertex buffer.
However, I will only have to re-map a small number of objects (1 to 5) of the whole collection (up to 200 objects). The majority of objects are static and will not be transformed in any way. What is the best approach for doing this?
Mapping a big vertex buffer with discard forces the driver to allocate new memory every frame. Up to ~4 frames can be in flight, so there can be 4 copies of that buffer. This can lead to memory overcommitment and stuttering. For example, ATI advises discarding vertex buffers of at most 4 MB (GCN Performance Tweets). Besides, you will have to needlessly copy the static data to a new vertex buffer every time.
Mapping with no overwrite should work better. It would require managing the memory manually, so that you don't overwrite data which is still in flight. I'm not sure about the performance implications, but this certainly isn't a recommended path.
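One commonly described way of managing the memory manually is to append new data behind the last write using no-overwrite mapping, and to fall back to a discard only when the buffer is full. The sketch below shows just that bookkeeping under stated assumptions: `MapMode`, `Placement` and `DynamicBufferCursor` are hypothetical names, the enum values stand in for D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD, and no Direct3D call is made.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for D3D11_MAP_WRITE_NO_OVERWRITE / D3D11_MAP_WRITE_DISCARD.
enum class MapMode { NoOverwrite, Discard };

// Where the next write should go and how the buffer should be mapped.
struct Placement {
    MapMode mode;
    std::size_t offset;
};

// Hypothetical cursor over one dynamic vertex buffer of fixed capacity.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity)
        : capacity_(capacity) {}

    // Appends behind the previous write while there is room; otherwise
    // restarts at offset 0 and asks for a discard, so that data still
    // in flight is never overwritten.
    Placement place(std::size_t size) {
        assert(size > 0 && size <= capacity_);
        if (cursor_ + size > capacity_) {
            cursor_ = size;
            return {MapMode::Discard, 0};
        }
        const std::size_t offset = cursor_;
        cursor_ += size;
        return {MapMode::NoOverwrite, offset};
    }

private:
    std::size_t capacity_;
    std::size_t cursor_ = 0;
};
```

This only decides offsets and map modes; whether it performs well on a given driver is something to measure rather than assume.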
The best approach is to simplify the driver's work by providing as many hints as possible. Create static vertex buffers with the immutable flag, long-lived ones with the default flag and dynamic ones with the dynamic flag. See vendor guides like GCN Performance Tweets or Don’t Throw it all Away: Efficient Buffer Management for additional details.

glMapBufferRange() downloads full buffer?

I noticed a 15 ms slowdown when calling some OpenGL functions. After some tests I believe I have narrowed down the issue. I have a buffer of a couple of MBytes (containing mostly particles). I sometimes need to add some particles. To do so I bind the buffer, read the current number of particles to know the offset where to write, then write the particles. As expected, the slowdown is in the reading part. (For this problem, assume that keeping track of the number of particles on the CPU side is impossible.)
glBindBuffer(GL_ARRAY_BUFFER, m_buffer);
GLvoid* rangePtr = glMapBufferRange( //This function takes 15ms to return
GL_ARRAY_BUFFER,
m_offsetToCounter,
sizeof(GLuint),
GL_MAP_READ_BIT); // named flag instead of the literal 1 (same value)
if(rangePtr != NULL)
value = *(GLuint*) rangePtr;
m_functions->glBindBuffer(GL_ARRAY_BUFFER, 0);
I assumed by providing a really limited size (here a GLuint), only a GLuint would be downloaded. However, by reducing drastically the size of my buffer to 200 KBytes, the execution time of the function drops to 8ms.
Two questions:
Do glMapBufferRange and glGetBufferSubData download the full buffer even though the user only asks for a portion of it?
The math doesn't add up, any idea why? It still takes 8 ms to download a really small buffer. The execution time looks like y = ax + b, where b is 7-8 ms. When I was trying to find the source of the problem, before suspecting the buffer size, I also found that glUniform* functions took ~10 ms as well, but only on the first call. If there are multiple glUniform* calls one after the other, only the first one takes a lot of time. The others are instantaneous. And when the buffer is subsequently read, there is no download time either. Is glUniform* triggering something?
I'm using the Qt 5 API. I would like to be sure first that I'm using openGL properly before thinking it might be Qt's layer that causes the slow down and re-implement the whole program with glu/glut.
8ms sounds like an awful lot of time… How do you measure that time?
Do glMapBufferRange and glGetBufferSubData download the full buffer even though the user only asks for a portion of it?
The OpenGL specification does not define how buffer mapping is to be implemented. It may be a full download of the buffer's contents. It may be a single-page I/O memory mapping. It may be anything that makes the contents of the buffer object appear in the host process address space.
The math doesn't add up, any idea why?
For one thing, the smallest size of a memory map is the system's page size. Whether it's done by a full object copy, by an I/O memory mapping or by something entirely different, you're always dealing with memory chunks at least a few KiB in size.
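To make the page-size point concrete, the smallest region that can back a mapping of a given length is that length rounded up to whole pages. The helper below is a hypothetical illustration; the 4096-byte page size used in the usage note is an assumption, and real systems may differ.

```cpp
#include <cassert>
#include <cstddef>

// Rounds a requested mapping length up to a whole number of pages.
// `pageSize` is supplied by the caller; it is not queried from the OS here.
std::size_t roundUpToPage(std::size_t bytes, std::size_t pageSize) {
    assert(pageSize > 0);
    return ((bytes + pageSize - 1) / pageSize) * pageSize;
}
```

Under that assumption, `roundUpToPage(sizeof(unsigned int), 4096)` yields 4096: mapping a single counter still involves at least one full page.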
I'm using the Qt 5 API
Could it be that you're using the Qt5 OpenGL functions class? AFAIK this class does load function pointers on demand, so the first invocation of a function may trigger a chain of actions that take a few moments to complete.

OpenGL Texture and Object Streaming

I have a need to stream a texture (essentially a camera feed).
With object streaming, the following scenarios seem to arise:
1. Is the new object's data store larger, smaller or the same size as the old one?
2. Is a subset of the texture or the whole texture being updated?
3. Are we streaming a buffer object or texture object (any difference?)
Here are the following approaches I have come across:
1. Allocate the object data store (either BufferData for buffers or TexImage2D for textures) and then, each frame, update a subset of the data with BufferSubData or TexSubImage2D
2. Nullify/invalidate the object after the last call (e.g. draw) that uses the object, either with:
Nullify: glTexSubImage2D( ..., NULL), glBufferSubData( ..., NULL)
Invalidate: glInvalidateBufferData(), glMapBufferRange with the GL_MAP_INVALIDATE_BUFFER_BIT, glDeleteTextures ?
3. Simply reinvoke BufferData or TexImage2D with the new data
4. Manually implement object multi-buffering / buffer ping-ponging.
Most immediately, my problem scenario is: the entire texture being replaced with a new one of the same size. How do I implement this? Will (1) implicitly synchronize? Does (2) avoid the synchronization? Will (3) synchronize, or will a new data store for the object be allocated, where our update can be uploaded without waiting for all drawing using the old object state to finish? This passage from the Red Book V4.3 makes me believe so:
Data can also be copied between buffer objects using the
glCopyBufferSubData() function. Rather than assembling chunks of data
in one large buffer object using glBufferSubData(), it is possible to
upload the data into separate buffers using glBufferData() and then
copy from those buffers into the larger buffer using
glCopyBufferSubData(). Depending on the OpenGL implementation, it may
be able to overlap these copies because each time you call
glBufferData() on a buffer object, it invalidates whatever contents
may have been there before. Therefore, OpenGL can sometimes just
allocate a whole new data store for your data, even though a copy
operation from the previous store has not completed yet. It will then
release the old storage at a later opportunity.
But if so, why the need for (2) [nullify/invalidate]?
Also, please discuss the above approaches, and others, and their effectiveness for the various scenarios, while keeping in mind at least the following issues:
Whether implicit synchronization to the object (i.e. synchronizing our update with OpenGL's usage) occurs
Memory usage
Speed
I've read http://www.opengl.org/wiki/Buffer_Object_Streaming but it doesn't offer conclusive information.
Let me try to answer at least a few of the questions you raised.
The scenarios you talk about can have a great impact on the performance of the different approaches, especially when considering the first point about the dynamic size of the buffer. In your scenario of video streaming, the size will rarely change, so a more expensive "re-configuration" of the data structures you use might be possible. If the size changes every frame or every few frames, this is typically not feasible. However, if a reasonable maximum size limit can be enforced, just using buffers/textures with the maximum size might be a good strategy. Neither with buffers nor with textures do you have to use all the space there is (although there are some smaller issues when you do this with textures, like wrap modes).
3. Are we streaming a buffer object or texture object (any difference?)
Well, the only way to efficiently stream image data to or from the GL is to use pixel buffer objects (PBOs). So you always have to deal with buffer objects in the first place, no matter whether vertex data, image data or any other data is to be transferred. The buffer is just the source for some glTex*Image() call in the texture case, and of course you'll need a texture object for that.
Let's come to your approaches:
In approach (1), you use the "Sub" variant of the update commands. In that case, (parts of or the whole) storage of the existing object is updated. This is likely to trigger an implicit synchronization if old data is still in use. The GL has basically only two options: wait for all operations (potentially) depending on that data to complete, or make an intermediate copy of the new data and let the client go on. Neither option is good from a performance point of view.
In approach (2), you have a misconception. The "Sub" variants of the update commands will never invalidate/orphan your buffers. The "non-sub" glBufferData() will create completely new storage for the object, and using it with NULL as the data pointer will leave that storage uninitialized. Internally, the GL implementation might re-use some memory which was in use for earlier buffer storage. So if you follow this scheme, there is some probability that you effectively end up using a ring buffer of the same memory areas if you always use the same buffer size.
The other methods of invalidation you mentioned also allow you to invalidate parts of the buffer and give you more fine-grained control over what is happening.
Approach (3) is basically the same as (2) with the glBufferData() orphaning, but you just specify the new data directly at this stage.
Approach (4) is the one I would actually recommend, as it is the one which gives the application the most control over what is happening, without having to rely on the GL implementation's specific internal workings.
Without taking synchronization into account, the "sub" variant of the update commands is more efficient, even if the whole data storage is to be changed, not just some part. That is because the "non-sub" variants of the commands basically recreate the storage and introduce some overhead with this. By manually managing the ring buffers, you can avoid any of that overhead, and you don't have to rely on the GL to be clever, by just using the "sub" variants of the update functions. At the same time, you can avoid implicit synchronization by only updating buffers which aren't in use by the GL any more. This scheme can also nicely be extended to a multi-threaded scenario. You can have one (or several) extra threads with separate (but shared) GL contexts to fill the buffers for you, just passing the buffer handles to the draw thread as soon as the update is complete. You can also just map the buffers in the draw thread and let them be filled by worker threads (without the need for additional GL contexts at all).
OpenGL 4.4 introduced GL_ARB_buffer_storage and with it came the GL_MAP_PERSISTENT_BIT for glMapBufferRange. That allows you to keep all of the buffers mapped while they are used by the GL, so you avoid the overhead of mapping the buffers into the address space again and again. You will then have no implicit synchronization at all, but you have to synchronize the operations manually. OpenGL's synchronization objects (see GL_ARB_sync) might help you with that, but the main burden of synchronization is on your application's logic itself. When streaming videos to the GL, just avoid immediately re-using the buffer which was the source for the glTexSubImage() call, and try to delay its re-use as long as possible. You are of course also trading throughput for latency. If you need to minimize latency, you might have to tweak this logic a bit.
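The bookkeeping behind such a manually managed ring of buffers can be sketched without any GL calls. In the example below, `BufferRing` is a hypothetical class and frame numbers stand in for whatever completion signal the application uses (for instance a sync object per frame); both are assumptions made for illustration, not something prescribed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping for a ring of streaming buffers.
// Frames are numbered from 1; 0 means "never used".
class BufferRing {
public:
    explicit BufferRing(std::size_t count) : lastUse_(count, 0) {
        assert(count > 0);
    }

    // Returns the index of the next buffer in the ring if the GPU has
    // finished the frame that last read from it, or -1 if it is still busy.
    int acquire(std::uint64_t completedFrame) {
        if (lastUse_[next_] > completedFrame) return -1;
        const int index = static_cast<int>(next_);
        next_ = (next_ + 1) % lastUse_.size();
        return index;
    }

    // Records that `frame` issued commands reading from buffer `index`.
    void markUsed(int index, std::uint64_t frame) {
        assert(index >= 0 &&
               static_cast<std::size_t>(index) < lastUse_.size());
        lastUse_[static_cast<std::size_t>(index)] = frame;
    }

private:
    std::vector<std::uint64_t> lastUse_;  // last frame that used each buffer
    std::size_t next_ = 0;                // next buffer to hand out
};
```

When `acquire` returns -1, the application can wait, skip the update, or grow the ring; which of these is appropriate depends on the throughput and latency requirements mentioned above.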
Comparing the approaches for "memory usage" is really hard. There are a lot of implementation-specific details to consider here. A GL implementation might keep some old buffer memory around for some time to fulfill recreation requests of the same size. Also, a GL implementation might make shadow copies of any data at any time. The approaches which don't orphan and recreate storage all the time in principle give more control over the memory which is in use.
"Speed" itself is also not a very useful metric. You basically have to balance throughput and latency here, according to the requirements of your application.

Non-interleaved vertex buffers DirectX11

If my vertex positions are shared, but my normals and UVs are not (to preserve hard edges and the like), is it possible to use non-interleaved buffers in DirectX11 to solve this memory representation, such that I could use an index buffer with it? Or should I stick with duplicated vertex positions in an interleaved buffer?
And are there any performance concerns between interleaved and non-interleaved vertex buffers? Thank you!
How to
There are several ways. I'll describe the simplest one.
Just create separate vertex buffers:
ID3D11Buffer* positions;
ID3D11Buffer* texcoords;
ID3D11Buffer* normals;
Create input layout elements, incrementing InputSlot member for each component:
{ "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
{ "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 1, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0 },
{ "NORMAL", 0, DXGI_FORMAT_R32G32B32_FLOAT, 2, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0 },
// The fourth member of each element (0, 1, 2 above) is the InputSlot.
Bind buffers to their slots (better all in one shot):
ID3D11Buffer* vbs[] = { positions, texcoords, normals };
unsigned int strides[] = { /*strides go here*/ };
unsigned int offsets [] = { /*offsets go here*/ };
m_Context->IASetVertexBuffers(0, 3, vbs, strides, offsets );
Draw as usual.
You don't need to change the HLSL code (HLSL will behave as if it had a single buffer).
Note that the code snippets were written on the fly and may contain mistakes.
Edit: you can improve this approach by combining buffers by update rate: if texcoords and normals never change, merge them.
As for performance
It is all about locality of reference: the closer the data, the faster the access.
An interleaved buffer, in most cases, gives (by far) more performance on the GPU side (i.e. rendering): for each vertex, all attributes are near each other. But separate buffers give faster CPU access: the arrays are contiguous, so each element sits next to the previous one.
So, overall, the performance concerns depend on how often you write to the buffers. If your limiting factor is CPU writes, stick to separate buffers. If not, go for a single one.
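The difference between the two layouts can be made concrete by converting separate arrays into one interleaved array. This is a minimal sketch assuming three float positions, three float normals and two float texture coordinates per vertex; the `interleave` function is a hypothetical helper, not part of any API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Packs separate attribute arrays into one interleaved array with the
// per-vertex layout [px py pz | nx ny nz | u v] (8 floats per vertex).
std::vector<float> interleave(const std::vector<float>& positions,
                              const std::vector<float>& normals,
                              const std::vector<float>& uvs) {
    assert(positions.size() % 3 == 0);
    const std::size_t vertexCount = positions.size() / 3;
    assert(normals.size() == vertexCount * 3);
    assert(uvs.size() == vertexCount * 2);

    std::vector<float> out;
    out.reserve(vertexCount * 8);
    for (std::size_t i = 0; i < vertexCount; ++i) {
        for (std::size_t k = 0; k < 3; ++k) out.push_back(positions[3 * i + k]);
        for (std::size_t k = 0; k < 3; ++k) out.push_back(normals[3 * i + k]);
        for (std::size_t k = 0; k < 2; ++k) out.push_back(uvs[2 * i + k]);
    }
    return out;
}
```

In the separate layout, rewriting only the positions touches one contiguous array. In the interleaved result, the same update has to step through memory with a stride of 8 floats, which is the CPU-side cost referred to above.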
How will you know? There is only one way: profile, on both the CPU side and the GPU side (via the graphics debugger/profiler from your GPU's vendor).
Other factors
The best practice is to limit CPU writes, so if you find that you are limited by buffer updating, you probably need to review your approach. Do we need to update the buffer each frame if we have 500 fps? The user won't see a difference if you reduce the buffer update rate to 30-60 times per second (decouple buffer updates from frame updates). So, if your updating strategy is reasonable, you will likely never be CPU-limited and the best approach is classic interleaving.
You can also consider re-designing your data pipeline, or even somehow prepare data offline (we call it "baking"), so you will not need to cope with non-interleaved buffers. That will be quite reasonable too.
Reduce memory footprint or increase performance?
Memory-to-performance tradeoff. This is the eternal question. Duplicate memory to take advantages of interleaving? Or not?
The answer is... "that depends". Are you programming a new CryEngine, targeting top GPUs with gigabytes of memory? Or are you programming for embedded systems or mobile platforms, where memory resources are slow and limited? Is 1 megabyte of memory worth the hassle at all? Or do you have huge models, 100 MB each? We don't know.
It's all up to you to decide. But remember: there are no free candies. If you find the memory savings worth the performance loss, do it. Profile and compare to be sure.
Hope it helps somehow. Happy coding! =)
Interleaved/Separate will mostly affect your Input Assembler stage (GPU side).
A perfect scenario for Interleaved is when your Buffer memory arrangement perfectly fits your vertex shader input. So your Input assembler can simply fetch the data.
In that case you'll be totally fine with interleaved, even though when I tested with a large model (two versions of the same data, one interleaved, one separate), a timestamp query didn't report any major difference (with some pretty minimal vertex processing and a basic pixel shader).
Now having separate buffers makes it much easier to fine tune in case you use your geometry in different contexts.
Let's say you have Position/Normals/UV (like in your case).
Now you also have a shader in your pipeline that only requires Position (Shadow Map would be a pretty good example).
With separate buffers, you can simply create a new input layout which contains position only, and bind that buffer instead. Your IA stage then only has to load that buffer. Best of all, you can even do that dynamically using shader reflection.
If you bind Interleaved data, you will have some overhead due to the fact it has to load with a stride.
When I tested that one I had about 20% gain using Separate instead of Interleaved, which can be quite decent, but since this type of processing can be largely architecture dependent, don't take it for granted (NVidia 740M for testing).
So simply put, profile (a lot), and check which gives you the best balance between your GPU and CPU loads.
Please also note that the overhead from the Input Assembler will shrink relative to the complexity of your shader: if you apply some heavy calculations, add some tessellation and some decent shading, the time difference between interleaved and non-interleaved will become progressively meaningless.
You should stick with interleaved buffers. Any other technique will require some form of indirection to your non-duplicated position buffer, which will cost you performance and cache efficiency.
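For the interleaved route, meshes whose positions, normals and UVs are indexed separately are usually flattened so that every distinct (position, normal, UV) combination becomes one vertex with one index. The sketch below shows that step under stated assumptions: `Corner`, `IndexedMesh` and `buildSingleIndex` are hypothetical names, the input is a triangle list of per-corner index triples, and the attribute data itself is left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// One triangle corner: {position index, normal index, uv index}.
using Corner = std::array<std::uint32_t, 3>;

struct IndexedMesh {
    std::vector<Corner> vertices;        // unique combinations, one per vertex
    std::vector<std::uint32_t> indices;  // one entry per input corner
};

// Collapses per-attribute indices into a single index buffer. A position
// shared by corners with different normals or UVs is duplicated, which is
// what preserves hard edges.
IndexedMesh buildSingleIndex(const std::vector<Corner>& corners) {
    assert(corners.size() % 3 == 0);  // triangle list
    IndexedMesh mesh;
    std::map<Corner, std::uint32_t> seen;
    for (const Corner& c : corners) {
        auto it = seen.find(c);
        if (it == seen.end()) {
            const auto newIndex =
                static_cast<std::uint32_t>(mesh.vertices.size());
            it = seen.emplace(c, newIndex).first;
            mesh.vertices.push_back(c);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}
```

Each entry of `vertices` then tells you which position, normal and UV to copy into the corresponding interleaved vertex, and `indices` can be uploaded as an ordinary index buffer.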