How does interleaved vertex submission help performance? - opengl

I have read and seen other questions that all generally point to the suggestion to interleave vertex positions, colors, etc. into one array, as this minimizes the data that gets sent from the CPU to the GPU.
What I'm not clear on is how OpenGL does this when, even with an interleaved array, you must still make separate GL calls for the position and color pointers. If both pointers use the same array, just set to start at different points in that array, does the draw call not copy the array twice, since it was the target of two different pointers?

This is mostly about cache. For example, imagine we have 4 vertices and 4 colors. You can provide the information this way (excuse me if I don't remember the exact signatures):
glVertexPointer(3, GL_FLOAT, 0, vertex);
glColorPointer(3, GL_FLOAT, 0, colors);
What this does internally is read vertex[0], then apply colors[0], then vertex[1] with colors[1], and so on. As you can see, if the vertex array is, for example, 20 megabytes long, vertex[0] and colors[0] will be, to say the least, 20 megabytes apart from each other.
Now, on the other hand, if you provide a structure like { vertex0, color0, vertex1, color1, etc. }, there will be a lot of cache hits because, well, vertex0 and color0 are right next to each other, and so are vertex1 and color1.
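In C terms, that interleaved layout is just an array of structs (a minimal sketch, with illustrative field names):
/* One record per vertex: the position and its color sit side by side,
   so fetching a vertex pulls its color into cache along with it. */
struct Vertex {
    float position[3];
    float color[3];
};
struct Vertex mesh[4];  /* the 4 interleaved vertices from the example */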
Hope this helps answer the question
edit: on second read, I may not have answered the question. You may be wondering how OpenGL knows which values to read from that structure. With a structure such as { vertex, color, vertex, color }, you tell OpenGL that the vertex data starts at position 0 with a stride of 2 elements (so the next one is at position 2, then 4, etc.) and the color data starts at position 1, also with a stride of 2 (so position 1, then 3, etc.).
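With the legacy pointer calls from above, that stride-based setup looks roughly like this (a sketch; the coordinate and color values are placeholders):
GLfloat data[] = {
    /* x,    y,    z,     r,    g,    b  */
    0.0f, 0.0f, 0.0f,  1.0f, 0.0f, 0.0f,
    1.0f, 0.0f, 0.0f,  0.0f, 1.0f, 0.0f,
    0.0f, 1.0f, 0.0f,  0.0f, 0.0f, 1.0f,
};
GLsizei stride = 6 * sizeof(GLfloat);           /* bytes from one vertex to the next */
glVertexPointer(3, GL_FLOAT, stride, data);     /* positions start at element 0 */
glColorPointer(3, GL_FLOAT, stride, data + 3);  /* colors start at element 3 */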
addition: In case you want a more practical example, look at this link http://www.lwjgl.org/wiki/index.php?title=Using_Vertex_Buffer_Objects_(VBO). You can see there how it only provides the buffer once and then uses offsets to render efficiently.

I suggest reading: Vertex_Specification_Best_Practices
h4lc0n provided quite a nice explanation, but I would like to add some additional info:
interleaved data can actually hurt performance when your data changes often. For instance, when you change the positions of point sprites you update POS, but COLOR and TEXCOORD usually stay the same. When the data is interleaved you must "touch" that additional data anyway. In that case it would be better to have one VBO for POS only (or, in general, for the data that changes often) and a second VBO for the data that is constant (see the sketch after this list).
it is not easy to give strict rules about VBO layout, since it is very vendor/driver specific. Also, your usage may be different from others'. In general, you need to run some benchmarks for your particular test cases.
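A minimal sketch of that split, with hypothetical handle and count names (static attributes uploaded once, positions re-uploaded every frame):
/* Static attributes (COLOR, TEXCOORD): uploaded once. */
glBindBuffer(GL_ARRAY_BUFFER, staticVbo);
glBufferData(GL_ARRAY_BUFFER, numVerts * 6 * sizeof(float),
             staticData, GL_STATIC_DRAW);

/* Positions: allocated once with a dynamic usage hint... */
glBindBuffer(GL_ARRAY_BUFFER, posVbo);
glBufferData(GL_ARRAY_BUFFER, numVerts * 3 * sizeof(float),
             NULL, GL_DYNAMIC_DRAW);

/* ...and only this buffer is touched per frame. */
glBufferSubData(GL_ARRAY_BUFFER, 0,
                numVerts * 3 * sizeof(float), positions);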

You could also make an argument for separating different attributes. Assuming a GPU does not process one vertex after another but rather a bunch of them (e.g., 16) in parallel, you would get something like this while executing a vertex shader:
read attribute A for all 16 vertices
perform some computations
read attribute B for all 16 vertices
perform some more computations
....
So you read one attribute for many vertices at once. From this reasoning it would seem that interleaving the attributes actually hurts performance. Of course, this would only be visible if you are either bandwidth constrained or if the memory latency cannot be hidden for some reason (e.g., a complex shader that requires many registers will reduce the number of vertices that can be in flight at a given time).

Related

Is it possible to construct vertex on GPU from a non-XYZ vertex buffer?

I'm writing a particle simulation where the logic is updated using Intel AVX. I'm using an SoA approach to maximize my "SIMD-friendliness", but I currently shuffle the particle position components into XYZ format when updating the vertex buffer.
Is it possible to skip the shuffle step and simply pass the vertex data in XXYYZZ format, constructing each vertex in a shader stage?
My first thought was to use three vertex buffers with the x, y and z components separated, and to construct each vertex using the same subscript index to access its x, y and z components.
I'm aware that this is not the conventional way, but I would like to emphasize that this is just an experiment. Perhaps someone has some knowledge about this approach (if it is even possible) and/or could point me in the right direction? Perhaps there is a name for it as well?
Thank you!
There is no restriction on how you feed the GPU with your vertices. You can customize the input layout to read values from any number of vertex buffers; in your example, you will have at least three elements. In the vertex shader, you receive your three elements as three scalars and swizzle them back together. The only real limitation is that each component of a given vertex must sit at the same index in each buffer.
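A minimal sketch of such an input layout in D3D11 terms (the answer speaks of input layouts; the semantic names here are made up for illustration):
// One element per position component, each fed from its own vertex buffer slot.
D3D11_INPUT_ELEMENT_DESC layout[] = {
    { "POSX", 0, DXGI_FORMAT_R32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "POSY", 0, DXGI_FORMAT_R32_FLOAT, 1, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "POSZ", 0, DXGI_FORMAT_R32_FLOAT, 2, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};
In the vertex shader you would then reassemble the components, e.g. float4 pos = float4(x, y, z, 1.0f);.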
In regards to performance, unless you are chasing the top 1% of GPU performance, you will see no difference compared to a well-interleaved vertex layout. This mostly influences bandwidth and L2 cache misses, so unless you have crazy millions of particles, it is unlikely to matter. And if you do, you can use a compute shader to interleave the data in a pre-process.

How do I deal with many variables per triangle in OpenGL?

I'm working with OpenGL and am not totally happy with the standard method of passing values per triangle (or, in my case, per quad) that need to make it to the fragment shader: assigning them to each vertex of the primitive and passing them through the vertex shader, to be unnecessarily interpolated in the fragment shader (unless you use the "flat" qualifier), i.e., values that do not vary per fragment.
Is there some way to store a value per triangle (or quad) that needs to be accessed in the fragment shader, in such a way that you don't need redundant copies of it per vertex? If so, is this way better than the likely overhead of 3x (or 4x) the data-moving code on the CPU side?
I am aware of using geometry shaders to spread the values out to the new vertices, but I've heard geometry shaders are terribly slow on older hardware. Is this the case?
The OpenGL fragment language supports the gl_PrimitiveID input variable, which holds the index of the primitive for the currently processed fragment (starting at 0 for each draw call). This can be used as an index into some data store which holds the per-primitive data.
Depending on the amount of data that you will need per primitive, and the number of primitives in total, different options are available. For a small number of primitives, you could just set up a uniform array and index into that.
For a reasonably high number of primitives, I would suggest using a texture buffer object (TBO). This is basically an ordinary buffer object, which can be accessed read-only at random locations via the texelFetch GLSL operation. Note that TBOs are not really textures, they only reuse the existing texture object interface. Internally, it is still a data fetch from a buffer object, and it is very efficient with none of the overhead of the texture pipeline.
The only issue with this approach is that you cannot easily mix different data types. You have to define a base data type for your TBO, and every fetch will get you the data in that format. If you just need some floats/vectors per primitive, this is not a problem at all. If you e.g. need some ints and some floats per primitive, you could either use different TBOs, one for each type, or with modern GLSL (>=3.30), you could use an integer type for the TBO and reinterpret the integer bits as floating point with intBitsToFloat(), so you can get around that limitation, too.
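A minimal sketch of that TBO setup, with hypothetical handle names and a single float per primitive:
/* Host side: per-primitive floats in a buffer object, exposed
   to shaders through a buffer texture. */
GLuint tbo, tboTex;
glGenBuffers(1, &tbo);
glBindBuffer(GL_TEXTURE_BUFFER, tbo);
glBufferData(GL_TEXTURE_BUFFER, numPrimitives * sizeof(float),
             perPrimitiveData, GL_STATIC_DRAW);
glGenTextures(1, &tboTex);
glBindTexture(GL_TEXTURE_BUFFER, tboTex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_R32F, tbo);
In the fragment shader, declare uniform samplerBuffer perPrim; and read the value with float v = texelFetch(perPrim, gl_PrimitiveID).r;.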
You can also use one element in the vertex array for rendering multiple vertices. This is called instanced vertex attributes (see glVertexAttribDivisor).

Non-interleaved vertex buffers DirectX11

If my vertex positions are shared, but my normals and UVs are not (to preserve hard edges and the like), is it possible to use non-interleaved buffers in DirectX11 to handle this memory layout, such that I could still use an index buffer with it? Or should I stick with duplicated vertex positions in an interleaved buffer?
And are there any performance concerns between interleaved and non-interleaved vertex buffers? Thank you!
How to
There are several ways. I'll describe the simplest one.
Just create separate vertex buffers:
ID3D11Buffer* positions;
ID3D11Buffer* texcoords;
ID3D11Buffer* normals;
Create input layout elements, incrementing InputSlot member for each component:
{ "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
{ "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 1, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0 },
{ "NORMAL", 0, DXGI_FORMAT_R32G32B32_FLOAT, 2, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0 },
// ^
// InputSlot
Bind buffers to their slots (better all in one shot):
ID3D11Buffer* vbs[] = { positions, texcoords, normals };
unsigned int strides[] = { 3 * sizeof(float), 2 * sizeof(float), 3 * sizeof(float) }; // float3, float2, float3
unsigned int offsets[] = { 0, 0, 0 };
m_Context->IASetVertexBuffers(0, 3, vbs, strides, offsets);
Draw as usual.
You don't need to change the HLSL code (HLSL will behave as if it were a single buffer).
Note that these code snippets were written on the fly and may contain mistakes.
Edit: you can improve this approach by combining buffers by update rate: if texcoords and normals never change, merge them into one buffer.
As for performance
It is all about locality of reference: the closer the data, the faster the access.
An interleaved buffer, in most cases, gives (by far) better performance on the GPU side (i.e., rendering): for each vertex, all attributes are near each other. But separate buffers give faster CPU access: each array is contiguous, and each write lands next to the previous one.
So, overall, the performance trade-off depends on how often you write to the buffers. If your limiting factor is CPU writes, stick to separate buffers. If not, go for a single one.
How will you know? Only one way: profile. Both CPU side and GPU side (via the graphics debugger/profiler from your GPU's vendor).
Another factors
The best practice is to limit CPU writes, so if you find that you are limited by buffer updating, you probably need to review your approach. Do we need to update the buffer each frame if we render at 500 fps? The user won't see a difference if you reduce the buffer update rate to 30-60 times per second (decouple buffer updates from frame updates). So, if your updating strategy is reasonable, you will likely never be CPU-limited, and the best approach is classic interleaving.
You can also consider redesigning your data pipeline, or even preparing the data offline (we call it "baking"), so you won't need to cope with non-interleaved buffers at all. That would be quite reasonable too.
Reduce memory footprint or increase performance?
The memory-to-performance tradeoff: this is the eternal question. Duplicate memory to take advantage of interleaving, or not?
The answer is... "it depends". Are you programming the next CryEngine, targeting top GPUs with gigabytes of memory? Or are you programming for an embedded system or mobile platform, where memory is slow and limited? Is 1 megabyte of memory worth the hassle at all? Or do you have huge models, 100 MB each? We don't know.
It's all up to you to decide. But remember: there are no free candies. If you find the memory savings worth the performance loss, do it. Profile and compare to be sure.
Hope it helps somehow. Happy coding! =)
Interleaved/Separate will mostly affect your Input Assembler stage (GPU side).
A perfect scenario for interleaved is when your buffer memory arrangement perfectly fits your vertex shader input, so the Input Assembler can simply fetch the data.
In that case you'll be totally fine with interleaved; even testing with a large model (two versions of the same data, one interleaved, one separate), a timestamp query didn't report any major difference (with some fairly minimal vertex processing and a basic pixel shader).
Now, having separate buffers makes it much easier to fine-tune in case you use your geometry in different contexts.
Let's say you have Position/Normals/UV (like in your case).
Now you also have a shader in your pipeline that only requires Position (Shadow Map would be a pretty good example).
With separate buffers, you can simply create a new input layout which contains position only, and bind that buffer instead. Your IA stage only has to load that one buffer. Best of all, you can even do this dynamically using shader reflection.
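For illustration, a sketch of such a position-only binding for a depth-only pass, reusing the separate position buffer from the snippets above:
// Position-only input layout: the IA never touches normals or texcoords.
D3D11_INPUT_ELEMENT_DESC posOnly[] = {
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};
unsigned int stride = 3 * sizeof(float);
unsigned int offset = 0;
m_Context->IASetVertexBuffers(0, 1, &positions, &stride, &offset);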
If you bind interleaved data, you will have some overhead due to the fact that the IA has to load with a stride.
When I tested this I got about a 20% gain using separate instead of interleaved buffers, which can be quite decent, but since this type of processing can be largely architecture dependent, don't take it for granted (tested on an NVidia 740M).
So simply put, profile (a lot), and check which gives you the best balance between your GPU and CPU loads.
Please also note that the Input Assembler overhead shrinks as your shader complexity grows: once you apply some heavy calculations, add some tessellation, and do some decent shading, the time difference between interleaved and non-interleaved becomes progressively meaningless.
You should stick with interleaved buffers. Any other technique will require some form of indirection to your non-duplicated position buffer, which will cost you performance and cache efficiency.

Is it possible to loop through a second VBO in the vertex shader?

So, let's say that I have two vertex buffers: one that describes the actual shape I want to draw, and another one that is able to influence the first.
So, what I actually want to be able to do is something like this:
uniform VBO second_one;  // pseudocode, not valid GLSL
void main()
{
    for (int i = 0; i < size_of_array(second_one); ++i)
    {
        // do things with second_one[i] to alter the values
    }
    // create the output information
}
Things I might want to do include gravity: each point in second_one drags the vertex a bit closer to it, and so on; then, after the point is adjusted, the matrices are applied to get its actual location.
I would be really surprised if exactly this were possible, or even something close to it. But the whole point is to be able to use a second VBO, or to make it available as a uniform, say of type vec3[], so I can access it.
For what you're wanting, you have three options.
An array of uniforms. GLSL lets you do uniform vec3 stuff[50];, and arrays in GLSL have a .length() method, so you can find out how big they are. Of course, there are limits to the number of uniforms you can use, but you shouldn't need more than 20-30 of these. Anything more than that and you'll really feel the performance drain.
Uniform buffer objects. These can store a bit more data than non-block uniforms, but they still have limits, and the storage comes from a buffer object. Accesses to them are, depending on the hardware, slightly slower than accesses to direct uniforms (see the sketch after this list).
Buffer textures. This is a way to attach a buffer object to a texture. With this, you can access vast amounts of memory from within a shader. But be warned: they're not fast to access. If you can make do with one of the above methods, do so.
Note that #2 and #3 will only be found on hardware capable of supporting GL 3.x and above. So DX10-class hardware.
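As a minimal sketch of option #2, assuming GLSL 3.30 and with illustrative names throughout (vec4 is used instead of vec3 to sidestep std140 array padding):
#version 330 core
// Uniform block backed by a buffer object; on the C side, bind it with
// glUniformBlockBinding() + glBindBufferBase(GL_UNIFORM_BUFFER, ...).
layout(std140) uniform Influencers {
    vec4 u_points[64];   // xyz = influencing point, w = strength
    int  u_count;        // number of entries actually used
};
uniform mat4 u_mvp;
in vec3 in_position;

void main()
{
    vec3 p = in_position;
    for (int i = 0; i < u_count; ++i)
        p += u_points[i].w * normalize(u_points[i].xyz - p); // toy "gravity" pull
    gl_Position = u_mvp * vec4(p, 1.0);
}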

GLSL per vertex fixed size array

Is it possible in desktop GLSL to pass a fixed size array of floats to the vertex shader as an attribute? If yes, how?
I want to have per vertex weights for character animation so I would like to have something like the following in my vertex shader:
attribute float weights[25];
How would I fill the attribute array from my C++ & OpenGL program? I have seen in another question that I could get the attribute location of the array attribute and then just add the index to that location. Could someone give an example of that for my pretty large array?
Thanks.
Let's start with what you asked for.
On pretty much no currently existing hardware will attribute float weights[25]; compile. While shaders can have arrays of attributes, each array index takes up a new attribute index. And on all hardware that currently exists, the maximum number of attribute indices is... 16. You'd need 25, and that's just for the weights.
Now, you can mitigate this easily enough by remembering that you can use vec4 attributes. Thus, you store every four array elements in a single attribute; your array becomes attribute vec4 weights[7];, which is doable. Your weight-fetching logic will have to change, of course.
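As for filling it from C++: each element of an attribute array occupies the next attribute location, so a sketch of the setup might look like this (assuming the 28 floats per vertex are tightly packed in one VBO):
GLint base = glGetAttribLocation(program, "weights[0]");
for (int i = 0; i < 7; ++i)
{
    glEnableVertexAttribArray(base + i);
    glVertexAttribPointer(base + i, 4, GL_FLOAT, GL_FALSE,
                          28 * sizeof(float),               /* stride: one vertex = 28 floats */
                          (void*)(i * 4 * sizeof(float)));  /* offset of array element i */
}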
Even so, you don't seem to be considering the ramifications of what this would actually mean for your vertex data. Each attribute represents a component of a vertex's data. Every vertex in a rendering call has the same amount of data; the contents of that data differ, but not how much there is.
In order to do what you're suggesting, every vertex in your mesh would need 25 floats describing the weights. Even if these were stored as normalized unsigned bytes, that's still 25 extra bytes of data per vertex at a minimum. That's a lot, especially considering that for the vast majority of vertices, most of these values will be 0. Even in the worst case, you'd be looking at maybe 6-7 bones affecting a single vertex.
The way skinning is generally done in vertex shaders is to limit the number of bones affecting a single vertex to four. This way, you don't use an array of attributes; you just use a single vec4 attribute for the weights. Of course, you also now need to say which bone is associated with which weight, so you have a second vec4 attribute that specifies the bone index for each weight.
This strikes a good balance. You only take up 2 extra attributes (which can be unsigned bytes in terms of size). And for the vast majority of vertices, you'll never even notice, because most vertices are only influenced by 1-3 bones. A few use 4, and fewer still use 5+. In those cases, you just cut off the lowest weights and recompute the others proportionately.
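A minimal vertex-shader sketch of that scheme, in the same legacy GLSL style as the question (the bone count and names are illustrative):
attribute vec4 boneWeights;  // up to four weights, normalized to sum to 1
attribute vec4 boneIndices;  // matching bone indices, stored as floats
uniform mat4 bones[64];      // bone matrix palette

void main()
{
    mat4 skin = boneWeights.x * bones[int(boneIndices.x)]
              + boneWeights.y * bones[int(boneIndices.y)]
              + boneWeights.z * bones[int(boneIndices.z)]
              + boneWeights.w * bones[int(boneIndices.w)];
    gl_Position = gl_ModelViewProjectionMatrix * (skin * gl_Vertex);
}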
Nicol Bolas already gave you an answer on how to restructure your task. You should do it, because processing 25 floats per vertex, probably through some quaternion multiplication, will waste a lot of good GPU processing power; most of the attributes for a vertex will amount to something close to an identity transform anyway.
However, for academic reasons, I'm going to tell you how to pass 25 floats per vertex. The key is not to use attributes for this, but to fetch the data from a buffer or a texture. The GLSL vertex shader stage has the built-in variable gl_VertexID, which carries the index of the currently processed vertex. With recent OpenGL you can access textures from the vertex shader as well. So you could have a texture of size vertex_count × 25 holding the values. In your vertex shader you can access them using the texelFetch function, e.g. texelFetch(param_buffer, ivec2(gl_VertexID, 3), 0);
When used for skeletal animation, this technique is often referred to as texture skinning. It should be used sparingly, though, as it's a real performance hog. But sometimes you can't avoid it, for example when implementing a facial animation system where you have to weight all the vertices across 26 muscles to accurately simulate a human face.