Various vertex formats and VBOs - OpenGL

I use a rendering loop as follows:
Orphan the data and map the buffer.
Record the command and write the generated vertex in the buffer.
Unmap the buffer.
Iterate over the commands that can change states, bind textures or draw.
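Roughly, the buffer handling looks like this (a simplified sketch; 'vbo', 'bufferSize' and the usage hint are illustrative):
glBindBuffer(GL_ARRAY_BUFFER, vbo);
// Orphan the old storage so the driver can hand back fresh memory
// without stalling on in-flight draws.
glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_STREAM_DRAW);
void* verts = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
// ... record commands and write the generated vertices into 'verts' ...
glUnmapBuffer(GL_ARRAY_BUFFER);
// ... iterate over the commands: change state, bind textures, draw ...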
At the moment I use a single interleaved vertex format (AoS) that has all the attributes any of my shaders can use.
struct OneSizeFitAllVertex
{
    float pos[3];
    float uv0[2];
    float uv1[2];
    float col[4];
};
When using a simpler shader that only uses the position and color, for example, I would only write the attributes I care about into the mapped memory, and the shader code would simply ignore all the unused attributes.
Because this feels wasteful, I am considering using a different vertex format for each of my shaders.
Simple objects rendered using the simple shader would use SimpleVertex:
struct SimpleVertex
{
    float pos[3];
    float col[4];
};
Multi-textured objects, on the other hand, would be rendered using the multitexture shader and use MultitextureVertex:
struct MultitextureVertex
{
    float pos[3];
    float uv0[2];
    float uv1[2];
};
How should I handle these different formats?
Should I write vertices of different formats into the same mapped buffer and change my AttribPointers before drawing? This would save some space.
Should I map a different buffer for each vertex format? Perhaps that's more efficient.
Or should I keep the 'one size fits all' vertex format? This is easier.
I am curious to learn what is the best practice in this situation.
Thanks.

There can be a lot of variations based on your underlying system architecture, but let's assume you're using a discrete GPU (e.g., AMD or NVIDIA) with dedicated graphics memory.
You don't mention if you interleave your attributes in an array of structures (AoS) (perhaps something like the following):
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
    ...
};
or group similar attributes together (commonly called a structure of arrays, or SoA):
struct VertexAttributes {
    float positions[N][3];
    float normals[N][3];
    float uv[N][2];
    ...
};
This is relevant given your buffer-mapping approach. When you map an entire buffer, the GPU likely needs to copy its version of the buffer to the CPU, which provides you the pointer to update the values. When you unmap the buffer, the driver will copy the buffer back to the GPU. With the AoS layout and your technique, you'll touch small subsections of the entire buffer, and since the GPU driver doesn't know which bits of memory you've updated, its only recourse is to copy the entire thing back to the GPU. Depending on the size, this can have a significant impact at several levels (poor CPU cache-read utilization, consuming lots of CPU-to-GPU bus bandwidth, etc.). Unfortunately, there aren't good alternatives if you're only updating a small fraction of your vertex attributes. However, if you're updating all of the attributes, this approach is okay (although it's often recommended to use glBufferSubData or similar commands, which save the read back from the GPU to the CPU).
Conversely, if you're using the SoA approach, mapping the entire buffer will cause similar problems, but the situation can be better. Since the values for a specific attribute are contiguous in memory, you can use something like glMapBufferRange to map only the memory you need (but again, I'd still recommend glBufferSubData for the previously stated reasons). Given your current scenario, this is what I'd recommend.
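For example, with the SoA layout above you can update just the uv block without touching positions or normals (a sketch; 'vbo', 'N' and 'newUVs' are illustrative names):
// Update only the uv block of the SoA buffer; the offset skips
// the positions and normals blocks that precede it.
GLintptr uvOffset = sizeof(float) * 3 * N    // positions block
                  + sizeof(float) * 3 * N;   // normals block
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferSubData(GL_ARRAY_BUFFER, uvOffset, sizeof(float) * 2 * N, newUVs);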

Related

Is it possible to construct vertex on GPU from a non-XYZ vertex buffer?

I'm writing a particle simulation where the logic is updated using Intel AVX. I'm using a SoA approach to maximize my "SIMD-friendliness" but I shuffle the particle position components into XYZ-format
when updating the vertex buffer.
Is it possible to exclude the shuffle part and simply pass the vertex data in
XXYYZZ-format and construct each vertex in a shader stage?
My first thought was using three vertex buffers with x, y and z components separated and construct each vertex using the same subscript index to access the x, y and z component of a vertex.
I'm aware that this is not the conventional way, but I would like to emphasize that this is just an experiment. Perhaps someone has some knowledge about this approach (if it is even possible) and/or could point me in the right direction? Perhaps there is a name for it as well?
Thank you!
There is no restriction on how you feed the GPU with your vertices. You can customize the input layout to read values from any number of vertex buffers; in your example, you will have at least three elements. In the vertex shader, you receive your three elements as three scalars and swizzle them back together. The only real limitation is that the values for a given vertex must be at the same index in each buffer.
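A sketch of what the vertex shader side could look like in GLSL (attribute locations and names are illustrative):
// Three separate vertex buffers, one per component; each is bound
// to its own single-float attribute.
layout(location = 0) in float in_x;
layout(location = 1) in float in_y;
layout(location = 2) in float in_z;

uniform mat4 u_mvp; // illustrative transform

void main()
{
    // "Swizzle" the scalars back into one position.
    gl_Position = u_mvp * vec4(in_x, in_y, in_z, 1.0);
}
On the CPU side, each buffer would get its own glVertexAttribPointer call with a single GL_FLOAT component.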
In regards to performance, unless you want to squeeze out the top 1% of the GPU's performance, you will see no difference compared to a well-interleaved vertex layout. This mostly influences bandwidth and L2 cache misses, so unless you have crazy millions of particles, it is unlikely to matter. And if you do, you can use a compute shader to interleave the data in a pre-process.

How do I deal with many variables per triangle in OpenGL?

I'm working with OpenGL and am not totally happy with the standard method of passing values PER TRIANGLE (or, in my case, per quad) that need to make it to the fragment shader, i.e., assigning them to each vertex of the primitive and passing them through the vertex shader, where they are presumably interpolated unnecessarily (unless you use the "flat" directive) before reaching the fragment shader (in other words, they are non-varying per fragment).
Is there some way to store a value PER triangle (or quad) that can be accessed in the fragment shader without redundant copies of it per vertex? If so, is this better than the likely overhead of moving 3x (or 4x) the data on the CPU side?
I am aware of using geometry shaders to spread the values out to new vertices, but I've heard geometry shaders are terribly slow on older hardware. Is this the case?
OpenGL fragment language supports the gl_PrimitiveID input variable, which will be the index of the primitive for the currently processed fragment (starting at 0 for each draw call). This can be used as an index into some data store which holds per-primitive data.
Depending on the amount of data that you will need per primitive, and the number of primitives in total, different options are available. For a small number of primitives, you could just set up a uniform array and index into that.
For a reasonably high number of primitives, I would suggest using a texture buffer object (TBO). This is basically an ordinary buffer object, which can be accessed read-only at random locations via the texelFetch GLSL operation. Note that TBOs are not really textures, they only reuse the existing texture object interface. Internally, it is still a data fetch from a buffer object, and it is very efficient with none of the overhead of the texture pipeline.
The only issue with this approach is that you cannot easily mix different data types. You have to define a base data type for your TBO, and every fetch will get you the data in that format. If you just need some floats/vectors per primitive, this is not a problem at all. If you e.g. need some ints and some floats per primitive, you could either use different TBOs, one for each type, or with modern GLSL (>=3.30), you could use an integer type for the TBO and reinterpret the integer bits as floating point with intBitsToFloat(), so you can get around that limitation, too.
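A sketch of the fragment-shader side of this approach (the sampler name and the meaning of the fetched data are illustrative):
// Fetch one vec4 of per-primitive data, indexed by gl_PrimitiveID.
// 'u_primitiveData' is a samplerBuffer backed by the TBO.
uniform samplerBuffer u_primitiveData;

out vec4 fragColor;

void main()
{
    vec4 perPrimitive = texelFetch(u_primitiveData, gl_PrimitiveID);
    fragColor = perPrimitive; // e.g., use it directly as a color
}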
You can use one element in the vertex array for rendering multiple vertices. It's called instanced vertex attributes.
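A sketch of that technique (names are illustrative); the divisor makes the attribute advance once per instance instead of once per vertex:
// One color per instance (e.g., per quad drawn as an instance).
glBindBuffer(GL_ARRAY_BUFFER, colorVBO);
glVertexAttribPointer(colorLoc, 4, GL_FLOAT, GL_FALSE, 0, (void*)0);
glEnableVertexAttribArray(colorLoc);
glVertexAttribDivisor(colorLoc, 1); // advance once per instance
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, quadCount);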

What is the purpose of OpenGL texture buffer objects?

We use buffer objects to reduce copy operations between the CPU and GPU, and with texture buffer objects we can switch a buffer object's target from vertex data to texture data. Is there any other advantage of texture buffer objects? Also, they do not allow filtering; is there any disadvantage to that?
A buffer texture is similar to a 1D texture but has a backing store that's not part of the texture object itself (in contrast to any other texture object); instead it is realized with an actual buffer object bound to TEXTURE_BUFFER. Using a buffer texture has several implications and, AFAIK, one use case that can't be mapped to any other type of texture.
Note that a buffer texture is not a buffer object - a buffer texture is merely associated with a buffer object using glTexBuffer.
By comparison, buffer textures can be huge. Table 23.53 and following of the core OpenGL 4.4 spec define a minimum maximum (i.e. the minimal value that implementations must provide) number of texels, MAX_TEXTURE_BUFFER_SIZE. The potential number of texels stored in your buffer object is computed as follows (as found in GL_ARB_texture_buffer_object):
floor(<buffer_size> / (<components> * sizeof(<base_type>)))
The resulting value clamped to MAX_TEXTURE_BUFFER_SIZE is the number of addressable texels.
Example:
You have a buffer object storing 4MiB of data. What you want is a buffer texture for addressing RGBA texels, so you choose an internal format RGBA8. The addressable number of texels is then
floor(4 MiB / (4 * sizeof(UNSIGNED_BYTE))) == 1024^2 texels == 2^20 texels
If your implementation supports this number, you can address the full range of values in your buffer object. The above isn't too impressive and can simply be achieved with any other texture on current implementations. However, the machine on which I'm writing this answer supports 2^28 == 268435456 texels.
With OpenGL 4.4 (and 4.3, and possibly earlier 4.x versions), MAX_TEXTURE_SIZE is 2^16 texels per 1D texture, so a buffer texture can still be four times as large. On my local machine I can allocate a 2GiB buffer texture (even larger, actually), but only a 1GiB 1D texture when using RGBA32F texels.
A use case for buffer textures is random (and atomic, if desired) read/write access (the latter via image load/store) to a large data store inside a shader. Yes, you can do random read access on arrays of uniforms inside one or multiple blocks, but it gets very tedious if you have to process a lot of data and have to work with multiple blocks. And even then, looking at the maximum combined size of all uniform components (where a single float component has a size of 4 bytes) in all uniform blocks for a single stage,
MAX_(stage)_UNIFORM_BLOCKS *
MAX_UNIFORM_BLOCK_SIZE +
MAX_(stage)_UNIFORM_COMPONENTS * 4
isn't really a lot of space to work with in a shader stage (depending on how large your implementation allows the above number to be).
An important difference between textures and buffer textures is that the data store, as a regular buffer object, can be used in operations where a texture simply does not work. The extension mentions:
The use of a buffer object to provide storage allows the texture data to
be specified in a number of different ways: via buffer object loads
(BufferData), direct CPU writes (MapBuffer), framebuffer readbacks
(EXT_pixel_buffer_object extension). A buffer object can also be loaded
by transform feedback (NV_transform_feedback extension), which captures
selected transformed attributes of vertices processed by the GL. Several
of these mechanisms do not require an extra data copy, which would be
required when using conventional TexImage-like entry points.
An implication of using buffer textures is that look-ups inside a shader can only be done via texelFetch. Buffer textures also aren't mip-mapped and, as you already mentioned, during fetches there is no filtering.
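For reference, setting up a buffer texture could look roughly like this (a sketch; 'data' and 'dataSize' are illustrative):
// Create the buffer object that provides the storage.
GLuint buf, tex;
glGenBuffers(1, &buf);
glBindBuffer(GL_TEXTURE_BUFFER, buf);
glBufferData(GL_TEXTURE_BUFFER, dataSize, data, GL_STATIC_DRAW);

// Create the buffer texture and associate it with the buffer,
// interpreting its contents as RGBA8 texels.
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_BUFFER, tex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA8, buf);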
Addendum:
Since OpenGL 4.3, we have what is called a Shader Storage Buffer. These too provide random (atomic) read/write access to a large data store, but they don't need to be accessed with texelFetch() or image load/store functions as is the case for buffer textures. Using buffer textures also implies having to deal with gvec4 return values, both with texelFetch() and imageLoad() / imageStore(). This becomes very tedious as soon as you want to work with structures (or arrays thereof), and you don't want to think up some awkward packing scheme using multiple instances of vec4 or multiple buffer textures to achieve something similar. With a buffer accessed as shader storage, you can simply index into the data store and pull one or more instances of some struct {} directly from the buffer.
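For example, a shader storage block can be declared like this (the struct layout and names are illustrative):
// Pull whole structs straight from a shader storage buffer;
// no gvec4 unpacking is needed.
struct Particle {
    vec4 position;
    vec4 velocity;
};

layout(std430, binding = 0) buffer ParticleBuffer {
    Particle particles[];
};

void main()
{
    Particle p = particles[gl_VertexID];
    gl_Position = p.position;
}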
Also, since they are very similar to uniform blocks, using them should be fairly straightforward: if you know how to use uniform buffers, it isn't a long way to learning how to use shader storage buffers.
It's also absolutely worth browsing the Issues section of the corresponding ARB extension.
Performance Implications
Daniel Rakos did some performance analysis years ago, both as a comparison of uniform buffers and buffer textures, and also on a slightly more general note based on information from AMD's OpenCL programming guide. There is now a very recent version of that guide, specifically targeting OpenCL optimization on AMD platforms.
There are many factors influencing performance:
access patterns and resulting caching behavior
cache line sizes and memory layout
what kind of memory is accessed (registers, local, global, L1/L2 etc.) and its respective memory bandwidth
how well memory fetching latency is hidden by doing something else in the meantime
what kind of hardware you're on, i.e. a dedicated graphics card with dedicated memory or some unified memory architecture
etc., etc.
As always when worrying about performance: implement something that works and see if that solution is fast enough for your needs. Otherwise, implement two or more approaches to solving the problem, profile them, and compare.
Also, vendor-specific guides can offer a great deal of insight. The above-mentioned OpenCL user and optimization guides provide a high-level architectural perspective and specific hints on how to optimize your CL kernels - stuff that's also relevant when developing shaders.
One use case I have found was to store per-primitive attributes (accessed in the fragment shader with the help of gl_PrimitiveID) while still maintaining unique vertices in the indexed mesh.

How does interleaved vertex submission help performance?

I have read and seen other questions that all generally point to the suggestion to interleave vertex positions, colors, etc. into one array, as this minimizes the data that gets sent from CPU to GPU.
What I'm not clear on is how OpenGL does this when, even with an interleaved array, you must still make separate GL calls for position and color pointers. If both pointers use the same array, just set to start at different points in that array, does the draw call not copy the array twice since it was the object of two different pointers?
This is mostly about the cache. For example, imagine we have 4 vertices and 4 colors. You can provide the information this way (excuse me, but I don't remember the exact function names):
glVertexPointer(..., vertex);
glColorPointer(..., colors);
What it does internally is read vertex[0], then apply colors[0], then vertex[1] with colors[1], and so on. As you can see, if the vertex array is, for example, 20 megabytes long, vertex[0] and colors[0] will be, to say the least, 20 megabytes apart from each other.
Now, on the other hand, if you provide a structure like { vertex0, color0, vertex1, color1, etc.} there will be a lot of cache hits because, well, vertex0 and color0 are together, and so are vertex1 and color1.
Hope this helps answer the question
edit: on second read, I may not have answered the question. You might be wondering how OpenGL knows which values to read from that structure. Like I said before, with a structure such as { vertex, color, vertex, color } you tell OpenGL that vertex data starts at position 0 with a stride of 2 (so the next one will be at position 2, then 4, etc.) and color data starts at position 1, also with a stride of 2 (so position 1, then 3, etc.).
addition: In case you want a more practical example, look at this link: http://www.lwjgl.org/wiki/index.php?title=Using_Vertex_Buffer_Objects_(VBO). You can see there how it provides the buffer only once and then uses offsets to render efficiently.
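In modern terms, the stride/offset setup for an interleaved { position, color } layout could look like this (a sketch; attribute locations are illustrative):
// Interleaved position (vec3) + color (vec4) in a single VBO.
// Needs <cstddef> for offsetof.
struct Vertex { float pos[3]; float col[4]; };

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, pos));
glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, col));
glEnableVertexAttribArray(0);
glEnableVertexAttribArray(1);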
I suggest reading: Vertex_Specification_Best_Practices
h4lc0n provided quite a nice explanation, but I would like to add some additional info:
interleaved data can actually hurt performance when your data changes often. For instance, when you change the positions of point sprites, you update POS, but COLOR and TEXCOORD usually stay the same. When the data is interleaved, you must "touch" that additional data anyway. In that case it would be better to have one VBO for POS only (or, in general, for the data that changes often) and a second VBO for the data that is constant (see the sketch after this list).
it is not easy to give strict rules about VBO layout, since it is very vendor/driver specific. Also, your usage may differ from others'. In general, you need to run some benchmarks for your particular test cases.
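A sketch of that split (names and locations are illustrative): the position VBO is re-uploaded every frame, while the static VBO is uploaded once:
// Dynamic attribute: positions, updated every frame.
glBindBuffer(GL_ARRAY_BUFFER, posVBO);
glBufferSubData(GL_ARRAY_BUFFER, 0, numVerts * 3 * sizeof(float), positions);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);

// Static attributes: interleaved COLOR (4 floats) + TEXCOORD (2 floats).
glBindBuffer(GL_ARRAY_BUFFER, staticVBO);
glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, 6 * sizeof(float), (void*)0);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 6 * sizeof(float),
                      (void*)(4 * sizeof(float)));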
You could also make an argument for separating different attributes. Assuming a GPU does not process one vertex after another but rather a bunch of them (e.g. 16) in parallel, you would get something like this while executing a vertex shader:
read attribute A for all 16 vertices
perform some computations
read attribute B for all 16 vertices
perform some more computations
....
So you read one attribute for many vertices at once. From this reasoning it would seem that interleaving the attributes actually hurts performance. Of course, this would only be visible if you are either bandwidth-constrained or if the memory latency cannot be hidden for some reason (e.g. a complex shader that requires many registers will reduce the number of vertices that can be in flight at a given time).

Is it possible to loop through a second VBO in the vertex shader?

So, let's say that I have two vertex buffers: one that describes the actual shape I want to draw, and another one that is able to influence the first one.
So, what I actually want to be able to do is something like this:
uniform VBO second_one; // pseudocode: GLSL has no such type

void main()
{
    for (int i = 0; i < size_of_array(second_one); ++i)
    {
        // do things with second_one[i] to alter the values
    }
    // create the output information
}
One thing I might want to do is gravity, where each point in second_one drags the points of the shape a bit closer to it, and then, after a point has been adjusted, the matrices are applied to get its actual location.
I would be really surprised if this were possible, or even something close to it. But the whole point is to be able to use a second VBO, or to expose it as a uniform array of vec3, say, so I can access it.
For what you're wanting, you have three options.
An array of uniforms. GLSL lets you do uniform vec3 stuff[50];, and arrays in GLSL have a .length() method, so you can find out how big they are. Of course, there are limits on the number of uniforms you can use, but you shouldn't need more than 20-30 of these. Anything more than that and you'll really feel the performance drain.
Uniform buffer objects. These can store a bit more data than non-block uniforms, but they still have limits. And the storage comes from a buffer object. But accesses to them are, depending on hardware, slightly slower than accesses to direct uniforms.
Buffer textures. This is a way to attach a buffer object to a texture. With this, you can access vast amounts of memory from within a shader. But be warned: they're not fast to access. If you can make do with one of the above methods, do so.
Note that #2 and #3 will only be found on hardware capable of supporting GL 3.x and above. So DX10-class hardware.
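A sketch of what options 1 and 3 look like on the GLSL side (names, sizes and the use of the data are illustrative):
// Option 1: a plain uniform array - small but cheap to access.
uniform vec3 influences[50];

// Option 3: a buffer texture - large but slower to access.
uniform samplerBuffer u_bigInfluences;
uniform int u_bigCount; // element count, passed in separately

void main()
{
    vec3 offset = vec3(0.0);
    // Option 1: sized arrays report their length directly.
    for (int i = 0; i < influences.length(); ++i)
        offset += influences[i];
    // Option 3: random access via texelFetch.
    for (int i = 0; i < u_bigCount; ++i)
        offset += texelFetch(u_bigInfluences, i).xyz;
    gl_Position = vec4(offset, 1.0); // illustrative use
}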