Shader Storage Blocks vs Uniform Blocks - C++

I've read that you can't write to uniform blocks, so shader storage blocks have an advantage over uniform blocks. Furthermore, the upper size limit of a shader storage block is much higher.
What I don't get is the atomic-operations capability of a shader storage block: when can this come in handy? Is there a real-life example?
Furthermore, when would I prefer one over the other?

I think your question is ill-posed. It sounds like you are trying to figure out the difference between uniform buffers and shader storage buffers. Blocks are simply a way to organize your shader inputs and outputs.
As you noted, the biggest difference between uniform buffers and shader storage buffers is that you can write to shader storage buffers from your shader programs.
Asking why writing to an SSBO is handy is like asking why a variable is handy. Any time you want to accumulate results or share data between render passes, you can use the SSBO as "scratch memory".
In the old days (I believe) you would have had to render to a texture if you wanted to share data, and that would have gone through the entire graphics pipeline with all the cost that entails.
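As a concrete illustration of the atomic-operations part of the question, here is a minimal sketch (the shader source and names are illustrative, not from the original posts): a fragment shader that accumulates a luminance histogram in an SSBO. Many fragments may hit the same bin concurrently, which is exactly where SSBO atomics become handy.

const char* fragmentSrc = R"(#version 430
// SSBO shared by all fragment shader invocations in the draw call.
layout(std430, binding = 0) buffer Histogram {
    uint bins[256];
};
in vec3 color;
out vec4 fragColor;
void main() {
    float luma = dot(color, vec3(0.2126, 0.7152, 0.0722));
    uint bin = uint(clamp(luma, 0.0, 1.0) * 255.0);
    atomicAdd(bins[bin], 1u);   // safe concurrent accumulation
    fragColor = vec4(color, 1.0);
}
)";

Without atomicAdd, two invocations incrementing the same bin could race and lose counts.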
More about shader storage buffer objects here:
https://www.opengl.org/wiki/Shader_Storage_Buffer_Object
To really make sure you understand the difference, look up the various ways to supply shaders with data in chronological order:
Textures & Frame Buffer Objects
Uniforms
Uniform Buffer Objects
Texture Buffer Objects
Textures with Image Load/Store
Shader Storage Buffer Objects
This answer on SO has a nice overview of almost all of them:
Passing a list of values to fragment shader

Related

Should Uniform Buffer Objects be used over glUniform when uniform block size is small?

I've recently replaced all usage of glUniform() in my code with Uniform Buffer Objects (UBOs) to better organize my object rendering code. However, my performance has sharply plummeted as the number of objects drawn increases (an object in my case is anything with its own offset into a UBO; only one UBO is shared between everything that uses the same shader program).
I think this may be because my objects actually use very little uniform data, sometimes as little as one vec4, while a UBO requires each offset to be a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which can be rather large.
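To make the alignment constraint concrete, here is a minimal sketch of the arithmetic involved (ubo and bindingPoint are hypothetical placeholders, not from the original post):

GLint align = 0;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);

// Even a single vec4 (16 bytes) of per-object data occupies a full
// alignment slot, which is often 256 bytes on desktop GPUs.
GLsizeiptr objectDataSize = 4 * sizeof(float);                  // one vec4
GLsizeiptr stride = ((objectDataSize + align - 1) / align) * align;

// Object i then lives at offset i * stride:
// glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo, i * stride, objectDataSize);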
I understood UBOs to be a strict upgrade over glUniform(); am I mistaken? Should I keep a glUniform() path for these simple objects, or is there a necessary step to using UBOs effectively with many different simple objects that I haven't taken?
My scene is mostly quads, with very few objects using more than 4 vertices and 6 indices. Some of these quads have large uniform blocks, while most have very simple ones, maybe one or two vec4s.
(For now, I've gotten around this problem by embedding the object data in the vertices, though it definitely feels wrong to do it this way.)

Shader storage buffer object slow – alternative?

I'm trying to make an array of vec3 available to a fragment shader. In the targeted application, there could be several hundred elements.
I tested transferring data in the form of a shader storage buffer object, declared as
layout(binding = 0) buffer voxels { vec3 xyz[]; };
and set using glBufferData, but I found that my fragment shader becomes very slow, even with only 33 elements.
Moreover, when I convert the same data into GLSL code as a const vec3[] and include it in the shader source, the shader becomes noticeably faster.
Is there a better way – faster than an SSBO and more elegant than creating shader code?
As might already be apparent from the above, the array is only read from in the shader. It is constant within the shader as well as across shader invocations for different fragments (so effectively a uniform), and it is set only once or a few times over the runtime of the program.
I'd recommend using the std430 layout qualifier on the SSBO given that you are using vec3 data types; otherwise you'll be forced to pad the data, which isn't going to be great. In general, if the buffer is a fixed size, prefer glBufferSubData over glBufferData (the latter may reallocate memory on the GPU).
As yet another alternative, if you are able to target GL 4.4+, consider using glBufferStorage instead (or better still, if GL 4.5 is available, use glCreateBuffers and glNamedBufferStorage). This lets you pass a few more hints to the GL driver about the way in which the buffer will be consumed. I'd try out a few options (e.g. mapping vs. sub-data vs. recreating each time).
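A minimal sketch of the GL 4.5 path described above (dataSizeInBytes and dataPtr are placeholders; the exact storage flags you want depend on your update pattern):

GLuint ssbo = 0;
glCreateBuffers(1, &ssbo);

// Immutable storage: the size is fixed once, and updates go through
// SubData instead of reallocating (which glBufferData may trigger).
glNamedBufferStorage(ssbo, dataSizeInBytes, nullptr, GL_DYNAMIC_STORAGE_BIT);

// Upload (and later re-upload) without reallocating GPU memory:
glNamedBufferSubData(ssbo, 0, dataSizeInBytes, dataPtr);

// Attach to the binding point declared as layout(binding = 0) in the shader:
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);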

Why samplers cannot be part of uniform blocks in OpenGL, and any ways to get around it?

I want to render a scene to texture and share the texture sampler across several programs, similar to sharing the projection-view matrix across multiple programs. But unlike the projection-view matrix, which can be put into a uniform block, "Samplers cannot be part of uniform blocks". https://www.opengl.org/wiki/Uniform_(GLSL)
Some discussion here describes why:
multiple issues:
GLSL samplers embody two things: a source of texture data and how to filter from it (nearest, linear, mipmapping, anisotropic, etc.)
Samplers are opaque things, and to place them into a uniform buffer object requires that they have a well-defined (and cross-platform) size.
Those issues together make it dicey.
I want to see some explanation of why this cannot be handled, and any other ways to achieve sharing texture samplers.
The explanation is right here:
Samplers are opaque things, and to place them into a uniform buffer object requires that they have a well-defined (and cross-platform) size.
The purpose of uniform blocks is that you can set them with a single OpenGL call from a single structure in your (client) program. But for this to work, the memory layout of this structure must be known, so that the memory layout produced by the compiler matches that of the shader's uniform block.
But since the memory layout of samplers is not defined, a memory layout cannot be determined. Without a definitive memory layout there is no structure, and without a structure there is no block.
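As for sharing the texture across programs in practice, a common workaround is to share the texture unit rather than the sampler uniform itself. A minimal sketch, assuming a render-to-texture result in sceneColorTexture and a container programsUsingTheTexture (both hypothetical names):

// Bind the shared texture to one fixed unit...
const GLint kSceneUnit = 5;
glActiveTexture(GL_TEXTURE0 + kSceneUnit);
glBindTexture(GL_TEXTURE_2D, sceneColorTexture);

// ...then point each program's sampler uniform at that unit.
for (GLuint prog : programsUsingTheTexture) {
    glUseProgram(prog);
    glUniform1i(glGetUniformLocation(prog, "uScene"), kSceneUnit);
}

With GLSL 4.2+, the per-program glUniform1i calls can be replaced by declaring layout(binding = 5) uniform sampler2D uScene; in each shader.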

How do I deal with many variables per triangle in OpenGL?

I'm working with OpenGL and am not totally happy with the standard method of passing values PER TRIANGLE (or, in my case, per quad) that need to make it to the fragment shader: assign them to each vertex of the primitive and pass them through the vertex shader, where they are presumably interpolated unnecessarily in the fragment shader (unless the "flat" qualifier is used); in other words, they are non-varying per fragment.
Is there some way to store a value PER TRIANGLE (or quad) that needs to be accessed in the fragment shader, without redundant copies of it per vertex? If so, is this way better than the likely overhead of moving 3x (or 4x) the data on the CPU side?
I am aware of using geometry shaders to spread the values out to new vertices, but I've heard geometry shaders are terribly slow on older hardware. Is this the case?
The OpenGL fragment language supports the gl_PrimitiveID input variable, which holds the index of the primitive for the currently processed fragment (starting at 0 for each draw call). This can be used as an index into some data store which holds per-primitive data.
Depending on the amount of data that you will need per primitive, and the number of primitives in total, different options are available. For a small number of primitives, you could just set up a uniform array and index into that.
For a reasonably high number of primitives, I would suggest using a texture buffer object (TBO). This is basically an ordinary buffer object, which can be accessed read-only at random locations via the texelFetch GLSL operation. Note that TBOs are not really textures, they only reuse the existing texture object interface. Internally, it is still a data fetch from a buffer object, and it is very efficient with none of the overhead of the texture pipeline.
The only issue with this approach is that you cannot easily mix different data types. You have to define a base data type for your TBO, and every fetch will get you the data in that format. If you just need some floats/vectors per primitive, this is not a problem at all. If you need, e.g., some ints and some floats per primitive, you could either use different TBOs, one for each type, or, with modern GLSL (>= 3.30), use an integer type for the TBO and reinterpret the integer bits as floating point with intBitsToFloat(), so you can get around that limitation, too.
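Putting the two pieces together, here is a minimal sketch (numTriangles and perTriangleData are placeholders): one RGBA32F texel per triangle on the host side, fetched with gl_PrimitiveID on the shader side.

// Host side: back a buffer texture with one vec4 per triangle.
GLuint tbo = 0, tboTex = 0;
glGenBuffers(1, &tbo);
glBindBuffer(GL_TEXTURE_BUFFER, tbo);
glBufferData(GL_TEXTURE_BUFFER, numTriangles * 4 * sizeof(float),
             perTriangleData, GL_STATIC_DRAW);
glGenTextures(1, &tboTex);
glBindTexture(GL_TEXTURE_BUFFER, tboTex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo);

// Fragment shader side: one texelFetch per fragment, no interpolation.
const char* fsSnippet = R"(
uniform samplerBuffer perTriangle;
void main() {
    vec4 data = texelFetch(perTriangle, gl_PrimitiveID);
    // ... use data ...
}
)";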
You can use one element in the vertex array for rendering multiple vertices; this is called instanced vertex attributes.
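A brief sketch of that mechanism (attribute index 3 and perInstanceBuffer are arbitrary placeholders): with a divisor of 1, the attribute advances once per instance rather than once per vertex, so a single array element can feed all vertices of an instanced quad.

glBindBuffer(GL_ARRAY_BUFFER, perInstanceBuffer);
glVertexAttribPointer(3, 4, GL_FLOAT, GL_FALSE, 0, nullptr);
glEnableVertexAttribArray(3);
glVertexAttribDivisor(3, 1);   // advance attribute 3 once per instance

// Each quad is then drawn as one instance, e.g. via glDrawElementsInstanced.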

What is the purpose of OpenGL texture buffer objects?

We use buffer objects to reduce copy operations between the CPU and GPU, and with texture buffer objects we can change a buffer object's target from vertex data to texture data. Is there any other advantage of texture buffer objects? Also, they do not allow filtering; is there any disadvantage to this?
A buffer texture is similar to a 1D texture, but it has a backing buffer store that's not part of the texture object (in contrast to any other texture object); instead it is realized with an actual buffer object bound to TEXTURE_BUFFER. Using a buffer texture has several implications and, AFAIK, one use case that can't be mapped to any other type of texture.
Note that a buffer texture is not a buffer object - a buffer texture is merely associated with a buffer object using glTexBuffer.
By comparison, buffer textures can be huge. Table 23.53 and following of the core OpenGL 4.4 spec define a minimum maximum (i.e. the minimal value that implementations must provide) for the number of texels, MAX_TEXTURE_BUFFER_SIZE. The potential number of texels being stored in your buffer object is computed as follows (as found in GL_ARB_texture_buffer_object):
floor(<buffer_size> / (<components> * sizeof(<base_type>)))
The resulting value clamped to MAX_TEXTURE_BUFFER_SIZE is the number of addressable texels.
Example:
You have a buffer object storing 4MiB of data. What you want is a buffer texture for addressing RGBA texels, so you choose an internal format RGBA8. The addressable number of texels is then
floor(4MiB / (4 * sizeof(UNSIGNED_BYTE))) == 1024^2 texels == 2^20 texels
If your implementation supports this number, you can address the full range of values in your buffer object. The above isn't too impressive and can simply be achieved with any other texture on current implementations. However, the machine on which I'm writing this answer supports 2^28 == 268435456 texels.
With OpenGL 4.4 (and 4.3, and possibly earlier 4.x versions), MAX_TEXTURE_SIZE is 2^16 texels per 1D texture, so a buffer texture can still be four times as large. On my local machine I can allocate a 2GiB buffer texture (even larger, actually), but only a 1GiB 1D texture when using RGBA32F texels.
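Since the guaranteed minimum is much smaller than what many implementations expose, it's worth querying the actual limit at runtime; a minimal sketch:

GLint maxTexels = 0;
glGetIntegerv(GL_MAX_TEXTURE_BUFFER_SIZE, &maxTexels);
// maxTexels is the largest addressable texel count for a buffer texture
// on this implementation; the spec only guarantees the minimum.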
A use case for buffer textures is random (and atomic, if desired) read/write access (the latter via image load/store) to a large data store inside a shader. Yes, you can do random read access on arrays of uniforms inside one or multiple blocks, but it gets very tedious if you have to process a lot of data and have to work with multiple blocks, and even then, looking at the maximum combined size of all uniform components (where a single float component has a size of 4 bytes) in all uniform blocks for a single stage,
MAX_(stage)_UNIFORM_BLOCKS *
MAX_UNIFORM_BLOCK_SIZE +
MAX_(stage)_UNIFORM_COMPONENTS * 4
isn't really a lot of space to work with in a shader stage (depending on how large your implementation allows the above number to be).
An important difference between textures and buffer textures is that the data store, as a regular buffer object, can be used in operations where a texture simply does not work. The extension mentions:
The use of a buffer object to provide storage allows the texture data to
be specified in a number of different ways: via buffer object loads
(BufferData), direct CPU writes (MapBuffer), framebuffer readbacks
(EXT_pixel_buffer_object extension). A buffer object can also be loaded
by transform feedback (NV_transform_feedback extension), which captures
selected transformed attributes of vertices processed by the GL. Several
of these mechanisms do not require an extra data copy, which would be
required when using conventional TexImage-like entry points.
An implication of using buffer textures is that look-ups inside a shader can only be done via texelFetch. Buffer textures also aren't mip-mapped and, as you already mentioned, during fetches there is no filtering.
Addendum:
Since OpenGL 4.3, we have what is called a Shader Storage Buffer. These too provide random (atomic) read/write access to a large data store, but they don't need to be accessed with texelFetch() or image load/store functions as is the case for buffer textures. Using buffer textures also implies having to deal with gvec4 return values, both with texelFetch() and imageLoad() / imageStore(). This becomes very tedious as soon as you want to work with structures (or arrays thereof) and you don't want to think up some packing scheme using multiple instances of vec4 or multiple buffer textures to achieve something similar. With a buffer accessed as shader storage, you can simply index into the data store and pull one or more instances of some struct {} directly from the buffer.
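A small sketch of that struct-indexing convenience (the Particle layout is illustrative):

const char* glslSnippet = R"(
struct Particle {
    vec4 position;
    vec4 velocity;
};
layout(std430, binding = 1) buffer Particles {
    Particle particles[];   // unsized array: length comes from the buffer size
};
void integrate(uint i) {
    // Direct struct access; no manual gvec4 unpacking needed.
    particles[i].position += particles[i].velocity;
}
)";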
Also, since they are very similar to uniform blocks, using them should be fairly straightforward: if you know how to use uniform buffers, you don't have far to go to learn how to use shader storage buffers.
It's also absolutely worth browsing the Issues section of the corresponding ARB extension.
Performance Implications
Daniel Rakos did some performance analysis years ago, both as a comparison of uniform buffers and buffer textures, and also on a more general note based on information from AMD's OpenCL programming guide. There is now a very recent version specifically targeting OpenCL optimization on AMD platforms.
There are many factors influencing performance:
access patterns and resulting caching behavior
cache line sizes and memory layout
what kind of memory is accessed (registers, local, global, L1/L2 etc.) and its respective memory bandwidth
how well memory fetching latency is hidden by doing something else in the meantime
what kind of hardware you're on, i.e. a dedicated graphics card with dedicated memory or some unified memory architecture
etc., etc.
As always when worrying about performance: implement something that works and see if that solution is fast enough for your needs. Otherwise, implement two or more approaches to solving the problem, profile them, and compare.
Also, vendor-specific guides can offer a great deal of insight. The above-mentioned OpenCL user and optimization guides provide a high-level architectural perspective and specific hints on how to optimize your CL kernels, stuff that's also relevant when developing shaders.
One use case I have found was storing per-primitive attributes (accessed in the fragment shader with the help of gl_PrimitiveID) while still maintaining unique vertices in the indexed mesh.