Behavior of uniforms after glUseProgram() and speed - c++

How fast is glUseProgram()? Is there anything better (faster)?
Here are my thoughts:
1. Use one universal shader program, but with many input settings and attributes (settings for each graphics class)
2. Use a separate shader program for each graphics class
What state are uniforms in after changing the shader program? Do they save values (for example, values of matrices)?
Here is what I consider the benefit of #1 to be:
It avoids repeated glUseProgram() calls
And the benefits of #2:
No matrix changes (for example, if class Menu and class Scene3D have different Projection matrices)

Which of the two options is better largely depends on what those shaders do, how different they are, how many attributes/uniforms you set, and how often they change. There is no one right answer for all cases.
That said: keep in mind that there is not only the cost of state changes, but also a shader runtime cost, and that cost is paid per vertex and per fragment. So keeping the complexity of the shader low is always a good idea, and a universal shader is more complex than specialised ones.
Minimize state changes. If you have objects A, C, E using program X and B, D, F using program Y then, all else being equal, render them in the order ACEBDF, not ABCDEF.
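For illustration, a minimal sketch of that grouping, assuming a GL loader header (e.g. glad) is already included; the Drawable struct and the draw parameters are hypothetical:

#include <algorithm>
#include <vector>

struct Drawable {
    GLuint program;  // shader program this object is rendered with
    GLuint vao;      // its geometry
};

void renderAll(std::vector<Drawable>& drawables) {
    // Sort so that objects sharing a program are adjacent: ACEBDF, not ABCDEF.
    std::sort(drawables.begin(), drawables.end(),
              [](const Drawable& a, const Drawable& b) { return a.program < b.program; });

    GLuint current = 0;
    for (const Drawable& d : drawables) {
        if (d.program != current) {  // switch programs only at group boundaries
            glUseProgram(d.program);
            current = d.program;
        }
        glBindVertexArray(d.vao);
        glDrawArrays(GL_TRIANGLES, 0, 36);  // hypothetical vertex count
    }
}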
Regarding the last question: programs retain their state, and thus the values of their uniforms, over their lifetime, unless you relink them. But uniforms are per-program state, which means that if you have two uniforms with the same name and type in different programs, values won't carry over from one program to another.
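A minimal sketch of that behaviour (program handles, the uniform name, and the matrix variable are hypothetical):

glUseProgram(programA);
glUniformMatrix4fv(glGetUniformLocation(programA, "uProjection"), 1, GL_FALSE, &proj[0][0]);

glUseProgram(programB);  // programA's uniform values are untouched by this
// ... draw with programB ...

glUseProgram(programA);  // "uProjection" still holds the matrix set above; no
                         // re-upload is needed unless the value changed or
                         // programA was relinked.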

Related

Is there a faster alternative to geometry shaders that can render points as a specific number of triangles?

I'm currently using OpenGL with a geometry shader to take points and convert them to triangles during rendering.
I have n lists of points; each point in the i-th list is rendered as i triangles (each point in the first list becomes one triangle, each point in the second becomes two, and so on). I've tried swapping geometry shaders for each of these lists, with max_vertices set to the minimum needed for that list. With OpenGL I seemingly have no control over how the geometry shader is ultimately implemented on the GPU, and some drivers seem to handle it very slowly while others are very fast.
Is there any way to perform this specific task optimally, ideally taking advantage of the fact that I know the exact number of desired output triangles per element and in total? I would be happy to use some alternative to geometry shaders for this if possible. I would also be happy to try Vulkan if it can do the trick.
What you want is arbitrary amplification of geometry: taking one point primitive and producing arbitrarily many entirely separate primitives from it. And the tool GPUs have for that is geometry shaders (or using a compute shader to generate your vertex data manually, but that's probably not faster and definitely consumes more memory).
While GS's are not known for performance, there is one way you might be able to speed up what you're doing. Since every input primitive in a particular draw call generates the same, known number of output primitives, you can avoid having each GS invocation output more than one primitive by employing vertex instanced rendering.
Here, you use glDrawArraysInstanced. Your VS needs to pass gl_InstanceID to the GS, which can use it to figure out which triangle to generate from the vertex. That is, instead of having a loop over n to generate n triangles, the GS only generates one triangle. But it gets invoked instanceCount times, and each invocation generates the gl_InstanceID-th triangle.
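A sketch of the idea, with hypothetical shader sources; the actual per-point triangle computation is elided:

// Vertex shader: forward the instance index to the GS.
const char* vsSource = R"(
    #version 330 core
    layout(location = 0) in vec3 aPoint;
    flat out int vInstance;
    void main() {
        vInstance = gl_InstanceID;        // which triangle of this point to emit
        gl_Position = vec4(aPoint, 1.0);
    }
)";

// Geometry shader: emit exactly one triangle per invocation.
const char* gsSource = R"(
    #version 330 core
    layout(points) in;
    layout(triangle_strip, max_vertices = 3) out;
    flat in int vInstance[];
    void main() {
        // ... compute the vInstance[0]-th triangle for this point, call
        //     EmitVertex() three times, then EndPrimitive() ...
    }
)";

// One instance per output triangle: a list whose points each need k
// triangles is drawn with an instance count of k.
glDrawArraysInstanced(GL_POINTS, 0, pointCount, k);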
Now, one downside of this is that the order of triangles generated will be different. In your original GS code, where each GS generates all of the triangles from a point, all of the triangles from one point will be rendered before rendering any triangles from another point. With vertex instancing, you get one triangle from all of the points, then it produces another triangle from all the points, etc. If rendering order matters to you, then this won't work.
If that's important, then you can try geometry shader instancing instead. This works similarly to vertex instancing, except that the instance count is part of the GS. Each GS invocation is only responsible for a single triangle, and you use gl_InvocationID to decide which triangle to use it on. This will ensure that all primitives from one set of GS instances will be rendered before any primitives from a different set of GS instances.
The downside is what I said: the instance count is part of the GS. Unlike instanced rendering, the number of instances is baked into the GS code itself. So you will need a separate program for every count of triangles you work with. SPIR-V specialization constants make it a bit easier on you to build those programs, but you still need to maintain (and swap) multiple programs.
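A sketch of an instanced GS, assuming GL 4.0+ (the count of 8 is an arbitrary example, and as described it is baked into the shader text):

const char* gsSource = R"(
    #version 400 core
    layout(points, invocations = 8) in;   // 8 GS instances per input point
    layout(triangle_strip, max_vertices = 3) out;
    void main() {
        // ... emit the gl_InvocationID-th of this point's 8 triangles ...
    }
)";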
Also, while instanced rendering has no limit on the number of instances, GS's do have a limit. And that limit can be as small as 32 (which is a very popular number).

How to send my model matrix only once per model to shaders

For reference, I'm following this tutorial. Now suppose I have a little application with multiple types of models. If I understand correctly, I have to send my MVP matrix from the CPU to the GPU (in other words, to my vertex shader) for each model, because each model might have a different model matrix from one to another.
Now looking at the tutorial and this post, I understand that the call to send the matrix to my shader (glUniformMatrix4fv(myMatrixID, 1, GL_FALSE, &myModelMVP[0][0])) should be done for each frame and for each model since each time it overwrites the previous value of my MVP (the one for my last model). But, being concerned about the performance of my app, I don't want to send useless data through the bus and if I understand correctly, my model matrix is constant for each model.
I'm thinking about having a uniform for each model's MVP matrix, but I think that is not scalable, and I would also have to update all of them if my view or projection matrix changed... Is there a way to avoid sending my model matrices multiple times, and to send my view and projection matrices only when they change?
These are essentially two questions: how to avoid sending data when only part of a transformation sequence changes, and how to efficiently supply per-model data which may or may not have changed since the last frame.
Transformation Sequence
For the first, you have a transformation sequence. Your positions are in model space. You then conceptually transform them into world space, then to camera/view space, then finally to clip space, where you write the position to gl_Position.
Most of these transformations are constant throughout a frame, but may change on a frame-to-frame basis. So you want to avoid changing data that doesn't strictly need to be changed.
If you want to do this, then clearly you cannot provide an "MVP" matrix. That is, you should not have a single matrix that contains the whole transformation. You should instead have a matrix that represents particular parts of the transformation.
However, you will need to do this decomposition for reasons other than performance. You cannot do many lighting operations in clip space; being non-linear, it breaks the math those operations rely on. Therefore, if you're going to do lighting at all, you need a transformation that stops before clip space.
Camera/view space is the most common stopping point for lighting computations.
Now, if you use model-to-camera and camera-to-clip, then the model-to-camera matrix for every model will change when the camera changes, even if the model itself has not moved. And therefore, you may need to upload a bunch of matrices that don't strictly need to be changed.
To avoid that, you would need to use model-to-world and world-to-clip (in this case, you do your lighting in world space). The issue here is that you are exposed to the perils of world space: numerical precision may become problematic.
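As a sketch, a vertex shader using the model-to-world / world-to-clip split might look like this (uniform and variable names are hypothetical):

const char* vsSource = R"(
    #version 330 core
    layout(location = 0) in vec3 aPosition;
    uniform mat4 uModelToWorld;  // per-model; re-uploaded only when the model moves
    uniform mat4 uWorldToClip;   // per-frame; combines view and projection
    out vec3 vWorldPos;          // world-space position, e.g. for lighting
    void main() {
        vec4 world = uModelToWorld * vec4(aPosition, 1.0);
        vWorldPos = world.xyz;
        gl_Position = uWorldToClip * world;
    }
)";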
But is there a genuine performance issue here? Obviously it somewhat depends on the hardware. However, consider that many applications have hundreds if not thousands of objects, each with matrices that change every frame. An animated character alone usually has over a hundred matrices that change every frame.
So it seems unlikely that the performance cost of uploading a few matrices that could have been constant is a real-world problem.
Per-Object Storage
What you really want is to separate your storage of per-object data from the program object itself. This can be done with UBOs or SSBOs; in both cases, you're storing uniform data in buffer objects.
The former are typically smaller in size (64KB or so), while the latter are essentially unbounded in their storage (16MB minimum). Obviously, the former are typically faster to access, but SSBOs shouldn't be considered to be slow.
Each object would have a section of the buffer that gets used for per-object data. And thus, you could choose to change it or not as you see fit.
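A sketch of that layout: one buffer with a fixed-size slice per object, bound with glBindBufferRange (the 256-byte slice size is an assumption; query GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT for the real alignment requirement):

// One UBO holding a per-object slice (model matrix plus whatever else).
const GLsizeiptr sliceSize = 256;  // padded to the offset alignment
glBindBuffer(GL_UNIFORM_BUFFER, perObjectUBO);
glBufferData(GL_UNIFORM_BUFFER, sliceSize * objectCount, nullptr, GL_DYNAMIC_DRAW);

// Update only the objects whose data actually changed this frame.
glBufferSubData(GL_UNIFORM_BUFFER, sliceSize * objectIndex,
                sizeof(modelMatrix), &modelMatrix[0][0]);

// When drawing object objectIndex, point uniform block binding 0 at its slice.
glBindBufferRange(GL_UNIFORM_BUFFER, 0, perObjectUBO,
                  sliceSize * objectIndex, sliceSize);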
Even so, such a system does not guarantee faster performance. For example, if the implementation is still reading from that buffer from last frame when you try to change it this frame, the implementation will have to either allocate new memory or just wait until the GPU is finished. This is not a hypothetical possibility; GPU rendering for complex scenes frequently lags a frame behind the CPU.
So to avoid that, you would need to double-buffer your per-object data. But when you do that, you will have to always upload their data, even if it doesn't change. Why? Because it might have changed two frames ago, and your double buffer has old data in it.
Basically, your goal of trying to avoid uploading of sometimes-static per-model data is just as likely to harm performance as to help it.
First of all, it's likely that at least something in your scene moves. If it is the objects, then the model matrix will change from frame to frame; if it is the camera, then the view or projection matrix will change. MVP is the composition of the three, so it will actually change anyway, and you can't get away from updating it one way or another.
However, you may still benefit from employing some of these:
Use Uniform Buffer Objects. You can send the uniforms to the GPU only once, and then rebind the buffer that the program will read the uniforms from. So different models may use different UBOs for their parameters (like model matrix).
Use Instancing. Even if you render only one instance of every model, you can pass the model matrix as an instanced vertex attribute (see the sketch after this list). It will be stored in the VAO, and so sent to the GPU only once (or only when you have to update it). As a bonus, you may now easily render multiple instances of the same model through instanced draw calls.
Note that it might be beneficial to separate the model, view and projection matrices. View and projection can be passed through a 'camera description' uniform buffer object, updated only once per frame and referenced by all programs. The model matrix, if it isn't changed, then stays constant within the VAO. To do proper lighting you have to separate model-view from projection anyway. It might look intimidating to work with three matrices on the GPU, but you actually don't have to: you may switch to a quaternion-based pipeline instead, which in turn simplifies things like tangent-space interpolation.
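A sketch of the instanced-attribute approach from the second point: a mat4 attribute occupies four consecutive vec4 locations, each with a divisor of 1 (the buffer and the starting location 3 are hypothetical):

glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, modelMatrixVBO);  // one mat4 per instance
for (int col = 0; col < 4; ++col) {
    GLuint loc = 3 + col;                       // locations 3..6 hold the columns
    glEnableVertexAttribArray(loc);
    glVertexAttribPointer(loc, 4, GL_FLOAT, GL_FALSE, sizeof(float) * 16,
                          (void*)(sizeof(float) * 4 * col));
    glVertexAttribDivisor(loc, 1);              // advance per instance, not per vertex
}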
Two words: Premature Optimization!
I don't want to send useless data through the bus and if I understand correctly, my model matrix is constant for each model.
The amount of data transmitted is insignificant. A 4×4 matrix of single precision floats takes up 64 bytes. For all intents and purposes this is practically nothing. Heck it takes more data to issue the actual drawing commands to the GPU (and usually uniform value changes are packed into the same bus transaction as the drawing commands).
I'm thinking about having a uniform for each model's MVP matrix
Then you're going to run out of uniforms. There's only so many uniform locations a GPU is required to support. You could of course use a uniform buffer object, but that's hardly the right application for that.

Occlusion Queries and Instanced Rendering

I'm facing a problem where the use of an occlusion query in combination with instanced rendering would be desirable.
As far as I understand, something like
glBeginQuery(GL_ANY_SAMPLES_PASSED, occlusionQuery);
glDrawArraysInstanced(mode, i, j, countInstances);
glEndQuery(GL_ANY_SAMPLES_PASSED);
will only tell me whether any of the instances were drawn.
What I would need to know is which set of instances has been drawn (giving me the IDs of all visible instances). Drawing each instance in its own call is not an option for me.
An alternative would be to color-code the instances and detect the visible instances manually.
But is there really no way to solve this problem with a query command and why would it not be possible?
It's not possible for several reasons.
Query objects only contain a single counter value. What you want would require a separate sample passed count for each instance.
Even if query objects stored arrays of sample counts, you can issue more than one draw call in the begin/end scope of a query. So how would OpenGL know which part of which draw call belonged to which query value in the array? You can even change other state within the query scope; uniform bindings, programs, pretty much anything.
The samples-passed count is determined entirely by the rasterizer hardware on the GPU. And the rasterizer neither knows nor cares which instance generated a triangle.
Instancing is a function of the vertex processing and/or vertex specification stages; by the time the rasterizer sees it, that information is gone. Notice that fragment shaders don't even get an instance ID as input, unless you explicitly create one by passing it from your vertex processing stage(s).
However, if you truly want to do this you could use image load/store and its atomic operations. That is, pass the fragment shader the instance in question (as an int data type, with flat interpolation). This FS also uses a uimageBuffer buffer texture, which uses the GL_R32UI format (or you can use an SSBO unbounded array). It then performs an imageAtomicAdd, using the instance value passed in as the index to the buffer. Oh, and you'll need to have the FS explicitly require early tests, so that samples which fail the fragment tests will not execute.
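A sketch of such a fragment shader, assuming GL 4.2+ image load/store (the binding point and names are illustrative):

const char* fsSource = R"(
    #version 420 core
    layout(early_fragment_tests) in;  // depth/stencil tests run before this shader
    flat in int vInstance;            // instance index passed from the VS
    layout(binding = 0, r32ui) uniform uimageBuffer uVisibility;
    out vec4 fragColor;
    void main() {
        // Tally one surviving sample for this instance.
        imageAtomicAdd(uVisibility, vInstance, 1u);
        fragColor = vec4(1.0);
    }
)";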
Then use a compute shader to build up a list of rendering commands for the instances which have non-zero values in the array. Then use an indirect rendering call to draw the results of this computation. Now obviously, you will need to properly synchronize access between these various operations. So you'll need to use appropriate glMemoryBarrier calls between each one.
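On the host side the sequence might look roughly like this (the programs, buffers, and the workgroup size of 64 are hypothetical):

// Pass 1: render; the FS above tallies visible samples per instance.
glDrawArraysInstanced(GL_TRIANGLES, 0, vertexCount, instanceCount);

// Make the image writes visible to the compute shader.
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

// Pass 2: compute shader scans the counts and writes draw commands.
glUseProgram(buildCommandsProgram);
glDispatchCompute((instanceCount + 63) / 64, 1, 1);

// Make the generated commands visible to the indirect draw.
glMemoryBarrier(GL_COMMAND_BARRIER_BIT);

// Pass 3: draw only the instances found visible.
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, commandBuffer);
glDrawArraysIndirect(GL_TRIANGLES, nullptr);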
Even if queries worked the way you want them to, the above would still be far preferable to using a query object. Unless you're reading a query into a buffer object, reading a query object requires a GPU/CPU synchronization of some form. The above requires some synchronization and barrier operations too, but they're all on-GPU operations, rather than synchronizations with the CPU.

Under what conditions does a multi-pass approach become strictly necessary?

I'd like to enumerate those general, fundamental circumstances under which multi-pass rendering becomes an unavoidable necessity, as opposed to keeping everything within the same shader program. Here's what I've come up with so far.
When a result requires non-local fragment information (i.e. context) around the current fragment, e.g. for box filters, which a previous pass must then have supplied;
When a result needs hardware interpolation done by a prior pass;
When a result acts as a pre-cache of some set of calculations that enables substantially better performance than simply (re-)working through the entire set of calculations in every pass that uses them, e.g. transforming each fragment of the depth buffer in a particular and costly way, which multiple later-pass shaders can then share rather than each repeating those calculations. So: calculate once, use more than once.
I note from my own (naive) deductions above that vertex and geometry shaders don't really seem to come into the picture of deferred rendering, and so are probably usually done in first pass; to me this seems sensible, but either affirmation or negation of this, with detail, would be of interest.
P.S. I am going to leave this question open to gather good answers, so don't expect quick wins!
Nice topic. Since I'm a beginner, the reason I would give is avoiding the unnecessary pixel/fragment shader calculations you get when you use forward rendering.
With forward rendering you have to do a pass for every light you have in your scene, even if the pixel colors aren't affected.
But that's just a comparison between forward rendering and deferred rendering.
As opposed to keeping everything in the same shader program, the simplest benefit I can think of is that you aren't restricted to some fixed number N of lights in your scene, since in GLSL, for instance, you can use either separate lights or store them in a uniform array. Then again, you can also use forward rendering, but if you have a lot of lights in your scene, forward rendering's pixel/fragment shader becomes too expensive.
That's all I really know so I would like to hear other theories as well.
Deferred / multi-pass approaches are used when the contents of the depth buffer (produced by first rendering basic geometry) are needed in order to produce complex pixel/fragment shading effects based on depth, such as:
Edge / silhouette detection
Lighting
And also application logic:
GPU picking, which requires the depth buffer for ray calculation, and uniquely-coloured / ID'ed geometries in another buffer for identification of "who" was hit.
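As a rough sketch of the "calculate once, use more than once" pattern behind all of these: render depth into a texture once, then let several later passes sample it (the FBO setup and helper functions are hypothetical):

// Pass 1: render scene depth into a texture attached to an FBO.
glBindFramebuffer(GL_FRAMEBUFFER, depthFBO);  // has depthTexture attached
glClear(GL_DEPTH_BUFFER_BIT);
drawSceneGeometry();

// Later passes: each effect samples the same depth texture.
glBindFramebuffer(GL_FRAMEBUFFER, 0);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, depthTexture);
glUseProgram(edgeDetectProgram);  // e.g. silhouette detection
drawFullscreenQuad();
glUseProgram(lightingProgram);    // e.g. depth-based lighting
drawFullscreenQuad();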

Is there any impact from not using an enabled attribute?

Should I disable vertex attributes when switching to a shader program which uses fewer attributes (or attributes at different locations)?
I enable and disable these attributes with glEnableVertexAttribArray()/glDisableVertexAttribArray().
Is there any performance impact? Could it cause bugs? Or would enabling/disabling be slower than enabling all attributes and leaving them enabled?
The OP most likely understands the first part already, but let me just reiterate some points on vertex attributes to set the basis for the more interesting part. I'll assume that all vertex data comes from buffers, and not talk about the case where calls like glVertexAttrib3f() are used to set "constant" values for attributes.
The glEnableVertexAttribArray() and glVertexAttribPointer() calls specify which vertex attributes are enabled, and describe how the GPU should retrieve their values. This includes their location in memory, how many components they have, their type, stride, etc. I'll call the collected state specified by these calls "vertex attribute state" in the rest of this answer.
The vertex attribute state is not part of the shader program state. It lives in Vertex Array Objects (VAOs), together with some other related state. Therefore, binding a different program changes nothing about the vertex attribute state. Only binding a different VAO does, or of course making one of the calls above.
Vertex attributes are tied to attribute/in variables in the vertex shader by setting the location of the in variables. This specifies which vertex attribute the value of each in variable should come from. The location value is part of the program state.
Based on this, when binding a different program, it is necessary that the locations of the in variables are properly set to refer to the desired attribute. As long as the same attribute is always used for the shader, this has to be done only once while building the shader. Beyond that, all the attributes used by the shader have to be enabled with glEnableVertexAttribArray(), or by binding a VAO that contains the state.
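A sketch of that wiring, with a hypothetical attribute at location 0; here the location is fixed in the shader with a layout qualifier, so it only has to match the index used on the API side:

// In the vertex shader: layout(location = 0) in vec3 aPosition;
glBindVertexArray(vao);        // the vertex attribute state is recorded here
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableVertexAttribArray(0);  // enable attribute 0...
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);  // ...and describe it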
Now, finally coming to the core of the question: What happens if attributes that are not used by the program are enabled?
I believe that having unused attributes enabled is completely legal. At least I've never seen anything in the spec that says otherwise. I just checked again, and still found nothing. Therefore, there will be no bugs resulting from having unused attributes enabled.
Does it hurt performance? The short answer is that it possibly could. Let's look at two hypothetical hardware architectures:
Architecture A has reading of vertex attribute values baked into the vertex shader code.
Architecture B has a fixed function unit that reads vertex attribute values. This fixed function unit is controlled by the vertex attribute state, and writes the values into on-chip memory, where vertex shader instances pick them up.
With architecture A, having unused attributes enabled would have no effect at all. They would simply never be read.
With architecture B, the fixed function unit might read the unused attributes. The vertex shader would end up not using them, but they could still be read from main/video memory into on-chip memory. The driver could avoid that by checking which attributes are used by the current shader, and set up the fixed function unit with only those attributes. The downside is that the state setup for the fixed function unit has to be checked/updated every time a new shader is bound, which is otherwise unnecessary. But it prevents reading unused attributes from memory.
Going one step further, let's say we do end up reading unused attributes from memory. Whether and how much this hurts is impossible to answer in general. Intuitively, I would expect it to matter very little if the attributes are interleaved and the unused attributes are in the same cache lines as used attributes. On the other hand, if reading unused attributes causes extra cache misses, it would at least use memory bandwidth and consume power.
In summary, I don't believe there's a clear and simple answer. Chances are that having unused attributes enabled will not hurt at all, or very little. But I would personally disable them anyway. There is a potential that it might make a difference, and it's very easy to do. Particularly if you use VAOs, you can generally set up the whole vertex attribute state with a single glBindVertexArray() call, so enabling/disabling exactly the needed attributes does not require additional API calls.