How to send my model matrix only once per model to shaders - opengl

For reference, I'm following this tutorial. Now suppose I have a little application with multiple types of models. If I understand correctly, I have to send my MVP matrix from the CPU to the GPU (in other words, to my vertex shader) for each model, because each model might have a different model matrix.
Now looking at the tutorial and this post, I understand that the call to send the matrix to my shader (glUniformMatrix4fv(myMatrixID, 1, GL_FALSE, &myModelMVP[0][0])) should be done for each frame and for each model since each time it overwrites the previous value of my MVP (the one for my last model). But, being concerned about the performance of my app, I don't want to send useless data through the bus and if I understand correctly, my model matrix is constant for each model.
I'm thinking about having a uniform for each model's MVP matrix, but I think that isn't scalable, and I would also have to update all of them whenever my view or projection matrices changed... Is there a way to avoid sending my model matrices multiple times and to only send my view and projection matrices when they change?

These are essentially two questions: how to avoid sending data when only part of a transformation sequence changes, and how to efficiently supply per-model data which may or may not have changed since the last frame.
Transformation Sequence
For the first, you have a transformation sequence. Your positions are in model space. You then conceptually transform them into world space, then to camera/view space, then finally to clip space, where you write the position to gl_Position.
Most of these transformations are constant throughout a frame, but may change on a frame-to-frame basis. So you want to avoid changing data that doesn't strictly need to be changed.
If you want to do this, then clearly you cannot provide an "MVP" matrix. That is, you should not have a single matrix that contains the whole transformation. You should instead have a matrix that represents particular parts of the transformation.
However, you will need to do this decomposition for reasons other than performance. You cannot do many lighting operations in clip-space; as a non-linear space, it messes up lots of lighting operations. Therefore, if you're going to do lighting at all, you need a transformation that stops before clip space.
Camera/view space is the most common stopping point for lighting computations.
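For illustration, a minimal vertex shader for that split might look like the sketch below, shown here as a C++ string literal; the uniform names modelToCamera and cameraToClip are invented for this example:

const char* vertexShaderSrc = R"glsl(
#version 330 core
layout(location = 0) in vec3 position;
layout(location = 1) in vec3 normal;
uniform mat4 modelToCamera;   // model space -> camera/view space
uniform mat4 cameraToClip;    // camera/view space -> clip space (projection)
out vec3 viewPos;             // stop here for lighting
out vec3 viewNormal;
void main() {
    vec4 camPos = modelToCamera * vec4(position, 1.0);
    viewPos     = camPos.xyz;
    viewNormal  = mat3(modelToCamera) * normal;  // fine as long as there is no non-uniform scaling
    gl_Position = cameraToClip * camPos;
}
)glsl";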
Now, if you use model-to-camera and camera-to-clip, then the model-to-camera matrix for every model will change when the camera changes, even if the model itself has not moved. And therefore, you may need to upload a bunch of matrices that don't strictly need to be changed.
To avoid that, you would need to use model-to-world and world-to-clip (in this case, you do your lighting in world space). The issue here is that you are exposed to the perils of world space: numerical precision may become problematic.
But is there a genuine performance issue here? Obviously it somewhat depends on the hardware. However, consider that many applications have hundreds if not thousands of objects, each with matrices that change every frame. An animated character usually has over a hundred matrices just for themselves that change every frame.
So it seems unlikely that the performance cost of uploading a few matrices that could have been constant is a real-world problem.
Per-Object Storage
What you really want is to separate your storage of per-object data from the program object itself. This can be done with UBOs or SSBOs; in both cases, you're storing uniform data in buffer objects.
The former are typically smaller in size (64KB or so), while the latter are essentially unbounded in their storage (16MB minimum). Obviously, the former are typically faster to access, but SSBOs shouldn't be considered to be slow.
Each object would have a section of the buffer that gets used for per-object data. And thus, you could choose to change it or not as you see fit.
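As a rough sketch of what that per-object storage could look like with a UBO (the handle names, the binding index 0, and the use of GLM are assumptions of this example, and error handling is omitted):

#include <GL/glew.h>
#include <glm/glm.hpp>

GLuint     perObjectUBO = 0;
GLsizeiptr objectStride = 0;   // per-object slice, rounded up to the required UBO alignment

void createPerObjectBuffer(int objectCount) {
    GLint align = 0;
    glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
    objectStride = ((sizeof(glm::mat4) + align - 1) / align) * align;
    glGenBuffers(1, &perObjectUBO);
    glBindBuffer(GL_UNIFORM_BUFFER, perObjectUBO);
    glBufferData(GL_UNIFORM_BUFFER, objectStride * objectCount, nullptr, GL_DYNAMIC_DRAW);
}

// Rewrite only the objects that actually changed...
void updateObjectMatrix(int index, const glm::mat4& modelMatrix) {
    glBindBuffer(GL_UNIFORM_BUFFER, perObjectUBO);
    glBufferSubData(GL_UNIFORM_BUFFER, index * objectStride, sizeof(modelMatrix), &modelMatrix);
}

// ...but point the shader's uniform block at the right slice for every draw.
void bindObjectSlice(int index) {
    glBindBufferRange(GL_UNIFORM_BUFFER, 0, perObjectUBO, index * objectStride, sizeof(glm::mat4));
}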
Even so, such a system does not guarantee faster performance. For example, if the implementation is still reading from that buffer from last frame when you try to change it this frame, the implementation will have to either allocate new memory or just wait until the GPU is finished. This is not a hypothetical possibility; GPU rendering for complex scenes frequently lags a frame behind the CPU.
So to avoid that, you would need to double-buffer your per-object data. But when you do that, you will have to always upload their data, even if it doesn't change. Why? Because it might have changed two frames ago, and your double buffer has old data in it.
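A minimal sketch of that double-buffering, assuming the per-object data is packed into one contiguous upload (the handle names are invented):

GLuint perObjectBuf[2];   // two identically sized buffers, created elsewhere
int    writeIndex = 0;

void uploadPerObjectData(const void* allObjectData, GLsizeiptr bytes) {
    writeIndex ^= 1;                                        // flip buffers every frame
    glBindBuffer(GL_UNIFORM_BUFFER, perObjectBuf[writeIndex]);
    // Upload everything: the buffer we just switched to holds data from two
    // frames ago, so even objects that did not move this frame must be rewritten.
    glBufferSubData(GL_UNIFORM_BUFFER, 0, bytes, allObjectData);
}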
Basically, your goal of trying to avoid uploading of sometimes-static per-model data is just as likely to harm performance as to help it.

First of all, it's likely that at least something in your scene moves. If it is the objects, then the model matrix will change from frame to frame; if it is the camera, then the view or projection matrix will change. MVP is the composition of all three, so it will change anyway and you can't get away from updating it one way or the other.
However, you may still benefit from employing some of these:
Use Uniform Buffer Objects. You can send the uniforms to the GPU only once, and then rebind the buffer that the program will read the uniforms from. So different models may use different UBOs for their parameters (like model matrix).
Use Instancing. Even if you render only one instance of every model, you can pass the model matrix as an instanced vertex attribute. It will be stored in a buffer referenced by the VAO, and so sent to the GPU only once (or only when you have to update it). As a bonus, you can now easily render multiple instances of the same model through instanced draw calls (a sketch follows after these points).
Note it might be beneficial to separate the model, view and projection matrices. View and projection might be passed through a 'camera description' uniform buffer object updated only once per frame, then referenced by all programs. Model matrix, if it isn't changed, then will be constant within the VAO. To do proper lighting you have to separate model-view from projection anyways. It might look intimidating to work with three matrices on the GPU, but you actually don't have to, as you may switch to quaternion-based pipeline instead, which in turn simplifies such things like tangent space interpolation.
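Here is a rough sketch of the instanced-attribute idea from the second point above. Attribute locations 4-7 are an arbitrary choice (a mat4 attribute always occupies four consecutive locations), and vao, modelMatrixVBO, vertexCount and instanceCount are assumed to exist:

glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, modelMatrixVBO);    // one glm::mat4 per instance
for (int col = 0; col < 4; ++col) {
    glEnableVertexAttribArray(4 + col);
    glVertexAttribPointer(4 + col, 4, GL_FLOAT, GL_FALSE, sizeof(glm::mat4),
                          (void*)(sizeof(glm::vec4) * col));   // one matrix column per attribute location
    glVertexAttribDivisor(4 + col, 1);            // advance once per instance, not per vertex
}
glDrawArraysInstanced(GL_TRIANGLES, 0, vertexCount, instanceCount);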

Two words: Premature Optimization!
I don't want to send useless data through the bus and if I understand correctly, my model matrix is constant for each model.
The amount of data transmitted is insignificant. A 4×4 matrix of single precision floats takes up 64 bytes. For all intents and purposes this is practically nothing. Heck it takes more data to issue the actual drawing commands to the GPU (and usually uniform value changes are packed into the same bus transaction as the drawing commands).
I'm thinking about having a uniform for each model's MVP matrix
Then you're going to run out of uniforms. There's only so many uniform locations a GPU is required to support. You could of course use a uniform buffer object, but that's hardly the right application for that.
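The actual limits can be queried at run time, for example:

GLint maxVertexUniformComponents = 0, maxUniformBlockSize = 0;
glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS, &maxVertexUniformComponents); // default-block uniforms, in components
glGetIntegerv(GL_MAX_UNIFORM_BLOCK_SIZE, &maxUniformBlockSize);               // a single UBO block, in bytes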

Related

Can I use glUniform4fv to set an individual column of a shader matrix?

Is it possible to obtain the location of a glsl mat4 individual column? I want to update the value of an individual column of a matrix defined in the shader without actually having to set the whole matrix uniform.
I have a game where the camera orientation stays often the same but the translation part changes frequently. My idea was to only update the affected translation part of the VP matrix (projection * view) in order to squeeze some performance.
You cannot. You can no more set only one column of a matrix uniform than you can set the high two bytes of a uint uniform without setting the other two bytes too. When it comes to uniforms, matrices are just as much a basic type as a vector or scalar.
My idea was to only update the translation part of the view matrix in order to squeeze some performance.
This will not do that. The performance of doing an extremely small CPU-to-GPU memory transfer will be dominated by the overhead of doing any CPU-to-GPU transfer. That is, the cost to transfer 16 bytes will be basically identical to the cost of transferring 64 bytes. The amount of data transferred only becomes significant to the cost of the transfer when that amount starts getting large (kilo/mega bytes).
So this is a waste of time. Just transfer the matrix and move on. Premature optimization is the root of all evil.
Is it possible to obtain the location of a glsl mat4 individual column?
No.
For a mat uniform variable type, you need to use the appropriate glUniformMatrix...() call, and you cannot update individual parts of it.
Possible alternatives:
Use a Uniform Buffer Object, where you can individually control every single byte, as already suggested in @Rabbid76's comment.
Use uniform vec4 mymat[4] instead of mat4 and construct the matrix in the shader (if needed) or directly use the individual column vectors for the calculations.
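A hedged sketch of that second alternative (the uniform name mymat comes from the bullet above; everything else is illustrative):

// GLSL side:
//   uniform vec4 mymat[4];
//   mat4 m = mat4(mymat[0], mymat[1], mymat[2], mymat[3]);   // one vec4 per column
// C++ side: each array element has its own queryable location, so a single
// column can be updated without touching the other three.
GLint col2 = glGetUniformLocation(program, "mymat[2]");
float newColumn[4] = { 0.0f, 0.0f, 5.0f, 1.0f };
glUniform4fv(col2, 1, newColumn);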
No, you can't. Generally you don't need to, because modern hardware does this kind of CPU-to-GPU transfer so fast that there is no measurable difference between sending a column and sending a whole matrix. But in my case (embedded systems using OpenGL ES) I hit a bottleneck in a similar situation, so I changed the uniforms to attributes: each row of the matrix becomes a vec4 attribute, and the data is sent through a VBO. Frequent data changes are handled with buffer streaming. I'm not saying this is a good approach or that it will increase performance (hardware differs), but in my case it works fine.

Occlusion Queries and Instanced Rendering

I'm facing a problem where the use of an occlusion query in combination with instanced rendering would be desirable.
As far as I understood, something like
glBeginQuery(GL_ANY_SAMPLES_PASSED, occlusionQuery);
glDrawArraysInstanced(mode, i, j, countInstances);
glEndQuery(GL_ANY_SAMPLES_PASSED);
will only tell me if any of the instances were drawn.
What I would need to know is which set of instances has been drawn (giving me the IDs of all visible instances). Drawing each instance in its own call is not an option for me.
An alternative would be to color-code the instances and detect the visible instances manually.
But is there really no way to solve this problem with a query command and why would it not be possible?
It's not possible for several reasons.
Query objects only contain a single counter value. What you want would require a separate sample passed count for each instance.
Even if query objects stored arrays of sample counts, you can issue more than one draw call in the begin/end scope of a query. So how would OpenGL know which part of which draw call belonged to which query value in the array? You can even change other state within the query scope; uniform bindings, programs, pretty much anything.
The samples-passed count is determined entirely by the rasterizer hardware on the GPU. And the rasterizer neither knows nor cares which instance generated a triangle.
Instancing is a function of the vertex processing and/or vertex specification stages; by the time the rasterizer sees it, that information is gone. Notice that fragment shaders don't even get an instance ID as input, unless you explicitly create one by passing it from your vertex processing stage(s).
However, if you truly want to do this you could use image load/store and its atomic operations. That is, pass the fragment shader the instance in question (as an int data type, with flat interpolation). This FS also uses a uimageBuffer buffer texture, which uses the GL_R32UI format (or you can use an SSBO unbounded array). It then performs an imageAtomicAdd, using the instance value passed in as the index to the buffer. Oh, and you'll need to have the FS explicitly require early tests, so that samples which fail the fragment tests will not execute.
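A sketch of such a fragment shader, shown as a C++ string literal; the names vInstanceID and visibleCounts are invented, and the vertex shader is assumed to forward gl_InstanceID through a flat output:

const char* visibilityFragmentShader = R"glsl(
#version 430 core
layout(early_fragment_tests) in;    // make sure fragments that fail the depth test never run this shader
flat in int vInstanceID;            // forwarded from the vertex shader (gl_InstanceID)
layout(r32ui, binding = 0) uniform uimageBuffer visibleCounts;
out vec4 fragColor;
void main() {
    imageAtomicAdd(visibleCounts, vInstanceID, 1u);   // mark this instance as visible
    fragColor = vec4(1.0);
}
)glsl";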
Then use a compute shader to build up a list of rendering commands for the instances which have non-zero values in the array. Then use an indirect rendering call to draw the results of this computation. Now obviously, you will need to properly synchronize access between these various operations. So you'll need to use appropriate glMemoryBarrier calls between each one.
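Roughly, the ordering would look like the following sketch. The program and buffer handles are placeholders, and the compute shader is assumed to write one draw command per instance, with an instance count of zero for instances whose counter stayed at zero:

glUseProgram(visibilityProgram);
glBindImageTexture(0, visibleCountsTex, 0, GL_FALSE, 0, GL_READ_WRITE, GL_R32UI);
glDrawArraysInstanced(GL_TRIANGLES, 0, vertexCount, instanceCount);  // fills the per-instance counters

glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);      // image writes become visible to the compute shader

glUseProgram(buildCommandsProgram);
glDispatchCompute((instanceCount + 63) / 64, 1, 1);       // one thread per instance

glMemoryBarrier(GL_COMMAND_BARRIER_BIT);                  // command writes become visible to indirect drawing
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectCommandBuffer);
glMultiDrawArraysIndirect(GL_TRIANGLES, nullptr, instanceCount, 0);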
Even if queries worked the way you want them to, this would be overall far more preferable than using a query object. Unless you're reading a query into a buffer object, reading a query object requires a GPU/CPU synchronization of some form. Whereas the above requires some synchronization and barrier operations, but they're all on-GPU operations, rather than synchronizing with the CPU.

Under what conditions does a multi-pass approach become strictly necessary?

I'd like to enumerate those general, fundamental circumstances under which multi-pass rendering becomes an unavoidable necessity, as opposed to keeping everything within the same shader program. Here's what I've come up with so far.
When a result requires non-local fragment information (i.e. context) around the current fragment, e.g. for box filters, then a previous pass must have supplied this;
When a result needs hardware interpolation done by a prior pass;
When a result acts as pre-cache of some set of calculations that enables substantially better performance than simply (re-)working through the entire set of calculations in those passes that use them, e.g. transforming each fragment of the depth buffer in a particular and costly way, which multiple later-pass shaders can then share, rather than each repeating those calculations. So, calculate once, use more than once.
I note from my own (naive) deductions above that vertex and geometry shaders don't really seem to come into the picture of deferred rendering, and so are probably usually done in the first pass; to me this seems sensible, but either affirmation or negation of this, with detail, would be of interest.
P.S. I am going to leave this question open to gather good answers, so don't expect quick wins!
Nice topic. Since I'm a beginner, I would say it's to avoid the unnecessary calculations in the pixel/fragment shader that you get when you use forward rendering.
With forward rendering you have to do a pass for every light you have in your scene, even if the pixel colors aren't affected.
But that's just a comparison between forward rendering and deferred rendering.
As opposed to keeping everything in the same shader program, the simplest benefit I can think of is that you aren't restricted to some fixed number N of lights in your scene, since in GLSL, for instance, you can use either separate lights or store them in a uniform array. You could also use forward rendering for that, but if you have a lot of lights in your scene, forward rendering makes the pixel/fragment shader too expensive.
That's all I really know so I would like to hear other theories as well.
Deferred / multi-pass approaches are used when the contents of the depth buffer (produced by rendering the basic geometry) are needed in order to produce complex pixel / fragment shading effects based on depth, such as:
Edge / silhouette detection
Lighting
And also application logic:
GPU picking, which requires the depth buffer for ray calculation, and uniquely-coloured / ID'ed geometries in another buffer for identification of "who" was hit.

How do I deal with many variables per triangle in OpenGL?

I'm working with OpenGL and am not totally happy with the standard method of passing values PER TRIANGLE (or in my case, quads) that need to make it to the fragment shader, i.e., assign them to each vertex of the primitive and pass them through the vertex shader to presumably be unnecessarily interpolated (unless using the "flat" directive) in the fragment shader (so in other words, non-varying per fragment).
Is there some way to store a value PER triangle (or quad) that needs to be accessed in the fragment shader, in such a way that you don't need redundant copies of it per vertex? If so, is this way better than the likely overhead of 3x (or 4x) the data-moving code on the CPU side?
I am aware of using geometry shaders to spread the values out to new vertices, but I have heard geometry shaders are terribly slow on older hardware. Is this the case?
OpenGL fragment language supports the gl_PrimitiveID input variable, which will be the index of the primitive for the currently processed fragment (starting at 0 for each draw call). This can be used as an index into some data store which holds per-primitive data.
Depending on the amount of data that you will need per primitive, and the number of primitives in total, different options are available. For a small number of primitives, you could just set up a uniform array and index into that.
For a reasonably high number of primitives, I would suggest using a texture buffer object (TBO). This is basically an ordinary buffer object, which can be accessed read-only at random locations via the texelFetch GLSL operation. Note that TBOs are not really textures, they only reuse the existing texture object interface. Internally, it is still a data fetch from a buffer object, and it is very efficient with none of the overhead of the texture pipeline.
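A sketch of that setup, assuming one vec4 of data per triangle and GLM on the C++ side (the buffer and variable names are invented):

GLuint tbo = 0, tboTexture = 0;
glGenBuffers(1, &tbo);
glBindBuffer(GL_TEXTURE_BUFFER, tbo);
glBufferData(GL_TEXTURE_BUFFER, triangleCount * sizeof(glm::vec4), perTriangleData, GL_STATIC_DRAW);

glGenTextures(1, &tboTexture);
glBindTexture(GL_TEXTURE_BUFFER, tboTexture);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo);   // expose the buffer as RGBA32F texels

// Fragment shader side:
//   uniform samplerBuffer perTriangle;
//   vec4 data = texelFetch(perTriangle, gl_PrimitiveID);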
The only issue with this approach is that you cannot easily mix different data types. You have to define a base data type for your TBO, and every fetch will get you the data in that format. If you just need some floats/vectors per primitive, this is not a problem at all. If you e.g. need some ints and some floats per primitive, you could either use different TBOs, one for each type, or with modern GLSL (>=3.30), you could use an integer type for the TBO and reinterpret the integer bits as floating point with intBitsToFloat(), so you can get around that limitation, too.
You can use one element of a vertex attribute array for multiple vertices. This is called instanced vertex attributes.

How does interleaved vertex submission help performance?

I have read and seen other questions that all generally point to the suggestion to interleave vertex positions, colors, etc. into one array, as this minimizes the data that gets sent from CPU to GPU.
What I'm not clear on is how OpenGL does this when, even with an interleaved array, you must still make separate GL calls for position and color pointers. If both pointers use the same array, just set to start at different points in that array, does the draw call not copy the array twice since it was the object of two different pointers?
This is mostly about cache. For example, imagine we have four vertices and four colors. You can provide the information this way (excuse me, but I don't remember the exact function names):
glVertexPointer(..., vertex);
glColorPointer(..., colors);
What it does internally is read vertex[0], then apply colors[0], then vertex[1] with colors[1], and so on. As you can see, if vertex is, say, 20 megabytes long, then vertex[0] and colors[0] will be, to say the least, 20 megabytes apart from each other.
Now, on the other hand, if you provide a structure like { vertex0, color0, vertex1, color1, etc.} there will be a lot of cache hits because, well, vertex0 and color0 are together, and so are vertex1 and color1.
Hope this helps answer the question
edit: on second read, I may not have answered the question. You might be wondering how OpenGL knows which values to read from that structure. With a structure such as { vertex, color, vertex, color } you tell OpenGL that the vertex data starts at position 0 with a stride of 2 elements (so the next one is at position 2, then 4, etc.) and the color data starts at position 1, also with a stride of 2 (so position 1, then 3, etc.).
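In modern terms, the same idea looks roughly like the sketch below; attribute locations 0 and 1 and the handles vbo, vertices, vertexCount are assumptions, and offsetof comes from <cstddef>:

struct Vertex {
    float position[3];
    float color[3];
};

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexCount * sizeof(Vertex), vertices, GL_STATIC_DRAW);

glEnableVertexAttribArray(0);   // position
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, position));
glEnableVertexAttribArray(1);   // color, interleaved right next to the position
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                      (void*)offsetof(Vertex, color));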
addition: In case you want a more practical example, look at this link http://www.lwjgl.org/wiki/index.php?title=Using_Vertex_Buffer_Objects_(VBO). You can see there how it only provides the buffer once and then uses offsets to render efficiently.
I suggest reading: Vertex_Specification_Best_Practices
h4lc0n provided quite a nice explanation, but I would like to add some additional info:
interleaved data can actually hurt performance when your data changes often. For instance, when you move point sprites you update POS, but COLOR and TEXCOORD usually stay the same; with interleaved data you must still "touch" that additional data. In that case it would be better to have one VBO for POS only (or, in general, for the data that changes often) and a second VBO for the data that is constant; see the sketch after these points.
it is not easy to give strict rules about VBO layout, since it is very vendor/driver specific. Also, your usage may differ from others'. In general you need to run some benchmarks for your particular test cases.
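A sketch of the split suggested in the first point, with invented handle names: positions in their own streamed buffer, constant attributes in a static one.

// Dynamic data: positions only, re-uploaded whenever they change.
glBindBuffer(GL_ARRAY_BUFFER, dynamicVBO);
glBufferData(GL_ARRAY_BUFFER, posBytes, nullptr, GL_STREAM_DRAW);   // orphan the old storage
glBufferSubData(GL_ARRAY_BUFFER, 0, posBytes, positions);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);

// Static data: colors (or texcoords), uploaded once at load time.
glBindBuffer(GL_ARRAY_BUFFER, staticVBO);
glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 0, (void*)0);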
You could also make an argument for separating different attributes. Assuming a GPU does not process one vertex after another but rather a bunch of them (e.g. 16) in parallel, you would get something like this while executing a vertex shader:
read attribute A for all 16 vertices
perform some computations
read attribute B for all 16 vertices
perform some more computations
....
So you read one attribute for many vertices at once. From this reasoning it would seem that interleaving the attributes actually hurts performance. Of course this would only be visible if you are either bandwidth constrained or if the memory latency cannot be hidden for some reason (e.g. a complex shader that requires many registers will reduce the number of vertices that can be in flight at a given time).