Organizing GLSL shaders in OpenGL engine - c++

Which is better?
1. Have one shader program with a lot of uniforms that specify which lights to use and which mappings to apply (e.g. I need one mesh to be parallax mapped, and another one parallax/specular mapped). I'd keep a cached list of uniforms for lazy transfers, and change only a couple of uniforms for each subsequent mesh when it needs them.
2. Have a lot of shader programs, one for every needed case, each with a small number of uniforms, and do a lazy bind with glUseProgram for each mesh when it needs one. Here I assume that meshes are properly batched, to avoid redundant switches.

Most modern engines I know have a "shader cache" and use the second option, because apparently it's faster.
You can also take a look at the ARB_shader_subroutine extension, which allows dynamic linkage. But I think it's only available on DX11-class hardware.
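To make the "shader cache" idea a bit more concrete, here is a minimal CPU-side sketch. Everything in it is an assumption for illustration: the ShaderFeature flags, the ShaderCache class and the countProgramBinds helper are hypothetical names, and the build step is a placeholder where a real engine would compile and link GLSL. It only shows the bookkeeping: one program per feature combination, built on first use, with meshes sorted so the program is switched as rarely as possible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical feature flags; each combination maps to one program variant.
enum ShaderFeature : std::uint32_t {
    kParallax = 1u << 0,
    kSpecular = 1u << 1,
    kFog      = 1u << 2,
};

using ProgramHandle = std::uint32_t;  // stands in for a GLuint program name

class ShaderCache {
public:
    // Returns the program for a feature mask, building it on first use.
    ProgramHandle get(std::uint32_t features) {
        auto it = programs_.find(features);
        if (it != programs_.end()) return it->second;
        ProgramHandle handle = build(features);
        programs_.emplace(features, handle);
        return handle;
    }
    std::size_t size() const { return programs_.size(); }

private:
    // Placeholder: a real engine would compile and link GLSL here,
    // for example with one #define per enabled feature.
    ProgramHandle build(std::uint32_t /*features*/) { return nextHandle_++; }

    std::unordered_map<std::uint32_t, ProgramHandle> programs_;
    ProgramHandle nextHandle_ = 1;  // 0 is reserved for "nothing bound"
};

struct Mesh {
    std::uint32_t features;  // which shader variant this mesh needs
};

// Sorts meshes by variant and counts the program binds one frame would need
// when a bind is skipped if the program is already current (the lazy bind).
std::size_t countProgramBinds(std::vector<Mesh> meshes, ShaderCache& cache) {
    std::sort(meshes.begin(), meshes.end(),
              [](const Mesh& a, const Mesh& b) { return a.features < b.features; });
    std::size_t binds = 0;
    ProgramHandle current = 0;
    for (const Mesh& m : meshes) {
        ProgramHandle p = cache.get(m.features);
        if (p != current) {  // a real renderer would call glUseProgram(p) here
            current = p;
            ++binds;
        }
        // the draw call for m would go here
    }
    return binds;
}
```

With five meshes that use three distinct feature combinations, this ends up with three cached programs and three binds per frame, regardless of the order the meshes were submitted in.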

Generally, option 2 will be faster/better unless you have a truly huge number of programs. You can also use buffer objects shared across programs so that you need not reset any values when you change programs.
In addition, once you link a program, you can free all of the shaders that you linked into the program. This will free up all the source code and any pre-link info the driver is keeping around, leaving just the fully-linked program in memory.

I tend to believe that it depends on the specific application. Running, say, 100 programs with about 2 to 16 uniforms each may well be more efficient, but a trade-off between the two approaches may be better still. I would think that 10 to 20 programs for your most common shading techniques, or a few more, would be sufficient. For example, you might want one program to do all your bump mapping, one for fog effects, one for reflections and one for refractions.
Although it is outside the scope of your question, I think it pertains here as well: one thing to incorporate into your engine would be a BatchProcess and BatchManager class setup to reduce the number of CPU-to-GPU calls over the bus, as this improves efficiency too. I don't think there is a one-size-fits-all answer to your question; it is application specific, just like deciding how many batches (buckets) of vertices (primitives) your engine has and how many vertices each of those batches holds.
To make this a bit clearer: one game might have 4 batches that each hold up to 10,000 vertices before being considered full, at which point the BatchManager empties that bucket and sends all of its vertices to the graphics card to be processed and drawn. A different game might have 10 buckets of 5,000 vertices, and another might have 8 buckets of 12,000 vertices.
So there can be a trade-off in combining the two according to your needs. A single program with hundreds of uniforms is easier to manage within the pipeline, but its shaders are cumbersome to read and maintain. Shaders with very few uniforms are easy to read and maintain, but hundreds of programs are a little harder to manage on the CPU before linking them and sending them off to be rendered. I would personally look for a middle ground: enough programs that each handles one task that is completely distinct from the others, such as fog density in one and volumetric shadow mapping in another, with each program having just enough uniforms for the calculations it requires.
The next step would then be to do some benchmarking to see where your efficiency and your overhead balance out, and adjust accordingly.
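As a rough illustration of the bucket idea described above, here is a minimal sketch. It is not the BatchProcess/BatchManager design itself, whose details are not given here: the Batch struct, the stateKey field and the eviction rule (flush the fullest bucket when none is free) are all assumptions, and a flush only increments a counter where a real engine would upload the vertices and issue a draw call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One bucket of vertices that share render state (hypothetical layout).
struct Batch {
    std::size_t capacity;  // vertices the bucket holds before it is full
    std::size_t used;      // vertices currently queued
    int stateKey;          // e.g. a material or program id; -1 means unassigned
};

class BatchManager {
public:
    BatchManager(std::size_t batchCount, std::size_t verticesPerBatch)
        : batches_(batchCount, Batch{verticesPerBatch, 0, -1}) {
        assert(batchCount > 0);
    }

    // Queues vertexCount vertices that must be drawn with the given state.
    void submit(int stateKey, std::size_t vertexCount) {
        assert(stateKey >= 0);
        assert(vertexCount <= batches_[0].capacity);
        Batch* target = nullptr;
        Batch* empty = nullptr;
        Batch* fullest = &batches_[0];
        for (Batch& b : batches_) {
            if (b.stateKey == stateKey) target = &b;
            if (b.used == 0 && !empty) empty = &b;
            if (b.used > fullest->used) fullest = &b;
        }
        if (!target) {
            // No bucket with this state yet: take an empty one, else evict the fullest.
            target = empty ? empty : fullest;
            if (target->used > 0) flush(*target);
        } else if (target->used + vertexCount > target->capacity) {
            flush(*target);  // bucket would overflow, so draw what it holds first
        }
        target->stateKey = stateKey;
        target->used += vertexCount;
    }

    // Call at the end of the frame to draw whatever is still queued.
    void flushAll() {
        for (Batch& b : batches_) {
            if (b.used > 0) flush(b);
        }
    }

    std::size_t drawCalls() const { return drawCalls_; }

private:
    // Placeholder for "upload the vertices and issue one draw call".
    void flush(Batch& b) {
        ++drawCalls_;
        b.used = 0;
        b.stateKey = -1;
    }

    std::vector<Batch> batches_;
    std::size_t drawCalls_ = 0;
};
```

Changing the two constructor arguments is the tuning knob discussed above (4 x 10,000, 10 x 5,000 and so on); counting draw calls per frame for each setting is one simple way to start benchmarking.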


Vertex batches (geometry groups) and maximum VBO (vertex buffer) size

I did a lot of research on how to gather vertex data into groups, commonly called batches.
For me, these are the two most interesting articles on the subject:
The first article explains the best practices for working with VBOs (maximum size, format, etc.).
The second presents a simple example of how to manage vertex memory using batches. According to the author, each batch HAS TO contain an instance of a VBO (plus a VAO), and he strongly insists that the maximum size of a VBO should be between 1 MB (1,000,000 bytes) and 4 MB (4,000,000 bytes). The first article gives the same advice. I quote: "1MB to 4MB is a nice size according to one nVidia document. The driver can do memory management more easily. It should be the same case for all other implementations as well like ATI/AMD, Intel, SiS."
I have several questions:
1) Is the maximum byte size mentioned above an absolute rule? Is it really so bad to allocate a VBO larger than 4 MB (for example 10 MB)?
2) What should we do with meshes whose total vertex data is larger than 4 MB? Do I need to split the geometry into several batches?
3) Does a batch necessarily hold its own unique VBO as an attribute, or can several batches be stored in a single VBO? (These are two different approaches, but the first one seems to be the right choice.) Do you agree?
According to the author, each batch handles a unique VBO with a maximum size between 1 and 4 MB, and the whole VBO HAS TO contain only vertex data sharing the same material and transformation information. So if I have to batch another mesh with a different material (so its vertices can't be merged with existing batches), I have to create a new batch with a NEW VBO instantiated.
So according to the author, my second method is not correct: it's not advised to store several batches in a single VBO.
Is the maximum byte size mentioned above an absolute rule? Is it really so bad to allocate a VBO larger than 4 MB (for example 10 MB)?
That was a (very) old piece of info that is not necessarily valid on modern hardware.
The issue that led to the 4MB suggestion was about the driver being able to manage memory. If you allocated more memory than the GPU had, it would need to page some in and out. If you use smaller chunks for your buffer objects, the driver is more easily able to pick whole buffers to page out (because they're not in use at present).
However, this does not matter so much. The best thing you can do for performance is to avoid exceeding memory limits entirely. Paging things in and out hurts performance.
So don't worry about it. Note that I have removed this "advice" from the Wiki.
So according to the author, my second method is not correct: it's not advised to store several batches in a single VBO.
I think you're confusing the word "batch" with "mesh". But that's perfectly understandable; the author of that document you read doesn't seem to recognize the difference either.
For the purposes of this discussion, a "mesh" is a thing that is rendered with a single rendering command, which is conceptually separate from other things you would render. Meshes get rendered with certain state.
A "batch" refers to one or more meshes that could have been rendered with separate rendering commands. However, in order to improve performance, you use techniques to allow them all to be rendered with the same rendering command. That's all a batch is.
"Batching" is the process of taking a sequence of meshes and making it possible to render them as a batch. Instanced rendering is one form of batching; each instance is a separate "mesh", but you are rendering lots of them with one rendering call. They use their instance count to fetch their per-instance state data.
Batching takes many forms beyond instanced rendering. Batching often happens at the level of the artist. While the modeller/texture artist may want to split a character into separate pieces, each with their own textures and materials, the graphics programmer tells them to keep them as a single mesh that can be rendered with the same textures/materials.
With better hardware, the rules for batching can be relaxed. With array textures, you can give each mesh a particular ID, which it uses to pick which array layer it uses when fetching textures. This allows the artists to give such characters more texture variety without breaking the batch into multiple rendering calls. Ubershaders are another form, where the shader uses that ID to decide how to do lighting rather than (or in addition to) texture fetching.
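As a sketch of the "give each mesh an ID" idea, the following merges several meshes into one vertex/index pair that could be drawn with a single call. The Vertex layout, the meshId field and the mergeIntoBatch function are hypothetical names for illustration; in a real renderer the shader would read meshId to pick an array texture layer or a per-mesh entry in a buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical vertex: a position plus the ID of the mesh it came from.
struct Vertex {
    float x, y, z;
    std::uint16_t meshId;  // the shader would use this to select per-mesh data
};

struct Mesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;
};

// Concatenates meshes into one batch: vertices are tagged with their mesh ID
// and indices are rebased so they still point at the right vertices.
Mesh mergeIntoBatch(const std::vector<Mesh>& meshes) {
    assert(meshes.size() <= 65536);  // IDs must fit in 16 bits
    Mesh batch;
    for (std::size_t id = 0; id < meshes.size(); ++id) {
        const std::uint32_t base = static_cast<std::uint32_t>(batch.vertices.size());
        for (Vertex v : meshes[id].vertices) {
            v.meshId = static_cast<std::uint16_t>(id);
            batch.vertices.push_back(v);
        }
        for (std::uint32_t i : meshes[id].indices) {
            batch.indices.push_back(base + i);
        }
    }
    return batch;
}
```

Merging two triangles this way gives six vertices and the index list 0..5, with the last three vertices carrying meshId 1.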
The kind of batching that the person you're citing is talking about is... well, very confused.
What do you think about that?
Well, quite frankly I think the person from your second link should be ignored. The very first line of his code: class Batch sealed is not valid C++. It's some C++/CX Microsoft invention, which is fine in that context. But he's trying to pass this off as pure C++; that's not fine.
I'm also not particularly impressed by his code quality. He contradicts himself a lot. For example, he talks about the importance of being able to allocate reasonable-sized chunks of memory, so that the driver can more freely move things around. But his GuiVertex class is horribly bloated. It uses a full 16 bytes, four floats, just for colors. 4 bytes (as normalized unsigned integers) would have been sufficient. Similarly, his texture coordinates are floats, when shorts (as unsigned normalized integers) would have been fine for his use case. Assuming two texture coordinates and an unchanged 8-byte position, that would cut his per-vertex cost down from 32 bytes to 16; that's a 2:1 reduction.
4MB goes a lot further when you use reasonably sized vertex data. And the best part? The OpenGL Wiki page he linked to tells you to do exactly this. But he doesn't do it.
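To show what the packing looks like, here is a sketch with two vertex structs. The exact GuiVertex layout is not reproduced here, so FatGuiVertex is an assumption (two-float position, four-float color, two-float texture coordinates, 32 bytes in total), and PackedGuiVertex, toUnorm8 and toUnorm16 are hypothetical names. Under that assumed layout, with the position left untouched, the packed struct comes to 16 bytes.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed "bloated" layout: everything stored as 32-bit floats.
struct FatGuiVertex {
    float position[2];
    float color[4];
    float texCoord[2];
};

// Packed layout: color as 4 normalized bytes, texture coordinates as
// 2 normalized unsigned shorts. In OpenGL these would be described with
// GL_UNSIGNED_BYTE / GL_UNSIGNED_SHORT and normalized set to GL_TRUE.
struct PackedGuiVertex {
    float position[2];
    std::uint8_t color[4];
    std::uint16_t texCoord[2];
};

static_assert(sizeof(FatGuiVertex) == 32, "assumed fat layout is 32 bytes");
static_assert(sizeof(PackedGuiVertex) == 16, "packed layout is 16 bytes");

// Converts a value in [0, 1] to an 8-bit unsigned normalized integer.
inline std::uint8_t toUnorm8(float v) {
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return static_cast<std::uint8_t>(std::lround(v * 255.0f));
}

// Converts a value in [0, 1] to a 16-bit unsigned normalized integer.
inline std::uint16_t toUnorm16(float v) {
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return static_cast<std::uint16_t>(std::lround(v * 65535.0f));
}
```

Unsigned normalized texture coordinates only cover the range 0 to 1, which is fine for a GUI atlas but not for coordinates that need to wrap or repeat.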
Not to mention, he has apparently written this batch manager for a GUI (as alluded to by his GuiVertex type). Yet GUIs are probably the least batch-friendly rendering scenario in game development. You're frequently having to change state like bound textures, the current program (which reads from the texture or not), blending modes, the scissor box, etc.
Now with modern GPUs, there are certainly ways to make GUI renderers a lot more batch-friendly. But he never talks about them. He doesn't mention techniques to use gl_ClipDistance as a way to do scissor boxes with per-vertex data. He doesn't talk about ubershader usage, nor does his vertex format provide an ID that would allow such a thing.
As previously stated, batching is all about not having state changes between objects. But he focuses entirely on vertex state. He doesn't talk about textures, programs, etc. He doesn't talk about techniques to allow multiple objects to be part of the same batch while still having separate transforms.
His class can't really be used for batching of anything that couldn't have just been a single mesh.

glUniform vs. single draw call performance

Suppose I have many meshes I'd like to render. I have two choices:
1. Bake transforms and colors for each mesh into a VBO and render with a single draw call.
2. Use glUniform for transforms and colors and use many draw calls (but still a single VBO).
Assuming the scene changes very little between frames, which method tends to be better?
There are more than those two choices. At least one more comes to mind:
3. Use attributes for transforms and colors and use many draw calls.
Choice 3 is similar to choice 2, but setting attributes (using calls like glVertexAttrib4f) is usually faster than setting uniforms. The efficiency of setting uniforms is highly platform dependent. But they're generally not intended to be modified very frequently. They are called uniform for a reason. :)
That being said, choice 1 might be the best for your use case where the transforms/colors change rarely. If you're not doing this yet, you could try keeping the attributes that are modified in a separate VBO (with usage GL_DYNAMIC_DRAW), and the attributes that remain constant in their own VBO (with usage GL_STATIC_DRAW). Then make the necessary updates to the dynamic buffer with glBufferSubData.
The reality is that there are no simple rules to predict what is going to perform best. It will depend on the size of your data and draw calls, how frequent and large the data changes are, and also very much on the platform you run on. If you want to be confident that you're using the most efficient solution, you need to implement all of them, and start benchmarking.
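For the dynamic-buffer suggestion above, one common piece of bookkeeping is to track which byte range actually changed, so that only that range is uploaded. The DirtyRange class below is a hypothetical helper written for illustration; it does no OpenGL work itself and only computes the offset and size you would pass to glBufferSubData.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Tracks the smallest byte range covering all modifications since the last upload.
class DirtyRange {
public:
    // Records that bytes [offset, offset + size) of the CPU-side copy changed.
    void mark(std::size_t offset, std::size_t size) {
        if (size == 0) return;
        if (empty()) {
            begin_ = offset;
            end_ = offset + size;
        } else {
            begin_ = std::min(begin_, offset);
            end_ = std::max(end_, offset + size);
        }
    }

    bool empty() const { return begin_ == end_; }
    std::size_t offset() const { return begin_; }
    std::size_t size() const { return end_ - begin_; }

    // Call after the upload, e.g. after
    // glBufferSubData(GL_ARRAY_BUFFER, offset(), size(), cpuCopy + offset()).
    void clear() { begin_ = end_ = 0; }

private:
    std::size_t begin_ = 0;
    std::size_t end_ = 0;
};
```

If two small edits are far apart, a single covering range uploads the untouched bytes between them as well; whether that beats two separate uploads is exactly the kind of thing that needs benchmarking on the target platform.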
Generally, option 1 (minimize number of draw calls) is the best advice. There are a couple of caveats:
I have seen performance fall off a cliff when using very large VBOs on at least one mobile device (I'm assuming mobile is relevant given the opengl-es tag). The explanation (from the vendor) involved internal buffers exceeding a certain size.
If putting all the information which would otherwise be conveyed with uniforms into vertex attributes significantly increases the size of the vertex buffer, the price you pay (in perhaps costly memory reads) of reading redundant information (because it doesn't really vary per vertex) might negate the savings of using fewer draw calls.
As always the best (but tiresome) advice is to test (I know this is particularly hard developing for mobile where there are many potential implementations your code could be running on). Try to keep your pipeline/toolchain flexible enough that you can easily try out and compare different options.

A triangle with 3 varyings of same value.. does GPU interpolate / waste performance?

I have a simple question about which I was unable to find solid facts: how does a GPU behave when all 3 vertices of a triangle have the same varying output from the vertex shader?
Does the GPU notice that case, or does it try to interpolate even when it's not needed?
This might be interesting, as there are quite a few cases where you want a more or less constant varying available per triangle in the fragment shader. Please don't just guess; try to bring up references, or at least reasons why you think it's one way or the other.
The GPU does the interpolation, no matter if it's needed or not.
The reason is quite simple: checking whether the varying variable has already been changed is very expensive.
Shaders are small programs that are executed concurrently on different GPU cores. To avoid having two different cores compute the same value, you would have to "reserve" the output variable, so you need an additional data structure (like a flag or mutex) that every core can read. In your case this would mean that three different cores have to read the same flag, and the first of them has to reserve it if it's not already reserved.
This has to happen atomically, meaning that the reserving core has to be the only one setting the flag at a time. To do this, all other cores would, for example, have to be stopped for a tick. As you don't know which cores are running the vertex shader, you would have to stop ALL other cores (on a GTX Titan this would be 2687 others).
Additionally, when the variable is set and a new frame is rendered, all the flags would have to be reset, so the race for the flag can begin again.
To conclude: you would need additional hardware in your GPU, that is expensive and slows down the rendering pipeline.
It is the programmer's job to avoid having multiple shaders produce the same output. So if you are doing your job right, this does not happen, or you know that avoiding it (on the CPU) would cost more than ignoring it.
An example would be the stitching between different levels of detail (like on a height map), where most methods create some fragments twice. This has a very small impact on rendering performance, but avoiding it would require a lot of CPU time.
If the behavior isn't mandated in the OpenGL specification, then the answer is that it's up to the implementation.
The comments and other answers are almost certainly spot on that there is no optimization path for identical values because there would be little to no benefit from the added complexity to make such a path.

Which is the most optimal and correct way to draw many different dynamic 3D models (they are animated and change every frame)

I need to know how I can render many different 3D models whose geometry changes every frame (they are animated models); no models or textures are repeated.
I load all the models and create an object of a "model" class for each one.
What is the most optimal way to render them?
Use one VBO for each 3D model
Use a single VBO for all models (since they are all different, I don't see how this option is possible)
I work with OpenGL 3.x or higher, C++ on Windows.
TL;DR - there's no silver bullet when it comes to rendering performance
Why is that? It comes down to the complicated process that gets your data, converts it, pushes it to the GPU and then makes pixels on the screen flicker. So, instead of "one best way", a few guidelines have emerged that usually improve performance.
Keep all the necessary data on the GPU (because the closer to the screen, the shorter the way electrons have to go :))
Send as little data to the GPU between frames as possible
Don't sync needlessly between CPU and GPU (that's like trying to run two high-speed trains on parallel tracks, but insisting on slowing them down to the point where you can pass something through the window every once in a while)
Now, it's obvious that if you want to have a model that will change, you can't have your cake and eat it. You have to make trade-offs. Simply put, dynamic objects will never render as fast as static ones. So, what should you do?
Hint the GPU about the data usage (GL_STREAM_DRAW or GL_DYNAMIC_DRAW) - that helps the driver choose a suitable memory arrangement, although it is only a hint.
Don't use interleaved buffers to mix static vertex attributes with dynamic ones - if you divide the memory, you can batch-update the geometry leaving texture coordinates intact, for example.
Try to do as much as you can purely on the GPU - with compute shaders and transform feedback, it might well be possible to store whole animation data as a buffer itself and calculate it on GPU, avoiding expensive syncs.
And last but not least, always carefully measure the impact of your change on performance. Going blindly won't help. Measure accurately and thoroughly (even stuff like shader compilation time might matter sometimes!). Then, even if you go by trial-and-error, there's a hope you'll get somewhere.
And to address one of your points in particular: whether it's one large VBO or a few smaller ones doesn't really matter, but a huge one might have problems fitting in memory. You can still update parts of it, and what matters most is the memory arrangement inside of it.
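To illustrate how different models can share one large VBO while still being updated separately, here is a minimal sub-allocator sketch. BufferRange and BufferSuballocator are hypothetical names, the 16-byte default alignment is an arbitrary assumption, and the class only hands out byte ranges; the actual buffer creation and the per-model partial updates would be done with the usual OpenGL buffer calls.

```cpp
#include <cassert>
#include <cstddef>

// A model's slice of the shared buffer, in bytes.
struct BufferRange {
    std::size_t offset;
    std::size_t size;
};

// Hands out consecutive, aligned ranges of one big buffer (no freeing).
class BufferSuballocator {
public:
    explicit BufferSuballocator(std::size_t capacity, std::size_t alignment = 16)
        : capacity_(capacity), alignment_(alignment) {
        assert(alignment > 0);
    }

    // Reserves size bytes; returns false if the buffer is too full.
    bool allocate(std::size_t size, BufferRange& out) {
        // Round the cursor up to the next multiple of the alignment.
        const std::size_t aligned =
            (cursor_ + alignment_ - 1) / alignment_ * alignment_;
        if (aligned + size > capacity_) return false;
        out.offset = aligned;
        out.size = size;
        cursor_ = aligned + size;
        return true;
    }

private:
    std::size_t capacity_;
    std::size_t alignment_;
    std::size_t cursor_ = 0;
};
```

Each animated model keeps its BufferRange and rewrites only its own slice every frame, so the choice between one VBO and many becomes mostly a question of how the memory inside is laid out.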

Which is a larger performance drain: quantity of vertices in one draw call, or quantity of calls?

I am quickly finding that one of the organisational considerations you must make when preparing rendering in OpenGL is the type of topology and the arrangement of vertices.
Now there are some interesting methods out there for organising vertices into very long arrays, with nice uses of interleaved arrays, indexes, etc, so that you can pour a lot of geometry into one OpenGL call.
But it's much easier in some cases to simply iterate and perform multiple calls with smaller vertex arrays.
While I agree with the notion that premature optimization is somewhat wasteful, just how important a consideration should it be to minimize OpenGL calls, especially if multiple calls would actually involve far fewer vertices per call?
I can see that this is one of those decisions that is important early in the development process, since it shapes a lot of the structure of how vertices get created and organized.
There is an overhead for each command you send down to the GPU. By batching the vertices you minimize that overhead and also allow the driver to make small optimizations in your data before sending it to the hardware. It can make quite a difference, and it is one reason glBegin and glEnd were removed from the core profile of newer OpenGL versions.
You should try to avoid making many driver state changes and many drawing calls.
EDIT: Consider using degenerate vertices in your triangle strips (strips also help minimize the number of vertices processed) so that you can use just one drawing call to render all your topology (unless you need to change some driver state between parts of the topology).
You can find a balance for your specific needs. But the thing is that there are many variables in the equation, and there's no simple solution (like "always make the scene one big single batch!"). TraxNet gave you good advice though: always try to minimize API calls (whether drawing or state changes). But it doesn't have to be just a few calls. On a modern PC it could be thousands per frame; on a not-so-modern mobile phone, maybe just a few hundred.
Also, TraxNet mentioned degenerate triangles (which help form strips). Although they are still triangles (and add to the total triangle count rendered), they cost almost nothing while helping to minimize the number of draw calls.
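The degenerate-triangle trick mentioned above can be sketched as follows. The function names joinStrips and countVisibleTriangles are made up for this example. Two index strips are joined by repeating the last index of the first strip and the first index of the second; when the first strip has an odd number of indices, one extra repeat keeps the winding order of the second strip unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Index = std::uint32_t;

// Joins two triangle strips into one by inserting degenerate (zero-area) triangles.
std::vector<Index> joinStrips(const std::vector<Index>& a, const std::vector<Index>& b) {
    assert(a.size() >= 3 && b.size() >= 3);
    std::vector<Index> out(a);
    out.push_back(a.back());      // repeat the last index of the first strip
    if (a.size() % 2 != 0) {
        out.push_back(a.back());  // extra repeat so b starts on an even position
    }
    out.push_back(b.front());     // repeat the first index of the second strip
    out.insert(out.end(), b.begin(), b.end());
    return out;
}

// Counts strip triangles whose three indices are all different,
// i.e. the ones that are not degenerate.
std::size_t countVisibleTriangles(const std::vector<Index>& strip) {
    std::size_t count = 0;
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        if (strip[i] != strip[i + 1] && strip[i + 1] != strip[i + 2] &&
            strip[i] != strip[i + 2]) {
            ++count;
        }
    }
    return count;
}
```

Joining the strips 0 1 2 3 and 4 5 6 7 gives 0 1 2 3 3 4 4 5 6 7: ten indices and eight triangles, of which four are degenerate and only the original four are visible, all in a single draw call.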