When does it make sense to turn off the rasterization step? - c++

In Vulkan there is a struct required for pipeline creation named VkPipelineRasterizationStateCreateInfo. In this struct there is a member named rasterizerDiscardEnable. If this member is set to VK_TRUE, then all primitives are discarded before the rasterization step, which disables any output to the framebuffer.
I cannot think of a scenario where this might make any sense. In which cases could it be useful?

It would be for any case where you're executing the rendering pipeline solely for the side effects of the vertex processing stage(s). For example, you could use a GS to feed data into a buffer, which you later render from.
Now in many cases you could use a compute shader to do something similar. But you can't use a CS to efficiently implement tessellation; that's best done by the hardware tessellator. So if you want to capture data generated by tessellation (presumably because you'll be rendering with it multiple times), you have to use a rendering process.
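As a rough illustration, here is what the relevant Vulkan state might look like when building such a vertex-processing-only pipeline; everything outside the rasterization struct (shader stages, pipeline layout, render pass, and so on) is assumed to be set up elsewhere, and the other field values shown are just one reasonable choice:
VkPipelineRasterizationStateCreateInfo raster{};
raster.sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_CREATE_INFO;
raster.rasterizerDiscardEnable = VK_TRUE;   // drop every primitive before rasterization
raster.polygonMode = VK_POLYGON_MODE_FILL;
raster.cullMode = VK_CULL_MODE_NONE;
raster.lineWidth = 1.0f;

VkGraphicsPipelineCreateInfo pipelineInfo{};
pipelineInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
pipelineInfo.pRasterizationState = &raster;
// With rasterization disabled, the multisample, depth/stencil and color blend
// state pointers are ignored, since no fragments are ever generated.
// ... fill in stages, vertex input, layout, renderPass, etc. as usual ...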

A useful side effect (though not necessarily the intended use case) of this parameter is benchmarking and bottleneck hunting in your Vulkan application: if discarding all primitives before the rasterization stage (and thus before any fragment shaders ever execute) does not improve your frame rate, you can rule out your application being fragment-stage-bound.


What is GPU driven rendering? [closed]

Nowadays I'm hearing from different places about so-called GPU-driven rendering, a new rendering paradigm that doesn't require draw calls at all and that is supported by the newer versions of the OpenGL and Vulkan APIs. Can someone explain how it actually works on a conceptual level, and what the main differences are from the traditional approach?
Overview
In order to render a scene, a number of things have to happen. You need to walk your scene graph to figure out which objects exist. For each object which exists, you now need to determine if it is visible. For each object which is visible, you need to figure out where its geometry is stored, which textures and buffers will be used to render that object, which shaders to use to render the object, and so forth. Then you render that object.
The "traditional" method handling this is for the CPU to handle this process. The scene graph lives in CPU-accessible memory. The CPU does visibility culling on that scene graph. The CPU takes the visible objects and access some CPU data about the geometry (OpenGL buffer object and texture names, Vulkan descriptor sets and VkBuffers, etc), shaders, etc, transferring this as state data to the GPU. Then the CPU issues a GPU command to render that object with that state.
Now, if we go back farther, the most "traditional" method doesn't involve a GPU at all. The CPU would just take this mesh and texture data, do vertex transformations, rasterization, and so forth, producing an image in CPU memory. However, we started off-loading some of this to a separate processor. We started with the rasterization stuff (the earliest graphics chips were just rasterizers; the CPU did all the vertex T&L). Then we incorporated the vertex transformations into the GPU. When we did that, we started having to store vertex data in GPU-accessible memory so the GPU could read it on its own time.
We did all of that off-loading to a separate processor for two reasons: the GPU was (much) faster at it, and the CPU could then spend its time doing something else.
GPU driven rendering is just the next stage in that process. We went from no GPU, to rasterization GPU, to vertex GPU, and now to scene-graph-level GPU. The "traditional" method offloads how to render to the GPU; GPU driven rendering offloads the decision of what to render.
Mechanism
Now, the reason we haven't been doing this all along is because the basic rendering commands all take data that comes from the CPU. glDrawArrays/Elements takes a number of parameters from the CPU. So even if we used the GPU to generate that data, we would need a full GPU/CPU synchronization so that the CPU could read the data... and give it right back to the GPU.
That's not helpful.
OpenGL 4 gave us indirect rendering of various forms. The basic idea is that, instead of taking those parameters from a function call, they're just data stored in GPU memory. The CPU still has to make a function call to start the rendering operation, but the actual parameters to that call are just data stored in GPU memory.
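For the indexed OpenGL form, each sub-draw is described by a small command structure that the GPU reads out of the buffer bound to GL_DRAW_INDIRECT_BUFFER. A minimal sketch follows; indirectBuffer and drawCount are placeholders for your own buffer object and draw count:
struct DrawElementsIndirectCommand
{
    GLuint count;          // number of indices in this sub-draw
    GLuint instanceCount;  // setting this to 0 effectively skips the sub-draw
    GLuint firstIndex;     // offset into the bound element array buffer
    GLint  baseVertex;     // added to every index
    GLuint baseInstance;
};

glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);   // the commands live in GPU memory
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                            nullptr,     // byte offset 0 into the indirect buffer
                            drawCount,   // number of sub-draws
                            0);          // commands are tightly packed
The CPU never reads those commands back; it only tells the GPU to execute whatever is currently sitting in that buffer.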
The other half of that requires the ability of the GPU to write data to GPU memory in a format that indirect rendering can read. Historically, data on GPUs goes in one direction: data gets read for the purpose of being converted into pixels in a render target. We need a way to generate semi-arbitrary data from other arbitrary data, all on the GPU.
The older mechanism for this was to (ab)use transform feedback for this purpose, but nowadays we just use SSBOs or failing that, image load/store. Compute shaders help here as well, since they are designed to be outside of the standard rendering pipeline and therefore are not bound to its limitations.
The ideal form of GPU-driven rendering makes the scene graph part of the rendering operation. There are lesser forms, such as having the GPU do nothing more than per-object viewport culling. But let's look at the most ideal process. From the perspective of the CPU, this looks like the three steps below (a rough code sketch follows the list):
1. Update the scene graph in GPU memory.
2. Issue one or more compute shaders that generate multi-draw indirect rendering commands.
3. Issue a single multi-draw indirect call that draws everything.
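In OpenGL terms, that per-frame CPU work might look roughly like this sketch; cullProgram, uberShader, sceneGraphSSBO, indirectBuffer, perObjectBuffer and maxObjects are all placeholders for your own resources, and culled objects are assumed to be written with instanceCount = 0 rather than removed from the command list:
glNamedBufferSubData(sceneGraphSSBO, 0, dirtyBytes, dirtyData);    // 1. update scene data

glUseProgram(cullProgram);                                         // 2. GPU builds the draw list
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, sceneGraphSSBO);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, indirectBuffer);     // written as an SSBO here
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, perObjectBuffer);    // matrices, material indices, ...
glDispatchCompute((maxObjects + 63) / 64, 1, 1);                   // assuming local_size_x = 64

// Make the compute shader's writes visible to indirect-command reads and shader reads.
glMemoryBarrier(GL_COMMAND_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT);

glUseProgram(uberShader);                                          // 3. one call draws everything
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, maxObjects, 0);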
Now of course, there's no such thing as a free lunch. Doing full scene graph processing on the GPU requires building your scene graph in a way that is efficient for GPU processing. Even more importantly, visibility culling mechanisms have to be engineered with efficient GPU processing in mind. That's complexity I'm not going to address here.
Implementation
Instead, let's look at the nuts-and-bolts of making the drawing part work. We have to sort out a lot of things here.
See, the indirect rendering command is still a regular old rendering command. While the multi-draw form draws multiple distinct "objects", it's still one CPU rendering command. This means that, for the duration of this command, all rendering state is fixed.
So everything under the purview of this multi-draw operation must use the same shader, bound buffers & textures, blending parameters, stencil state, and so forth. This makes implementing a GPU-driven rendering operation a bit complicated.
State and Shaders
If you need blending, or similar state-based differences in rendering operations, then you are going to have to issue another rendering command. So in the blending case, your scene-graph processing is going to have to compute multiple sets of rendering commands, with each set being for a specific set of blending modes. You may also need to have this system sort transparent objects (unless you're rendering them with an OIT mechanism). So instead of having just one rendering command, you have a small number of them.
The point of this exercise, however, isn't to have only one rendering command; the point is that the number of CPU rendering commands does not change with regard to how much stuff you're rendering. It shouldn't matter how many objects are in the scene; the CPU will be issuing the same number of rendering commands.
When it comes to shaders, this technique requires some degree of "ubershader" style, where you have a small number of rather flexible shaders. You want to parameterize your shader rather than having dozens or hundreds of them.
However, things were probably going to fall out that way anyway, particularly with regard to deferred rendering. The geometry pass of deferred renderers tends to use the same kind of processing, since it's just doing vertex transformation and extracting material parameters. The biggest difference usually is between skinned and non-skinned rendering, but that's really only two shader variations, which you can handle similarly to the blending case.
Speaking of deferred rendering, the GPU-driven process can also walk the graph of lights, generating the draw calls and rendering data for the lighting passes. So while the lighting pass will need its own call, it will still only be a single multi-draw call regardless of the number of lights.
Buffers
Here's where things start to get interesting. See, if the GPU is processing the scene graph, that means that the GPU needs to somehow associate a particular draw within the multi-draw command with the resources that particular draw needs. It may also need to put the data into those resources, like the matrix transforms for a given object and so forth.
Oh, and you also somehow need to tie the vertex input data to a particular sub-draw.
That last part is probably the most complicated. The buffers which OpenGL/Vulkan's standard vertex input method pulls from are state data; they cannot change between sub-draws of a multi-draw operation.
Your best bet is to try to put every object's data in the same buffer object, using the same vertex format. Essentially, you have one gigantic array of vertex data. You can then use the drawing parameters for the sub-draw to select which parts of the buffer(s) to use.
But what do we do about per-object data (matrices, etc), things you would typically use a UBO or global uniform for? How do you effectively change the buffer binding state within a CPU rendering command?
Well... you can't. So you cheat.
First, you realize that SSBOs can be arbitrarily large. So you don't really need to change buffer binding state. What you need is a single SSBO that contains everyone's per-object data. For each vertex, the VS simply needs to pick out the correct data for that sub-draw from the giant list of data.
This is done via a special vertex shader input: gl_DrawID. When you issue a multi-draw command, the VS gets an input value that represents the index of this sub-draw operation within the multidraw command. So you can use gl_DrawID to index into a table of per-object data to fetch the appropriate data for that particular object.
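In GLSL, that lookup is just an array index. The PerObject layout, binding points, and the viewProj uniform in the vertex-shader sketch below are made up for illustration:
#version 460
struct PerObject { mat4 model; uint materialIndex; uint pad0, pad1, pad2; };
layout(std430, binding = 2) readonly buffer PerObjectData { PerObject objects[]; };
layout(location = 0) in vec3 position;
layout(location = 0) flat out uint materialIndex;
uniform mat4 viewProj;
void main()
{
    PerObject obj = objects[gl_DrawID];   // pick out this sub-draw's data
    gl_Position = viewProj * obj.model * vec4(position, 1.0);
    materialIndex = obj.materialIndex;    // handed to the fragment shader for texturing
}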
This also means that the compute shader which generates this sub-draw also needs to use the index of that sub-draw to define where in the array to put the per-object data for that sub-draw. So the CS that writes a sub-draw also needs to be responsible for setting up the per-object data that matches the sub-draw.
Textures
OpenGL and Vulkan have pretty strict limits on the number of textures that can be bound. Well actually those limits are quite large relative to traditional rendering, but in GPU driven rendering land, we need a single CPU rendering call to potentially access any texture. That's harder.
Now, we do have gl_DrawID; coupled with the table mentioned above, we can retrieve per-object data. So: how do we convert this to a texture?
There are multiple ways. We could put a bunch of our 2D textures into an array texture. We can then use gl_DrawID to fetch an array index from our SSBO of per-object data; that array index becomes the array layer we use to fetch "our" texture. Note that we don't use gl_DrawID directly because multiple different sub-draws could use the same texture, and because the GPU code that sets up the array of draw calls does not control the order in which textures appear in our array.
Array textures have obvious downsides, the most notable of which is that we must respect the limitations of an array texture. All elements in the array must use the same image format. They must all be of the same size. Also, there are limits on the number of array layers in an array texture, so you might encounter them.
The alternatives to array textures differ along API lines, though they basically boil down to the same thing: convert a number into a texture.
In OpenGL land, you can employ bindless texturing (for hardware that supports it). This system provides a mechanism that allows one to generate a 64-bit integer handle which represents a particular texture, pass this handle to the GPU (since it is just an integer, use whatever mechanism you want), and then convert this 64-bit handle into a sampler type. So you use gl_DrawID to fetch a 64-bit handle from the per-object data, then convert that into a sampler of the appropriate type and use it.
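On the C++ side this might look like the sketch below; perObjectSSBO, objectIndex, perObjectStride and handleOffset are placeholders for however you lay out your per-object table:
GLuint64 handle = glGetTextureHandleARB(texture);   // texture: an existing texture object
glMakeTextureHandleResidentARB(handle);             // handles must be resident before shaders use them
// Store the 64-bit handle in this object's slot of the per-object SSBO.
glNamedBufferSubData(perObjectSSBO,
                     objectIndex * perObjectStride + handleOffset,
                     sizeof(GLuint64), &handle);
// In the shader (with GL_ARB_bindless_texture enabled), a handle stored as a uvec2
// converts directly to a sampler:  sampler2D tex = sampler2D(objects[gl_DrawID].textureHandle);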
In Vulkan land, you can employ sampler arrays (for hardware that supports it). Note that these are not array textures; in GLSL, these are sampler types which are arrayed: uniform sampler2D my_2D_textures[6000];. In OpenGL, this would be a compile error because each array element represents a distinct bind point for a texture, and you cannot have 6000 distinct bind points. In Vulkan, an arrayed sampler only represents a single descriptor, no matter how many elements are in that array. Vulkan implementations have limits on how many elements there can be in such arrays, but hardware that supports the feature you need to employ this (shaderSampledImageArrayDynamicIndexing) will typically offer a generous limit.
So your shader uses gl_DrawID to get an index from the per-object data. The index is turned into a sampler by just fetching the value from the sampler array. The only limitation for textures in that arrayed descriptor is that they must all be of the same type and basic data format (floating-point 2D for sampler2D, unsigned integer cubemap for usamplerCube, etc). The specifics of formats, texture sizes, mipmap counts, and the like are all irrelevant.
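Setting that up on the Vulkan side might look like the following sketch; the array size of 4096 and the fragment-only stage flag are illustrative, and the real limit to check is maxPerStageDescriptorSampledImages (plus the shaderSampledImageArrayDynamicIndexing feature mentioned above):
VkDescriptorSetLayoutBinding textures{};
textures.binding = 0;
textures.descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
textures.descriptorCount = 4096;                    // one descriptor binding, 4096 array elements
textures.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT;

VkDescriptorSetLayoutCreateInfo layoutInfo{};
layoutInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutInfo.bindingCount = 1;
layoutInfo.pBindings = &textures;

VkDescriptorSetLayout layout;
vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &layout);
// Matching GLSL: layout(set = 0, binding = 0) uniform sampler2D textures[4096];
// which the shader indexes with the value it fetched via gl_DrawID.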
And if you're concerned about the cost difference of Vulkan's array of samplers compared to OpenGL's bindless, don't be; implementations of bindless are just doing this behind your back anyway.

Performance gain of glColorMask()/glDepthMask() on modern hardware?

In my application I have some shaders which write only the depth buffer, to use it later for shadowing. I also have some other shaders which render a fullscreen quad whose depth will not affect subsequent draw calls, so its depth values may be thrown away.
Assuming the application runs on modern hardware (produced within the last five years), will I gain any additional performance if I disable color buffer writing (glColorMask with everything set to GL_FALSE) for shadow map shaders, and depth buffer writing (with glDepthMask()) for fullscreen quad shaders?
In other words, do these functions really disable some memory operations or they just alter some mask bits which are used in fixed bitwise-operations logic in this part of rendering pipeline?
And the same question about testing. If I know beforehand that all fragments will pass depth test, will disabling depth test improve performance?
My FPS measurements don't show any significant difference, but the result may be different on another machine.
Finally, if rendering runs faster with depth/color test/write disabled, how much faster does it run? Wouldn't this performance gain be negated by GL function call overhead?
Your question is missing a very important thing: you have to do something.
Every fragment has color and depth values. Even if your FS doesn't generate a value, there will still be a value there. Therefore, every fragment produced that is not discarded will write these values, so long as:
1. The color is routed to a color buffer via glDrawBuffers.
2. There is an appropriate color/depth buffer attached to the FBO.
3. The color/depth write mask allows it to be written.
So if you're rendering and you don't want to write one of those colors or to the depth buffer, you've got to change one of these. Changing #1 or #2 is an FBO state change, which is among the most heavyweight operations you can do in OpenGL. Therefore, your choices are to make an FBO change or to change the write mask. The latter will always be the more performance-friendly operation.
Maybe in your case, your application doesn't stress the GPU or CPU enough for such a change to matter. But in general, changing write masks is a better idea than playing with the FBO.
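As a sketch of the write-mask approach (shadowFBO, drawShadowCasters and drawFullscreenQuad are placeholders for your own resources and draw code):
glBindFramebuffer(GL_FRAMEBUFFER, shadowFBO);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);   // depth-only shadow pass
drawShadowCasters();
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

glDepthMask(GL_FALSE);                                 // fullscreen quad: its depth is thrown away
drawFullscreenQuad();
glDepthMask(GL_TRUE);
(A shadow FBO with no color attachment at all achieves the same thing as the color mask, but swapping that in and out is exactly the heavyweight FBO change described above.)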
If I know beforehand that all fragments will pass depth test, will disabling depth test improve performance?
Are you changing other state at the same time, or is that the only state you're interested in?
One good way to look at these kinds of a priori performance questions is to look at Vulkan or D3D12 and see what it would require in that API. Changing any pipeline state there is a big deal. But changing two pieces of state is no bigger of a deal than one.
So if changing the depth test correlates with changing other state (blend modes, shaders, etc), it's probably not going to hurt any more than those changes already do.
At the same time, if you really care enough about performance for this sort of thing to matter, you should do application testing. And that should happen after you implement this, and across all hardware of interest. And your code should be flexible enough to easily switch from one to the other as needed.

Will updating a uniform value stall the whole rendering pipeline?

The glBufferSubData manpage's notes section contains the following paragraph:
Consider using multiple buffer objects to avoid stalling the rendering pipeline during data store updates. If any rendering in the pipeline makes reference to data in the buffer object being updated by glBufferSubData, especially from the specific region being updated, that rendering must drain from the pipeline before the data store can be updated.
While the glUniform* manpage doesn't mention the pipeline at all.
However, I would have thought that uniforms are just as important as buffers, given that they're supposed to be uniform across all shader invocations.
So, if I perform a draw call, change a uniform value and then perform another draw call on the same shader, will both draw calls run concurrently with different uniform values, or will the second draw call have to wait until every stage (vert/geom/frag) is complete on the first one?
The question in its general form is pretty much unanswerable. However consider this:
Since the advent of GLSL, and ARB's assembly language before that, uniform/parameter state has always been stored in the shader object. Only since uniform blocks and buffer objects has it been possible to separate uniform state from programs. So until that point, a good 5+ years, the only way to change a uniform was to change it in the program.
This means that pretty much every program that uses GLSL uses it in the standard way: bind a program, change uniforms, render, change uniforms, render, etc.
Now, imagine if doing this simple and obvious thing which hundreds of OpenGL programs did induced a full pipeline stall.
Driver developers are not stupid; even Intel's driver developers aren't that stupid. Whatever their hardware looks like, they can find a way to make uniform changes not induce a pipeline stall.
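That "standard way" is nothing more exotic than the following sketch (modelLoc, objectAMatrix/objectBMatrix and the index counts are placeholders); the point of the answer is only that drivers are expected to handle this pattern without draining the pipeline between the two draws:
glUseProgram(program);
glUniformMatrix4fv(modelLoc, 1, GL_FALSE, objectAMatrix);   // 16 floats for object A
glDrawElements(GL_TRIANGLES, countA, GL_UNSIGNED_INT, nullptr);
glUniformMatrix4fv(modelLoc, 1, GL_FALSE, objectBMatrix);   // changed between draws
glDrawElements(GL_TRIANGLES, countB, GL_UNSIGNED_INT, nullptr);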

How exactly is GLSL's "coherent" memory qualifier interpreted by GPU drivers for multi-pass rendering?

The GLSL specification states, for the "coherent" memory qualifier: "memory variable where reads and writes are coherent with reads and writes from other shader invocations".
In practice, I'm unsure how this is interpreted by modern-day GPU drivers with regards to multiple rendering passes. When the GLSL spec states "other shader invocations", does that refer to shader invocations running only during the current pass, or any possible shader invocations in past or future passes? For my purposes, I define a pass as a "glBindFramebuffer-glViewport-glUseProgram-glDrawABC-glDrawXYZ-glDraw123" cycle; I'm currently executing 2 such passes per "render loop iteration" but may have more per iteration later on.
When the GLSL spec states "other shader invocations", does that refer to shader invocations running only during the current pass, or any possible shader invocations in past or future passes?
It means exactly what it says: "other shader invocations". It could be the same program code. It could be different code. It doesn't matter: shader invocations that aren't this one.
Normally, OpenGL handles synchronization for you, because OpenGL can track this fairly easily. If you map a range of a buffer object, modify it, and unmap it, OpenGL knows how much stuff you've (potentially) changed. If you use glTexSubImage2D, it knows how much stuff you changed. If you do transform feedback, it can know exactly how much data was written to the buffer.
If you do transform feedback into a buffer, then bind it as a source for vertex data, OpenGL knows that this will stall the pipeline. That it must wait until the transform feedback has completed, and then clear some caches in order to use the buffer for vertex data.
When you're dealing with image load/store, you lose all of this. Because so much could be written in a completely random, unknown, and unknowable fashion, OpenGL generally plays really loose with the rules in order to allow you flexibility to get the maximum possible performance. This triggers a lot of undefined behavior.
In general, the only rules you can follow are those outlined in section 2.11.13 of the OpenGL 4.2 specification. The biggest one (for shader-to-shader talk) is the rule on stages. If you're in a fragment shader, it is safe to assume that the vertex shader(s) that specifically computed the point/line/triangle for your fragment have completed. Therefore, you can freely load values that were stored by them. But only from the ones that made you.
Your shaders cannot make assumptions that shaders executed in previous rendering commands have completed (I know that sounds odd, given what was just said, but remember: "only from the ones that made you"). Your shaders cannot make assumptions that other invocations of the same shader, using the same uniforms, images, textures, etc, in the same rendering command have completed, except where the above applies.
The only thing you can assume is that writes your shader instance itself made are visible... to itself. So if you do an imageStore and do an imageLoad to the same memory location through the same variable, then you are guaranteed to get the same value back.
Well, unless someone else wrote to it in the meantime.
Your shaders cannot assume that a later rendering command will certainly fetch values written (via image store or atomic updates) by a previous one. No matter how much later! It doesn't matter what you've bound to the context. It doesn't matter what you've uploaded or downloaded (technically. Odds are you'll get correct behavior in some cases, but undefined behavior is still undefined).
If you need that guarantee, if you need to issue a rendering command that will fetch values written by image store/atomic updates, you must explicitly synchronize memory sometime after issuing the writing call and before issuing the reading call. This is done with glMemoryBarrier.
Therefore, if you render something that does image storing, you cannot render something that uses the stored data until an appropriate barrier has been sent (either explicitly in the shader or explicitly in OpenGL code). This could be an image load operation. But it could be rendering from a buffer object written by shader code. It could be a texture fetch. It could be doing blending to an image attached to the FBO. It doesn't matter; you can't do it.
Note that this applies to all operations that deal with image load/store/atomic, not just shader operations. So if you use image store to write to an image, you won't necessarily read the right data unless you use a GL_TEXTURE_UPDATE_BARRIER_BIT barrier.
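A minimal sketch of the write-then-read case, assuming a first pass that writes via imageStore and a second pass that samples the result with texture(); the program, texture and draw-helper names are placeholders:
glUseProgram(writerProgram);                       // fragment shader does imageStore into 'image'
glBindImageTexture(0, image, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
drawPassOne();

glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);     // image stores now visible to texture fetches

glUseProgram(readerProgram);                       // samples 'image' with texture()
glBindTextureUnit(0, image);
drawPassTwo();
If the second pass read via imageLoad instead, the barrier bit would be GL_SHADER_IMAGE_ACCESS_BARRIER_BIT; if the image were then read back with glGetTexImage or updated with glTexSubImage, it would be GL_TEXTURE_UPDATE_BARRIER_BIT.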

Should I call glEnable and glDisable every time I draw something?

How often should I call OpenGL functions like glEnable() or glEnableClientState() and their corresponding glDisable counterparts? Are they meant to be called once at the beginning of the application, or should I keep them disabled and only enable those features I immediately need for drawing something? Is there a performance difference?
Warning: glPushAttrib / glPopAttrib are DEPRECATED in the modern OpenGL programmable pipeline.
If you find that you are checking the value of state variables often and subsequently calling glEnable/glDisable you may be able to clean things up a bit by using the attribute stack (glPushAttrib / glPopAttrib).
The attribute stack allows you to isolate areas of your code so that attribute changes in one section do not affect the attribute state in other sections.
void drawObject1() {
    glPushAttrib(GL_ENABLE_BIT);   // save the current enable/disable state
    glEnable(GL_DEPTH_TEST);
    glEnable(GL_LIGHTING);
    /* Isolated Region 1 */
    glPopAttrib();                 // restore the saved enable/disable state
}

void drawObject2() {
    glPushAttrib(GL_ENABLE_BIT);
    glEnable(GL_FOG);
    glEnable(GL_POINT_SMOOTH);
    /* Isolated Region 2 */
    glPopAttrib();
}

void drawScene() {
    drawObject1();
    drawObject2();
}
Although GL_LIGHTING and GL_DEPTH_TEST are set in drawObject1, their state does not carry over into drawObject2. In the absence of glPushAttrib this would not be the case. Also note that there is no need to call glDisable at the end of each function; glPopAttrib does the job.
As far as performance goes, the overhead of individual calls to glEnable/glDisable is minimal. If you need to handle lots of state you will probably need to create your own state manager or make numerous calls to glGetInteger... and then act accordingly. The added machinery and control flow could make the code less transparent, harder to debug, and more difficult to maintain. These issues may make other, more fruitful optimizations more difficult.
The attribute stack can aid in maintaining layers of abstraction and in creating regions of isolation.
glPushAttrib manpage
"That depends".
If your entire app only uses one combination of enable/disable states, then by all means just set it up at the beginning and go.
Most real-world apps need to mix, and then you're forced to call glEnable() to enable some particular state(s), do the draw calls, then glDisable() them again when you're done to "clear the stage".
State-sorting, state-tracking, and many optimization schemes stem from this, as state switching is sometimes expensive.
First of all, which OpenGL version do you use? And which generation of graphics hardware does your target group have? Knowing this would make it easier to give a more correct answer. My answer assumes OpenGL 2.1.
OpenGL is a state machine, meaning that whenever a state is changed, that state is made "current" until changed again explicitly by the programmer with a new OpenGL API call. Exceptions to this rule exist, like client state array calls making the current vertex color undefined. But those are the exceptions which define the rule.
"once at the beginning of the application" doesn't make much sense, because there are times you need to destroy your OpenGL context while the application is still running. I assume you mean just after every window creation. That works for state you don't need to change later. Example: If all your draw calls use the same vertex array data, you don't need to disable them with glDisableClientState afterwards.
There is a lot of enable/disable state associated with the old fixed-function pipeline. The easy redemption for this is: Use shaders! If you target a generation of cards at most five years old, it probably mimics the fixed-function pipeline with shaders anyway. By using shaders you are in more or less total control of what's happening during the transform and rasterization stages, and you can make your own "states" with uniforms, which are very cheap to change/update.
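For instance, instead of toggling fixed-function fog with glEnable/glDisable, the fragment shader can read a uniform flag; fogEnabledLoc and the draw helpers here are hypothetical:
glUseProgram(program);
glUniform1i(fogEnabledLoc, 1);   // "enable fog" for the following draws
drawFoggyObjects();
glUniform1i(fogEnabledLoc, 0);   // "disable" it again; just a cheap uniform update
drawOtherObjects();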
Knowing that OpenGL is a state machine, as I said above, should make it clear that one should strive to keep state changes to a minimum, as long as that's possible. However, there are most probably other things which impact performance much more than enable/disable state calls. If you want to know about them, read on.
The expense of state not associated with the old fixed-function state calls, and which is not simple enable/disable state, can differ widely in cost. Notably, linking shaders and binding names ("names" of textures, programs, buffer objects) are usually fairly expensive. This is why a lot of games and applications used to sort the draw order of their meshes according to the texture; that way, they didn't have to bind the same texture twice. Nowadays, the same applies to shader programs: you don't want to bind the same shader program twice if you don't have to. Also, not all the features in a particular OpenGL version are hardware accelerated on all cards, even if the vendors of those cards claim they are OpenGL compliant. Being compliant means that they follow the specification, not that they necessarily run all the features efficiently. Some of the functions like glHistogram and glMinmax from GL_ARB_imaging should be remembered in this regard.
Conclusion: Unless there is an obvious reason not to, use shaders! It saves you from a lot of unnecessary state calls since you can use uniforms instead. OpenGL shaders have only been around for about six years, you know. Also, the overhead of enable/disable state changes can be an issue, but usually there is a lot more to gain from optimizing other, more expensive state changes, like glUseProgram, glCompileShader, glLinkProgram, glBindBuffer and glBindTexture.
P.S: OpenGL 3.0 removed the client state enable/disable calls. They are implicitly enabled as draw arrays is the only way to draw in this version. Immediate mode was removed. gl..Pointer calls were removed as well, since one really just needs glVertexAttribPointer.
A rule of thumb that I was taught said that it's almost always cheaper to just enable/disable at will rather than checking the current state and changing only if needed.
That said, Marc's answer is something that should definitely work.