Z-fighting after depth prepass on GTX 980 - opengl

I'm implementing a depth prepass in OpenGL. On an Intel HD Graphics 5500 this code works fine, but on an NVIDIA GeForce GTX 980 it doesn't (the image below shows the resulting z-fighting). I'm using the following code to generate the image. (Everything irrelevant to the problem is omitted.)
// ----------------------------------------------------------------------------
// Depth Prepass
// ----------------------------------------------------------------------------
glUseProgram(program1); // The problem turned out to be here!
// ----------------------------------------------------------------------------
// Scene Rendering
// ----------------------------------------------------------------------------
glUseProgram(program2); // The problem turned out to be here!
It seems like the glDepthFunc is not changed to GL_LEQUAL. However, when I step through the GL calls in RenderDoc, glDepthFunc is set correctly.
Does this sound like a driver bug, or do you have suggestions for what I could be doing wrong? If this is a driver bug, how can I implement a depth prepass anyway?

When using a different shader program for the depth prepass, you must explicitly ensure that this program generates the same depth values (for the same geometry) as the program for the main pass. This is done with the invariant qualifier on gl_Position.
Variance explained by the GLSL Specification 4.4:
In this section, variance refers to the possibility of getting different values from the same expression in
different programs. For example, say two vertex shaders, in different programs, each set gl_Position with
the same expression in both shaders, and the input values into that expression are the same when both
shaders run. It is possible, due to independent compilation of the two shaders, that the values assigned to
gl_Position are not exactly the same when the two shaders run. In this example, this can cause problems
with alignment of geometry in a multi-pass algorithm.
The qualifier is used as follows in this case:
invariant gl_Position;
This line guarantees that gl_Position is computed exactly as the expression is written in the shader, without optimizations that would reorder the operations and thereby quite likely change the result in some minor way.
In my concrete case, an assignment was the source of the problem. The Vertex Shader of the program for the main pass contained the following lines:
fWorldPosition = ModelMatrix*vPosition; // World position to the fragment shader
gl_Position = ProjectionMatrix*ViewMatrix*fWorldPosition;
The Vertex Shader of the program for the prepass computed gl_Position in one expression:
gl_Position = ProjectionMatrix*ViewMatrix*ModelMatrix*vPosition;
By changing this into:
vec4 worldPosition = ModelMatrix*vPosition;
gl_Position = ProjectionMatrix*ViewMatrix*worldPosition;
I solved the problem.
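To see why the grouping of the matrix products matters, here is a minimal numerical sketch in Python rather than GLSL. It emulates single-precision arithmetic with a hypothetical helper f32 and compares the two groupings on deliberately extreme 2x2 values; the matrices A, B and the vector v are made up for illustration and are not the matrices from the question.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot(a, b):
    """Dot product with every intermediate result rounded to float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(x * y))
    return acc

def mat_vec(m, v):
    return [dot(row, v) for row in m]

def mat_mat(a, b):
    cols = list(zip(*b))
    return [[dot(row, col) for col in cols] for row in a]

# Illustrative values only: A plays the role of ProjectionMatrix*ViewMatrix,
# B the role of ModelMatrix, v the role of vPosition.
A = [[1.0, 1.0], [0.0, 1.0]]
B = [[1e8, 1.0], [-1e8, 0.0]]
v = [1.0, 1.0]

one_expression = mat_vec(mat_mat(A, B), v)   # (A*B)*v, like the prepass shader
via_world_pos = mat_vec(A, mat_vec(B, v))    # A*(B*v), like the main-pass shader

print(one_expression, via_world_pos)
```

Mathematically both groupings give the same vector, but with float32 rounding the first component differs here. Real transformation matrices produce much smaller differences, yet any difference is enough to break a depth test that expects identical depth values in both passes.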

Some of the versions of Sponza are huge. I remember I fixed that issue with one of two solutions:
24bit depth buffer.
Logarithmic depth buffer.
The second approach is inferior on tiled architectures, which leverage the early depth test for hidden surface removal. On mobile platforms the performance degradation can be quite noticeable.


Using 'discard' in GLSL 4.1 fragment shader with multisampling

I'm attempting depth peeling with multisampling enabled, and having some issues with incorrect data ending up in my transparent layers. I use the following to check if a sample (originally a fragment) is valid for this pass:
float depth = texelFetch(depthMinima, ivec2(gl_FragCoord.xy), gl_SampleID).r;
if (gl_FragCoord.z <= depth)
    discard;
Where depthMinima is defined as
uniform sampler2DMS depthMinima;
I have enabled GL_SAMPLE_SHADING which, if I understand correctly, should result in the fragment shader being called on a per-sample basis. If this isn't the case, is there a way I can get this to happen?
The result is that the first layer or two look right, but beneath that (and I'm doing 8 layers) I start getting junk values - mostly plain blue, sometimes values from previous layers.
This works fine for single-sampled buffers, but not for multi-sampled buffers. Does the discard keyword still discard the entire fragment?
I have enabled GL_SAMPLE_SHADING which, if I understand correctly, should result in the fragment shader being called on a per-sample basis.
It's not enough to only enable GL_SAMPLE_SHADING. You also need to set the minimum sample shading rate with glMinSampleShading:
A value of 1.0 indicates that each sample in the framebuffer should be independently shaded. A value of 0.0 effectively allows the GL to ignore sample rate shading. Any value between 0.0 and 1.0 allows the GL to shade only a subset of the total samples within each covered fragment. Which samples are shaded and the algorithm used to select that subset of the fragment's samples is implementation dependent.
– glMinSampleShading
In other words 1.0 tells it to shade all samples. 0.5 tells it to shade at least half the samples.
// Check the current value
GLfloat value;
glGetFloatv(GL_MIN_SAMPLE_SHADING_VALUE, &value);
If either GL_MULTISAMPLE or GL_SAMPLE_SHADING is disabled then sample shading has no effect.
Multiple fragment shader invocations are then issued for each fragment, each covering a subset of the fragment's samples. In other words, sample shading specifies the minimum number of samples to process for each fragment.
If GL_MIN_SAMPLE_SHADING_VALUE is set to 1.0, a fragment shader invocation is issued for each sample (within the primitive).
If it's set to 0.5, there is a shader invocation for every second sample.
Each invocation is evaluated at its sample location (gl_SamplePosition),
with gl_SampleID being the index of the sample that is currently being processed.
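As a rough illustration of how the minimum sample shading value maps to a sample count: the OpenGL specification gives the minimum number of uniquely shaded samples as max(ceil(value * samples), 1). The Python sketch below just evaluates that formula; the function name min_shaded_samples is made up for this example, and an implementation is free to shade more samples than this minimum.

```python
import math

def min_shaded_samples(min_sample_shading_value, samples):
    """Minimum number of samples the GL must shade per covered fragment."""
    return max(math.ceil(min_sample_shading_value * samples), 1)

for value in (0.0, 0.5, 1.0):
    print(value, min_shaded_samples(value, 8))
```

With an 8-sample framebuffer this gives 1, 4 and 8 samples for values of 0.0, 0.5 and 1.0, which matches the "every second sample" reading of 0.5 above.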
Should discard work on a per-sample basis, or does it still only work per-fragment?
With or without sample shading, discard still only terminates a single invocation of the shader.
Fragment Shader
Per-Sample Processing
I faced a similar problem when using depth peeling on a multisample buffer.
Some artifacts appear because of errors in the depth test between the multisample depth texture from the previous peel and the current fragment depth.
vec4 previous_peel_depth_tex = texelFetch(previous_peel_depth, coord, 0);
The third argument is the sample to use for the comparison, which will give a different value from the fragment center. As the question's author said, you can use gl_SampleID:
vec4 previous_peel_depth_tex = texelFetch(previous_peel_depth, ivec2(gl_FragCoord.xy), gl_SampleID);
This solved my problem, but with a huge performance drop: if you have 4 samples you will run your fragment shader 4 times, and if you have 4 peels that means 4x4 calls. You don't need to set the OpenGL flags as long as at least glEnable(GL_MULTISAMPLE); is on:
Any static use of [gl_SampleID] in a fragment shader causes the entire
shader to be evaluated per-sample
I decided to use a different approach and to add a bias when doing the depth comparison
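float previous_linearized = linearize_depth(previous_peel_depth_tex.r, near, far);
float current_linearized = linearize_depth(gl_FragCoord.z, near, far);
float bias_meter = 0.05;
// delta_depth is not defined in the original snippet; assumed to be the absolute difference
float delta_depth = abs(current_linearized - previous_linearized);
bool belong_to_previous_peel = delta_depth < bias_meter;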
This solved my problem, but some artifacts might still appear, and you need to adjust the bias to your eye-space units (meters, centimeters, ...).
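The biased comparison can be sketched outside the shader as follows. This is only an illustration in Python: linearize_depth is not shown in the answer, so the version here is an assumption based on a standard OpenGL perspective projection with the default [0, 1] depth range, and belongs_to_previous_peel is a hypothetical wrapper around the comparison.

```python
import math

def linearize_depth(depth, near, far):
    """Convert a [0, 1] depth-buffer value to an eye-space distance.

    Assumes a standard perspective projection and the default depth range.
    """
    z_ndc = 2.0 * depth - 1.0
    return 2.0 * near * far / (far + near - z_ndc * (far - near))

def belongs_to_previous_peel(previous_depth, current_depth, near, far, bias=0.05):
    """True if both depths are within `bias` eye-space units of each other."""
    previous_linearized = linearize_depth(previous_depth, near, far)
    current_linearized = linearize_depth(current_depth, near, far)
    return abs(current_linearized - previous_linearized) < bias

print(linearize_depth(0.0, 0.1, 100.0), linearize_depth(1.0, 0.1, 100.0))
```

Comparing in linearized units is what makes a single bias value meaningful: raw depth-buffer values are distributed very non-uniformly between the near and far planes, so a fixed bias on them would be far too loose near the camera or far too tight in the distance.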

Why is a simple shader slower than the standard pipeline?

I want to write a very simple shader which is equivalent to (or faster than) the standard pipeline. However, even the simplest shader possible:
Vertex Shader
void main(void)
{
    gl_TexCoord[0] = gl_MultiTexCoord0;
    gl_Position = ftransform();
}
Fragment Shader
uniform sampler2D Texture0;
void main(void)
{
    gl_FragColor = texture2D(Texture0, gl_TexCoord[0].xy);
}
cuts my framerate in half in my game, compared to the standard shader, and performs horribly if some transparent images are displayed. I don't understand this, because the standard shader (glUseProgram(0)) does lighting and alpha blending, while this shader only draws flat textures. What makes it so slow?
It looks like this massive slowdown of custom shaders is a problem with old Intel Graphics chips, which seem to emulate the shaders on the CPU.
I tested the same program on recent hardware, and the frame rate drop with the custom shader activated is only about 2-3 percent.
EDIT: wrong theory. See new answer below
I think you might bump into overdraw.
I don't know what engine you are using your shader in, but if you have alpha blending on then you might end up overdrawing a lot.
Think about it this way :
If you have an 800x600 screen and a 2D quad covering the whole screen, that quad will cause 480000 fragment shader calls, although it has only 4 vertices.
Now, moving further, let's assume you have 10 such quads, one on top of another. If you don't sort your geometry front to back, or if you are using alpha blending with no depth test, then you will end up with 10x800x600 = 4800000 fragment shader calls.
2D is usually quite expensive in OpenGL due to overdraw. 3D rejects many fragments. Even though the shaders are more complicated, the number of calls is greatly reduced for 3D objects compared to 2D objects.
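The arithmetic above is simple enough to check with a few lines of Python. This is only a back-of-the-envelope sketch; fragment_invocations is a made-up helper that assumes every layer covers the whole screen and that no fragment is rejected by a depth test.

```python
def fragment_invocations(width, height, layers):
    """Fragment shader calls for `layers` full-screen quads with no rejection."""
    return width * height * layers

single_quad = fragment_invocations(800, 600, 1)
ten_quads = fragment_invocations(800, 600, 10)
print(single_quad, ten_quads)
```

The vertex count for the ten quads is only 40, so in a case like this practically all of the shader cost sits in the fragment stage.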
After a long investigation, it turned out that the slowdown of the simple shader was caused by the shader being too simple.
In my case, the slowdown was caused by the text rendering engine, which made heavy use of "glBitmap", which would be very slow with textures enabled (for whatever reason I cannot understand; these letters are tiny).
However, this did not affect the standard pipeline, as it honors glDisable(GL_LIGHTING) and glDisable(GL_TEXTURE_2D), which circumvents the slowdown, whereas the simple shader did not and thus did even more work than the standard pipeline. After adding support for these two switches to the custom shader, it is as fast as the standard pipeline, with the added ability to apply arbitrary effects without any performance impact!

Weird noise on rendered objects - OpenGL

To be more specific, here's the screenshot:
After debugging for about 3 days, I really have no idea. Those black lines and strange fractal black segments just drive me nuts. The geometries are rendered by forward rendering, blending layer by layer for each light I add.
My first guess was to download the newest graphics card driver (I'm using a GTX 660M), but that didn't solve it. Could VSync be an issue here? (I'm rendering in a window rather than in full-screen mode.) Or what is the most likely cause of this kind of trouble?
My code is like this:
glBlendFunc(GL_ONE, GL_ONE);
/*loop here*/
/*draw for each light I had*/
One thing I've noticed looking at your lighting vertex shader code:
void main()
{
    gl_Position = projectionMatrix * vec4(position, 1.0);
    texCoord0 = texCoord;
    normal0 = (normalMatrix * vec4(normal, 0)).xyz;
    modelViewPos0 = (modelViewMatrix * vec4(position, 1)).xyz;
}
You are applying the projection matrix directly to the vertex position, which I'm assuming is in object space.
Try setting it to:
gl_Position = projectionMatrix * modelViewMatrix * vec4(position, 1.0);
And we can work from there.
This answer is slightly speculative, but based on the symptoms and the code you posted, I suspect a precision problem. The rendering code you linked looks like this in a shortened form:
for (Light light : lights) {
So you're rendering the same thing multiple times, with different shaders, and rely on the resulting pixels to end up with the same depth value (depth comparison function is GL_EQUAL). This is not a safe assumption. Quote from the GLSL spec:
In this section, variance refers to the possibility of getting different values from the same expression in different programs. For example, say two vertex shaders, in different programs, each set gl_Position with the same expression in both shaders, and the input values into that expression are the same when both shaders run. It is possible, due to independent compilation of the two shaders, that the values assigned to gl_Position are not exactly the same when the two shaders run. In this example, this can cause problems with alignment of geometry in a multi-pass algorithm.
I copied the whole paragraph because the example they are using sounds like an exact description of what you are doing.
To prevent this from happening, you can declare your out variables as invariant. In each of your vertex shaders that you use for the multi-pass rendering, add this line:
invariant gl_Position;
This guarantees that the outputs are identical if all the inputs are the same. To meet this condition, you should also make sure that you pass exactly the same transformation matrix into both shaders, and of course use the same vertex coordinates.

Is it possible to have face culling before the geometry shader stage?

Is it possible for only front facing triangles to be sent to the geometry shader? I believe that culling only happens to emitted triangles after the geometry shader by default.
Yes, it is. Back in the ancient days of Quake, face culling was done on the CPU using a simple dot product per triangle. Any triangle that failed the test was not included in the list of indices drawn.
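A minimal sketch of that kind of CPU-side culling, written in Python for brevity. It assumes counter-clockwise front faces, positions already in the same space as the eye point, and a flat index list of triangles; the function name cull_back_faces and the data layout are made up for illustration and are not taken from any particular engine.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cull_back_faces(vertices, indices, eye):
    """Return only the indices of triangles whose front side faces `eye`."""
    kept = []
    for i in range(0, len(indices), 3):
        ia, ib, ic = indices[i:i + 3]
        a, b, c = vertices[ia], vertices[ib], vertices[ic]
        normal = cross(sub(b, a), sub(c, a))   # face normal for CCW winding
        if dot(normal, sub(eye, a)) > 0.0:     # the per-triangle dot product test
            kept.extend((ia, ib, ic))
    return kept

vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
indices = [0, 1, 2, 0, 2, 1]   # the same triangle with both windings
print(cull_back_faces(vertices, indices, (0.0, 0.0, 5.0)))
```

The reduced index list is what would then be uploaded and drawn each frame, which is exactly the per-frame cost that the rest of this answer weighs against simply letting the GPU cull.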
This is not a viable optimization on most hardware these days, but you still see it employed from time to time in specialized applications. One such application I have seen a lot of is using the PS3's Cell SPEs to cull out triangles during batching to save vertex transform workload on the PS3's RSX GPU - keep in mind, the PS3 still uses a basic shader architecture where there are a fixed number of specialized vertex shader units and fragment shader units. Balancing shader workload is important on that GPU.
I may be missing the point of your question though; what benefit do you expect/want to get out of culling the primitives early?
What I was trying to say is that on modern hardware and software, vertex transform / primitive assembly is usually not a bottleneck. Fragment processing is much more expensive these days, so having primitives culled during rasterization is usually the extent to which you have to worry about things for performance. The PS3's RSX is a special case, it has very poor vertex performance and a CPU architecture that is hard to keep busy, so it makes sense to offload primitive culling to the CPU.
You can still cull triangles before the vertex shader/tessellation/geometry shader on the CPU, but storing normals per triangle somewhere and transferring a new set of indices to draw each frame hardly makes this a wise use of resources. You may spend more time and memory setting up the reduced list of triangles than you would if you processed them on the GPU and let GL throw the backward-facing primitives out during rasterization.
There is at least one use-case that comes to mind where this actually could still be a useful thing to do. I am referring to tessellated patches. If you can determine on the CPU before tessellation occurs that the entire patch faces the wrong way, you can skip having to tessellate them on the GPU. Ordinarily rendering will not be vertex-bound these days, but tessellation is one case where it may be.
It is actually possible. As the other answer mentioned, you can do it on the software side. But there are also stages in between the vertex shader and the geometry shader: the hull shader (programmable; the tessellation control shader in OpenGL terms), the primitive generator (fixed), and the domain shader (programmable; the tessellation evaluation shader in OpenGL terms). In the hull shader you can specify tessellation levels for your patch.
If you set any of the patch's outer tessellation levels to 0.0, then the patch will be discarded and it will not enter the geometry shader!
Hope this helps :)
Although it doesn't technically satisfy the question because the triangle reaches the geometry shader, you can cull triangles within the geometry shader itself. This might be the solution people reading this question are looking for, it was for me.
I used the shader below to implement wireframe drawing of quads with culling, where each quad is drawn using two triangles. The idea of using
with culling includes the diagonal of each quad, which isn't desired, so I use a geometry shader to convert each triangle to just the two lines. However, culling then no longer works, because the geometry shader emits only lines. Instead, culling is performed at the beginning of the geometry shader: set the uniform culling to -1, 0 or 1 for the desired behaviour. (Note: this culling test assumes that the w coordinate is positive for each vertex.)
#version 430 core
layout (triangles) in;
layout (line_strip, max_vertices=3) out;
uniform int culling;
void main()
{
    // Perform culling here with an early return
    mat3 M = mat3(gl_in[0].gl_Position.xyz, gl_in[1].gl_Position.xyz, gl_in[2].gl_Position.xyz);
    if(culling * determinant(M) < 0)
    {
        EndPrimitive(); // Necessary?
        return;
    }
    // Emit the three vertices as one line strip (two lines, no diagonal)
    gl_Position = gl_in[0].gl_Position;
    EmitVertex();
    gl_Position = gl_in[1].gl_Position;
    EmitVertex();
    gl_Position = gl_in[2].gl_Position;
    EmitVertex();
    EndPrimitive();
}

Modern OpenGL colors

I noticed old code has GL_AMBIENT, GL_DIFFUSE, GL_SPECULAR etc. inputs with glMaterialfv. How are those replaced in modern GLSL code?
e.g. Assuming a library importing models (Assimp) gives direct values to such color categories , can they be still used directly (on core Context)?
Yes, at least sort of (though, of course, in modern code, you handle most of that computation in shaders).
One typical possibility is to use uniforms for your ambient color(s), light position(s), eye position, etc. Then set up a couple of varyings that will be used to pass a diffuse color and specular color from your vertex shader to your fragment shader. Your vertex shader computes values for those varyings based on the uniform inputs.
The fragment shader then receives (for example) a texture and the varyings mentioned above, and combines them together (along with any other inputs you might want) to produce a final color for the fragment (which it assigns to gl_FragColor, or, in a core profile where gl_FragColor is not available, to a user-declared out variable).
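As a rough sketch of what such a shader computes with the imported material colors, here is a Phong-style combination written in Python instead of GLSL. The function phong and its parameter names are made up for illustration; a real shader would do the same arithmetic with uniforms and per-fragment inputs, and other lighting models are equally valid.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phong(ambient, diffuse, specular, shininess, normal, to_light, to_eye):
    """Combine material colors into one RGB color, clamped to 1.0 per channel."""
    n, l, e = normalize(normal), normalize(to_light), normalize(to_eye)
    n_dot_l = max(dot(n, l), 0.0)
    # Reflect the light direction about the normal: r = 2(n.l)n - l
    r = tuple(2.0 * dot(n, l) * nc - lc for nc, lc in zip(n, l))
    spec = max(dot(r, e), 0.0) ** shininess if n_dot_l > 0.0 else 0.0
    return tuple(min(a + d * n_dot_l + s * spec, 1.0)
                 for a, d, s in zip(ambient, diffuse, specular))

color = phong((0.1, 0.1, 0.1), (0.5, 0.5, 0.5), (0.3, 0.3, 0.3), 32.0,
              (0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
print(color)
```

The ambient, diffuse and specular tuples play the role of the values that used to go to glMaterialfv; in a core context they would typically be uploaded as uniforms once per material.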