I am doing parallel reduction in a compute shader. I am recursively computing the bounding box of chunks of fragments (starting from a G-buffer resulting from a scene render), and then the bounding box of chunks of bounding boxes, and so on until I end up with a single bounding box.
Initially I was doing this with a depth texture and a single vec2 holding the min and max depths, and I was storing the hierarchy in an SSBO like so:
// Each tile holds the min and max depth of the subtiles in the level under it.
// Each tile holds 32 items and is 8*4 or 4*8 depending on the level. Conceptually,
// the texDepth texture is level 4.
layout(std430, binding = 5) buffer aabbHierarchy
{
    vec2 level0[8 * 4],       // one 8*4 tile
         level1[32 * 32],     // 8*4 4*8 tiles
         level2[256 * 128],   // 32*32 8*4 tiles
         level3[1024 * 1024]; // 256*128 4*8 tiles
};
Eventually I ran into issues and decided to switch to full AABBs. The structure and SSBO changed like so:
struct AABB
{
    vec4 low, high;
};

layout(std430, binding = 5) buffer aabbHierarchy
{
    AABB level0[8 * 4],
         level1[32 * 32],
         level2[256 * 128],
         level3[1024 * 1024];
};
Of course I changed everything related to performing the actual calculation accordingly.
But now, as it turns out, the GL context freezes when I issue any call after a glUseProgram on this program. The glUseProgram call itself completes without any problem, but any GL call I make after it hangs my application. This did not happen with the original vec2 version.
I have done the math, and my SSBO is 34,636,800 bytes (with AABB), which is far smaller than the SSBO block size limit of 128 MB. At no point in my application does glCheckError return anything other than 0, and my shader compilations, buffer allocations and texture creations all work (at least they don't return an error). As an added bonus, allocating the same size of SSBO as with AABB but using it as vec2 in the shader does not make the application freeze.
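To show the math (assuming std430 layout, where an array of vec2 has an 8-byte stride and the two-vec4 AABB is 32 bytes):
// Size check for the hierarchy above: 8*4 + 32*32 + 256*128 + 1024*1024 elements.
#include <cstdio>
int main()
{
    const long long elements = 8 * 4 + 32 * 32 + 256 * 128 + 1024 * 1024; // 1,082,400
    std::printf("vec2 hierarchy: %lld bytes\n", elements * 8);  //  8,659,200
    std::printf("AABB hierarchy: %lld bytes\n", elements * 32); // 34,636,800
}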
I am using an OpenGL 4.4 context with #version 430 in the compute shader, with no extensions. This is running on an ASUS ROG FX553VD with an Nvidia GeForce GTX 1050 in it.
EDIT: since I couldn't reproduce this in an MCVE, it must be something to do with the code around it. Still, the complete lack of error reporting is really weird. I was able to find the very line that provokes the error, and to confirm that at that point any GL call (even one as simple as glGetIntegerv) freezes the application.
It's a pretty big project, so sorry for that. The problematic line is there, while the project's root is there. Please note that this is the aabbHierarchy branch, not master. I added extensive tracing just to make it clear where and when the program crashes.
EDIT 2: I added an OpenGL debug context, and all it did was print out a couple of lines of "buffer detailed info" that don't help.
Turns out it's a driver issue. I was able to produce a minimal working example which freezes on my Nvidia GPU but works perfectly on my integrated Intel GPU.
For future reference, as of 02/06/2019 the Nvidia driver takes ages to handle compute shaders that declare arrays of structures in buffer interface blocks. This works fine:
layout(std430, binding = 0) buffer bleh
{
    vec2 array[100000];
};
but this takes a full 5 seconds to finish executing glUseProgram:
struct AABB
{
    vec2 a;
};

layout(std430, binding = 0) buffer bleh
{
    AABB array[100000];
};
Mine looks like a complete freeze because I'm allocating over 30 MB of buffer-backed shader storage through a structure, which is a perfectly reasonable thing to do; but seeing how the Nvidia driver takes 5 seconds to deal with less than a megabyte, I can only wonder how long my case would take.
On my Intel GPU, the glUseProgram call executes instantly for both shaders. I will report this to Nvidia ASAP.
I have created this simple compute shader to go through a 3D texture and set alpha values greater than 0 to 1:
#version 440 core

layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0, rgba8) uniform image3D voxelTexture;

void main() {
    ivec3 pos = ivec3(gl_GlobalInvocationID);
    vec4 value = imageLoad(voxelTexture, pos);
    if (value.a > 0.0) {
        value.a = 1.0;
        imageStore(voxelTexture, pos, value);
    }
}
I invoke it using the texture dimensions as work group count, size = 128:
opacityFixShader.bind();
glBindImageTexture(0, result.mID, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA8);
glDispatchCompute(size, size, size);
opacityFixShader.unbind();
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
Timing this in RenderDoc on a GTX 1080 Ti results in a whopping 3.722 ms, which seems way too long. I feel like I am not taking full advantage of compute; should I increase the local group size or something?
I feel like I am not taking full advantage of compute; should I increase the local group size or something?
Yes, definitely. An implementation-defined number of invocations inside each work group will be bundled together as a Warp/Wavefront/Subgroup/whatever-you-like-to-call-it and executed on the actual SIMD hardware units. For all practical purposes, you should use a multiple of 64 for the local size of the work group, otherwise you will waste a lot of potential GPU power.
Your workload will be totally dominated by the memory accesses, so you should also think about optimizing your accesses for cache efficiency. Since you use a 3D texture, I would actually recommend a 3D local size like 4x4x4 or 8x8x8, so that you profit from the 3D data organization your GPU most likely uses internally for storing 3D texture data.
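For illustration (not your exact code), with an 8x8x8 local size the dispatch shrinks accordingly, reusing the names from your snippet:
// Shader side would declare: layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;
// Host side: one work group now covers an 8x8x8 block of the 128^3 texture.
opacityFixShader.bind();
glBindImageTexture(0, result.mID, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA8);
const GLuint groups = (size + 7) / 8;      // 128 / 8 = 16 groups per axis
glDispatchCompute(groups, groups, groups); // 16x16x16 groups of 512 invocations each
opacityFixShader.unbind();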
Side note:
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
Are you sure about that? If you are going to sample the texture afterwards, that is the wrong barrier: the barrier bit has to describe how the data will be accessed after the barrier.
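For example, if the next pass reads the texture through texture() or texelFetch(), it would need to be:
// The bit describes how the data will be read *after* the barrier.
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);
// GL_SHADER_IMAGE_ACCESS_BARRIER_BIT would only be right if the next pass
// accessed the texture via imageLoad/imageStore again.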
Also:
I have created this simple compute shader to go through a 3D texture and set alpha values greater than 0 to 1
Why are you doing this? This might be a typical X-Y problem. Spending a whole compute pass on just that is probably a bad idea in the first place, and it will never make good use of the compute resources of the GPU. The operation could potentially be done in the shaders that actually use the texture, where it might be practically free, because those shaders are also very likely dominated by the latency of the texture accesses. Another point to consider is that you might access the texture with some texture filtering and still get alpha values between 0 and 1 even after your pre-process (though maybe that is exactly what you want).
I'm developing a game engine and working on a deferred rendering pipeline. After finishing the second-pass (shading) shader, I started testing the pipeline on various other computers I have. Interestingly, on my older laptop I get strange artifacting on each 4x8 pixel group (example below). It looks like the shader is executing and ultimately returning the correct color, but in a very random fashion.
This question is not a bug report or a solution request; I have fixed the issue with the code patch below. This thread is rather to gather a better understanding of why this happens, and to provide insight for anyone else affected by the same issue.
To describe the effect in more detail:
About 50% of the screen has a 4x8 group of pixels that heavily tints the actual resulting color.
These 4x8 groups appear in random places on the screen each frame, causing a "static" effect.
Certain models tint different colours. As you can see below, the reflective bunny is tinted blue while the refractive spheres are tinted yellow. This doesn't seem to be a G-buffer issue, however, as they both sample from the same texture, which I'm sure is correct (I can see it on screen at the same time).
Different objects' 4x8 blocks show the correct color at different rates. You can see the refractive bunny is mostly correct, but the reflective floor and refractive spheres are simply white and yellow.
The tint colors of the 4x8 blocks change wildly depending on what other programs are running on the GPU.
The image should look like this:
Pseudocode of the broken shader was something like
out vec3 FragColor; // Out pixel of fragment shader

void main() {
    for (int i = 0; i < NumberOfPointLights; i++) {
        // ... lighting calculation code ...
        FragColor += lighting;
    }
    for (int i = 0; i < NumberOfSpotLights; i++) {
        // ... lighting calculation code ...
        FragColor += lighting;
    }
    for (int i = 0; i < NumberOfDirectionalLights; i++) {
        // ... lighting calculation code ...
        FragColor += lighting;
    }
}
To fix the issue, I simply initialised a temporary variable to hold the output color, wrote to that during lighting calculations and then wrote that to the fragment output at the end. As follows:
out vec3 FragColor; // Out pixel of fragment shader

void main() {
    vec3 outcolor = vec3(0);
    for (int i = 0; i < NumberOfPointLights; i++) {
        // ... lighting calculation code ...
        outcolor += lighting;
    }
    for (int i = 0; i < NumberOfSpotLights; i++) {
        // ... lighting calculation code ...
        outcolor += lighting;
    }
    for (int i = 0; i < NumberOfDirectionalLights; i++) {
        // ... lighting calculation code ...
        outcolor += lighting;
    }
    FragColor = outcolor;
}
I was surprised this worked, as I assumed this behaviour was the default: that writing to the fragment output doesn't actually write to VRAM each time, only at the end. I was under the impression that a fragment output variable is read after shader execution, hence why it's declared at global scope.
Suspicions and Questions
From my research, I read that a 4x8 pixel group is the size of one "work group" or "core" on an Nvidia GPU (which I am using), while AMD uses 8x8 pixel work groups. So something is causing random work groups' output color to be permanently affected until they are reassigned to a different location on screen.
The fact that the colours change depending on what else is using the GPU tells me that either the GPU has a very complicated memory allocation scheme and is reading from other programs' memory (which I doubt), or the shader is getting uninitialised memory every frame. But surely the same memory for the texture is overwritten each time?
Perhaps writing to the fragment out variable writes to VRAM each time, and writing to it too many times per work group causes the work group to bail out, leaving mixed results behind. That would explain why a temporary/local variable works.
As I am always using += (read then write), the temporary variable initialisation acts as an explicit instruction to start with black and add to it, while writing to the fragment out directly adds to the last color of that pixel. If this were the case, though, why did it work correctly on a higher-end PC?
Other details
My old laptop uses a GT 540M with Optimus technology alongside integrated Intel HD 3000 graphics (which isn't being used here).
My newer desktop PC uses a GTX 1070.
Both GPUs use very little VRAM while running the application, less than 100 MB.
The shader is compiled with #version 400 core.
This is a driver bug. There's nothing more to look into here.
Output variables can be read from and written to; this is part of GLSL. So it seems the driver simply screwed up its implementation of that feature.
I'm writing my own OpenGL 3D application and have stumbled across a little problem:
I want the number of light sources to be dynamic. For this, my shader contains an array of my light struct: uniform PointLight pointLights[NR_POINT_LIGHTS];
The variable NR_POINT_LIGHTS is set by the preprocessor, and the directive for it is generated by my application code (Java). So when creating a shader program, I pass the desired starting number of PointLights, complete the source text with the preprocessor directive, compile, link and use. This works great.
Now I want to change this variable. I re-build the shader source string, re-compile and re-link a new shader program and continue using this one. It just appears that all uniforms set on the old program are lost in the process (of course, I had only set them on the old program).
My ideas on how to fix this:
Don't compile a new program, but rather somehow change the source data for the currently running shaders and somehow re-compile them, to continue using the program with the right uniform values
Copy all uniform data from the old program to the newly generated one
What is the right way to do this? How do I do this? I'm not very experienced yet and don't know if any of my ideas is even possible.
You're looking for a Uniform Buffer or (4.3+ only) a Shader Storage Buffer.
struct Light {
    vec4 position;
    vec4 color;
    vec4 direction;
    /* anything else you want */
};
Uniform Buffer:
const int MAX_ARRAY_SIZE = 1024; // e.g. 65536 / sizeof(Light); must be a compile-time constant

layout(std140, binding = 0) uniform light_data {
    Light lights[MAX_ARRAY_SIZE];
};
uniform int num_of_lights;
Host Code for Uniform Buffer:
GLuint light_ubo;
glGenBuffers(1, &light_ubo);
glBindBuffer(GL_UNIFORM_BUFFER, light_ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(GLfloat) * static_light_data.size(), static_light_data.data(), GL_STATIC_DRAW); // usage hint can be adjusted for your needs
GLuint light_index = glGetUniformBlockIndex(program_id, "light_data");
glUniformBlockBinding(program_id, light_index, 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, light_ubo);
glUseProgram(program_id); // glUniform* affects the currently bound program
glUniform1i(glGetUniformLocation(program_id, "num_of_lights"), static_light_data.size() / 12); // my lights have 12 floats per light, so we divide by 12
Shader Storage Buffer (4.3+ Only):
layout(std430, binding = 0) buffer light_data {
    Light lights[];
};

/* ... */
void main() {
    /* ... */
    int num_of_lights = lights.length();
    /* ... */
}
Host Code for Shader Storage Buffer (4.3+ Only):
GLuint light_ssbo;
glGenBuffers(1, &light_ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, light_ssbo);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(GLfloat) * static_light_data.size(), static_light_data.data(), GL_STATIC_DRAW); // usage hint can be adjusted for your needs
GLuint light_ssbo_block_index = glGetProgramResourceIndex(program_id, GL_SHADER_STORAGE_BLOCK, "light_data");
glShaderStorageBlockBinding(program_id, light_ssbo_block_index, 0);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, light_ssbo);
The main differences between the two are that Uniform Buffers:
Are compatible with older, OpenGL 3.x hardware,
Are limited on most systems to 64 KB per block,
Need their arrays' [maximum] sizes declared statically at shader compile time.
Whereas Shader Storage Buffers:
Require hardware no older than about five years (OpenGL 4.3+),
Have an API-mandated minimum allowable size of 16 MB (and most systems will allow up to 25% of the total VRAM),
Can dynamically query the size of any arrays stored in the buffer (though this can be buggy on older AMD systems),
Can be slower than Uniform Buffers on the shader side (roughly equivalent to a texture access).
Don't compile a new program, but rather somehow change the source data for the currently running shaders and somehow re-compile them, to continue using the program with the right uniform values
This isn't doable at runtime if I'm understanding right (it would imply changing the code of an already-compiled shader program), but if you modify the shader source text you can compile a new shader program. The thing is, how often does the number of lights change in your scene? Because this is a fairly expensive process.
You could specify a maximum number of lights if you don't mind having a limitation, and only use the lights in the shader that have been populated with information. That saves you the task of tweaking the source text and recompiling a whole new shader program, but it leaves you with a cap on the number of lights. If you aren't planning on having absolutely loads of lights in your scene, but are planning on having the number of lights change relatively often, then this is probably the best option for you.
However, if you really want to go down the route that you are looking at here:
Copy all uniform data from the old program to the newly generated one
You can look at using a Uniform Block. If you're going to be using shader programs with similar or shared uniforms, Uniform Blocks are a good way of managing those "universal" uniform variables across your shader programs, or in your case across the new shaders you move to as you grow the number of lights. There's a good tutorial on uniform blocks here.
Lastly, depending on the OpenGL version you're using, you might still be able to achieve dynamic array sizes. OpenGL 4.3 introduced buffer-backed blocks with unsized arrays, where glBindBufferRange controls how much of the buffer (and therefore how many lights) the shader sees. You'll see more talk about that functionality in this question and this wiki reference.
The last would probably be my preference, but it depends on whether you're targeting hardware that only supports older OpenGL versions.
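A minimal host-side sketch of that last approach (LightData, MAX_LIGHTS, lightCount and lightData are placeholder names, not from the question): bind only the populated part of the buffer, and lights.length() in the shader will report the current light count.
// Allocate room for the maximum number of lights once.
GLuint lightSSBO;
glGenBuffers(1, &lightSSBO);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, lightSSBO);
glBufferData(GL_SHADER_STORAGE_BUFFER, MAX_LIGHTS * sizeof(LightData), nullptr, GL_DYNAMIC_DRAW);

// Whenever the lights change: upload the active lights and bind just that
// range to binding point 0, so lights.length() == lightCount in the shader.
glBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, lightCount * sizeof(LightData), lightData);
glBindBufferRange(GL_SHADER_STORAGE_BUFFER, 0, lightSSBO, 0, lightCount * sizeof(LightData));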
I want to write a very simple shader which is equivalent to (or faster than) the standard pipeline. However, even the simplest shader possible:
Vertex Shader
void main(void)
{
    gl_TexCoord[0] = gl_MultiTexCoord0;
    gl_Position = ftransform();
}
Fragment Shader
uniform sampler2D Texture0;

void main(void)
{
    gl_FragColor = texture2D(Texture0, gl_TexCoord[0].xy);
}
Cuts my framerate in half in my game compared to the standard pipeline, and performs horrifically when some transparent images are displayed. I don't understand this, because the standard pipeline (glUseProgram(0)) does lighting and alpha blending, while this shader only draws flat textures. What makes it so slow?
It looks like this massive slowdown of custom shaders is a problem with old Intel graphics chips, which seem to emulate the shaders on the CPU.
I tested the same program on recent hardware, and the frame drop with the custom shader activated is only about 2-3 percent.
EDIT: wrong theory. See new answer below
I think you might be running into overdraw.
I don't know which engine you are using your shader in, but if you have alpha blending on then you might end up overdrawing a lot.
Think about it this way:
If you have an 800x600 screen and a 2D quad over the whole screen, that 2D quad produces 480,000 fragment shader calls, although it has only 4 vertices.
Now, moving further, let's assume you have 10 such quads, one on top of another. If you don't sort your geometry front to back, or if you are using alpha blending with no depth test, then you will end up with 10 x 800 x 600 = 4,800,000 fragment calls.
2D is usually quite expensive in OpenGL due to overdraw. 3D rejects many fragments. Even though the shaders are more complicated, the number of calls is greatly reduced for 3D objects compared to 2D objects.
After long investigation, the slowdown of the simple shader was caused by the shader being too simple.
In my case, the slowdown was caused by the text rendering engine, which made heavy use of glBitmap, which is very slow with texturing enabled (for a reason I cannot understand; these letters are tiny).
However, this did not affect the standard pipeline, because it honors glDisable(GL_LIGHTING) and glDisable(GL_TEXTURE_2D), which circumvents the slowdown, whereas the simple shader failed to do so and thus ended up doing even more work than the standard pipeline. After introducing these two states to the custom shader, it is as fast as the standard pipeline, plus the ability to add arbitrary effects without any performance impact!
I am working on a project that requires drawing a lot of data as it is acquired by an ADC...something like 50,000 lines per frame on a monitor 1600 pixels wide. It runs great on a system with a 2007-ish Quadro FX 570, but basically can't keep up with the data on machines with Intel HD 4000 class chips. The data load is 32 channels of 200 Hz data received in batches of 5 samples per channel 40 times per second. So, in other words, the card only needs to achieve 40 frames per second or better.
I am using a single VBO for all 32 channels with space for 10,000 vertices each. The VBO is essentially treated like a series of ring buffers for each channel. When the data comes in, I decimate it based on the time scale being used. So, basically, it tracks the min/max for each channel. When enough data has been received for a single pixel column, it sets the next two vertices in the VBO for each channel and renders a new frame.
I use glMapBuffer() to access the data once, update all of the channels, use glUnmapBuffer, and then render as necessary.
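In sketch form (simplified, with illustrative names Vertex, Channel, VERTS_PER_CHANNEL and samplesPerColumn; verts points into the region returned by glMapBuffer() for this channel), the decimation looks something like:
// Per-channel min/max decimation into the ring-buffer VBO slice.
#include <algorithm>
#include <limits>

struct Vertex { float x, y; };

struct Channel {
    float minV =  std::numeric_limits<float>::max();
    float maxV = -std::numeric_limits<float>::max();
    int   count = 0;
    int   writeIndex = 0;   // ring-buffer position within this channel's slice
};

constexpr int VERTS_PER_CHANNEL = 10000;

void addSample(Channel& ch, float t, float v, Vertex* verts, int samplesPerColumn)
{
    ch.minV = std::min(ch.minV, v);
    ch.maxV = std::max(ch.maxV, v);
    if (++ch.count < samplesPerColumn)
        return;

    // One pixel column's worth of samples collected: emit a min/max vertex pair.
    verts[ch.writeIndex + 0] = { t, ch.minV };
    verts[ch.writeIndex + 1] = { t, ch.maxV };
    ch.writeIndex = (ch.writeIndex + 2) % VERTS_PER_CHANNEL;

    ch.count = 0;
    ch.minV  =  std::numeric_limits<float>::max();
    ch.maxV  = -std::numeric_limits<float>::max();
}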
I manually calculate the transform matrix ahead of time (using an orthographic transform calculated in a non-generic way to reduce multiplications), and the vertex shader looks like:
#version 120

varying vec4 _outColor;

uniform vec4 _lBound = vec4(-1.0);
uniform vec4 _uBound = vec4(1.0);
uniform mat4 _xform = mat4(1.0);

attribute vec2 _inPos;
attribute vec4 _inColor;

void main()
{
    gl_Position = clamp(_xform * vec4(_inPos, 0.0, 1.0), _lBound, _uBound);
    _outColor = _inColor;
}
The _lBound, _uBound, and _xform uniforms are updated once per channel. So, 32 times per frame. The clamp is used to limit certain channels to a range of y-coordinates on the screen.
The fragment shader is simply:
#version 120

varying vec4 _outColor;

void main()
{
    gl_FragColor = _outColor;
}
There is other stuff being rendered to the screen (channel labels, for example, using quads and a texture atlas), but profiling in gDEBugger seems to indicate that the line rendering takes the overwhelming majority of the time per frame.
Still, 50,000 lines does not seem like a horrendously large number to me.
So, after all of that, the question is: are there any tricks to speeding up line drawing? I tried rendering the lines to the stencil buffer and then clipping a single quad, but that was slower. I thought about drawing the lines to a texture, then drawing a quad with that texture, but that does not seem scalable, or even faster, given the large textures that would be uploaded constantly. I saw a technique that stores the y values in a single-row texture, but that seems more like a memory optimization than a speed optimization.
Mapping a VBO might slow you down because the driver may have to sync the GPU with the CPU. A more performant way is to just hand your data to the GPU, so the CPU and GPU can run more independently:
Recreate the VBO's data store every time, and do create it with GL_STATIC_DRAW.
If you need to map your data, do NOT map it as readable (use GL_WRITE_ONLY).
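A sketch of what that looks like in practice (buffer orphaning followed by a write-only map; vbo, Vertex, vertexCount and vertices are placeholder names):
// Orphan the old storage so the driver doesn't have to wait for the GPU,
// then map write-only and fill in the new data.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vertexCount * sizeof(Vertex), nullptr, GL_STATIC_DRAW); // orphan
void* dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
memcpy(dst, vertices, vertexCount * sizeof(Vertex));
glUnmapBuffer(GL_ARRAY_BUFFER);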
Thanks, everyone. I finally settled on blitting between framebuffers backed by renderbuffers. Works well enough. Many suggested using textures, and I may go that route in the future if I eventually need to draw behind the data.
If you're just scrolling a line graph (GDI style), draw the new column on the CPU and use glTexSubImage2D to update a single column in the texture. Draw it as a pair of quads and update the st coordinates to handle scrolling/wrapping.
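A sketch of that single-column update (scrollTex, x, width, height and columnPixels are placeholder names):
// Upload one new pixel column into the existing scrolling-graph texture;
// x wraps around the texture width, and the quads' texture coordinates shift to match.
glBindTexture(GL_TEXTURE_2D, scrollTex);
glTexSubImage2D(GL_TEXTURE_2D, 0,
                x, 0,        // xoffset, yoffset: the column being replaced
                1, height,   // one pixel wide, full height
                GL_RGBA, GL_UNSIGNED_BYTE, columnPixels);
x = (x + 1) % width;         // advance and wrap for the next column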
If you need to update all the lines all the time, use a VBO created with GL_DYNAMIC_DRAW and use glBufferSubData to update the buffer.
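And a sketch of the glBufferSubData path for the dynamic VBO (channelVBO, Vertex, vertexCount and frameVerts are placeholder names):
// Update the already-allocated GL_DYNAMIC_DRAW buffer in place; no reallocation, no mapping.
glBindBuffer(GL_ARRAY_BUFFER, channelVBO);
glBufferSubData(GL_ARRAY_BUFFER,
                0,                            // offset into the buffer
                vertexCount * sizeof(Vertex), // size of the new data in bytes
                frameVerts);                  // freshly computed vertices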