Link error dependent on for loop length - opengl

I have a shader program with a for loop in the geometry shader. The program links (and operates) fine when the for loop length is small enough. If I increase the length then I get a link error (with empty log). The shaders compile fine in both cases. Here is the geometry shader code (with everything I thought relevant):
#version 330
layout (points) in;
layout (triangle_strip, max_vertices = 256) out;
...
void main()
{
    ...
    for(int i = 0 ; i < 22 ; ++i) // <-- Works with 22, not with 23.
    {
        ...
        EmitVertex();
        ...
        EmitVertex();
        ...
        EmitVertex();
        ...
        EmitVertex();
        EndPrimitive();
    }
}
The specs state: "non-terminating loops are allowed. The consequences of very long or non-terminating loops are platform dependent." Could this be a platform-dependent situation (GeForce GT 640)? As the shader code evolved, the maximum length of the for loop changed (more code -> smaller max), which leads me to suspect it has something to do with loop unrolling. Can anyone give me more info on this issue? (Let me know if you need more code/description.)

One possible reason for failure to link programs containing geometry shaders is the GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS limit. Section 11.3.4.5 "Geometry Shader Outputs" of the OpenGL 4.5 core profile specification states (my emphasis):
There are two implementation-dependent limits on the value of GEOMETRY_VERTICES_OUT; it may not exceed the value of MAX_GEOMETRY_OUTPUT_VERTICES, and the product of the total number of vertices and the sum of all
components of all active output variables may not exceed the value of MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS. LinkProgram will fail if it determines
that the total component limit would be violated.
The GL guarantees that this total component limit is at least 1024.
You did not paste the full code of your shaders, so it is unclear how many components per vertex you are using, but this limit might be the reason for the link failure.
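As a rough host-side sanity check (just a sketch; componentsPerVertex is a placeholder you would have to work out from your own output variables, it is not taken from your code):
GLint maxOutputVertices = 0, maxTotalComponents = 0;
glGetIntegerv(GL_MAX_GEOMETRY_OUTPUT_VERTICES, &maxOutputVertices);
glGetIntegerv(GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS, &maxTotalComponents);

const GLint maxVertices = 256;        // the shader's max_vertices
const GLint componentsPerVertex = 8;  // placeholder: sum of all components of all active per-vertex outputs
if (maxVertices > maxOutputVertices ||
    maxVertices * componentsPerVertex > maxTotalComponents) {
    // LinkProgram is allowed to fail for this geometry shader
}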
If I increase the length then I get a link error (with empty log).
The spec does not require any linker or compiler messages at all. However, Nvidia usually provides quite good log messages. If you can reproduce the "link failure without log message" scenario in the most current driver version, it might be worth filing a bug report.
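For completeness, this is how the link status and log are usually retrieved (plain GL calls, assuming a GLEW-style loader header; not code from the question):
#include <string>
#include <vector>
#include <GL/glew.h>   // or whatever loader the project already uses

// Fetch whatever link log the driver provides for an already-linked program.
std::string getLinkLog(GLuint program) {
    GLint linked = GL_FALSE, logLength = 0;
    glGetProgramiv(program, GL_LINK_STATUS, &linked);     // GL_FALSE in the failure case described above
    glGetProgramiv(program, GL_INFO_LOG_LENGTH, &logLength);
    std::vector<char> log(logLength > 0 ? logLength : 1, '\0');
    glGetProgramInfoLog(program, (GLsizei)log.size(), nullptr, log.data());
    return std::string(log.data());   // may legitimately be empty on some drivers
}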

Related

What value of x would evaluate true for (x > 0. && x <= 0.) in WebGL/GLSL?

I have a thorny issue I'm trying to solve inside a WebGL GLSL fragment shader. For context, I'm implementing a flocking simulation using a ping-pong technique. Everything works properly on PC & Mac, but it is buggy on iOS (in Safari/Chrome/Firefox). There is no error, just behavior that is incorrect.
I've tracked it down to a place where I'm using a vec4 that is sampled from a texture, and that value seems corrupt in some way. If I do not use the sampled value, the results on iOS are similar to those on other devices.
I believe the sampled value has a corrupt, or otherwise unexpected value because of this behavior:
vec4 sampled = texture2D( inputTexture, textureCoordinate);
// a condition that shouldn't fire
// except perhaps for a 'special value' -
// (e.g. something like nil, null, NaN, undefined)
if (sampled.x > 0.0 && sampled.x <= 0.0) {
    // do something that modifies behavior
    // THIS FIRES! WHY!
}
Anyone have any clues that might point me in the right direction? Is there a good way to debug values within fragment shaders that might help?

OpenGL Crashes With Heavy Calculation [closed]

I am new to OpenGL. My first project consists of rendering a Mandelbrot set (which I find quite fascinating), and due to the nature of the calculations that have to be done I thought it would be better to do them on the GPU (basically I apply a complex function to each point of a part of the complex plane, many times, and I color the point based on the output: lots of parallelizable calculations, which seems like a good fit for a GPU, right?).
So everything works well when there aren't too many calculations for a single image, but as soon as pixels*iterations goes past about 9 billion, the program crashes (the image displayed shows that only part of it has been calculated; the cyan part is the initial background):
Dark Part of the Mandelbrot Set not Fully Calculated
In fact, if the total number of calculations is below this limit but close enough (say 8.5 billion) it will still crash, but it will take much more time. So I guess there is some kind of problem which doesn't appear at a sufficiently small number of calculations (it has always worked flawlessly until it got there). I have really no idea what it could be, since I am really new to this. When the program crashes it says: "Unhandled exception at 0x000000005DA6DD38 (nvoglv64.dll) in Mandelbrot Set.exe: Fatal program exit requested.". It is also always the same address that is reported (it only changes when I exit Visual Studio, my IDE).
Well here is the whole code, plus the shader files (vertex shader isn't doing anything, all calculations are in the fragment shader) :
EDIT:
Here is a link to all the .cpp and .h files of the project; the code is too large to be placed here, and it is correct anyway (though far from perfect):
https://github.com/JeffEkaka/Mandelbrot/tree/master
Here are the shaders :
NoChanges.vert (vertex shader)
#version 400
// Inputs
in vec2 vertexPosition; // 2D vec.
in vec4 vertexColor;
out vec2 fragmentPosition;
out vec4 fragmentColor;
void main() {
    gl_Position.xy = vertexPosition;
    gl_Position.z = 0.0;
    gl_Position.w = 1.0; // Default.
    fragmentPosition = vertexPosition;
    fragmentColor = vertexColor;
}
CalculationAndColorShader.frag (fragment shader)
#version 400
uniform int WIDTH;
uniform int HEIGHT;
uniform int iter;
uniform double xmin;
uniform double xmax;
uniform double ymin;
uniform double ymax;
void main() {
    dvec2 z, c;
    c.x = xmin + (double(gl_FragCoord.x) * (xmax - xmin) / double(WIDTH));
    c.y = ymin + (double(gl_FragCoord.y) * (ymax - ymin) / double(HEIGHT));
    int i;
    z = c;
    for(i = 0; i < iter; i++) {
        double x = (z.x * z.x - z.y * z.y) + c.x;
        double y = (z.y * z.x + z.x * z.y) + c.y;
        if((x * x + y * y) > 4.0) break;
        z.x = x;
        z.y = y;
    }
    float t = float(i) / float(iter);
    float r = 9*(1-t)*t*t*t;
    float g = 15*(1-t)*(1-t)*t*t;
    float b = 8.5*(1-t)*(1-t)*(1-t)*t;
    gl_FragColor = vec4(r, g, b, 1.0);
}
I am using SDL 2.0.5 and GLEW 2.0.0, and the latest version of OpenGL, I believe. The code has been compiled in Visual Studio (with the MSVC compiler, I believe) with some optimizations enabled. Also, I am using doubles even in my GPU calculations (I know they are ultra-slow, but I need their precision).
The first thing you need to understand is that "context switching" is different on GPUs (and, in general, most heterogeneous architectures) than it is on CPU/host architectures. When you submit a task to the GPU—in this case, "render my image"—the GPU will work solely on that task until completion.
There are a few details I'm abstracting away, naturally: Nvidia hardware will try to schedule smaller tasks on unused cores, and all three major vendors (AMD, Intel, NVidia) have some fine-tuned behaviors which complicate my above generalization, but as a matter of principle, you should assume that any task submitted to the GPU will consume the GPU's entire resources until completed.
On its own, that's not a big problem.
But on Windows (and most consumer operating systems), if the GPU spends too much time on a single task, the OS will assume that the GPU isn't responding and will do one of several different things (or possibly several of them):
Crash: doesn't happen so much anymore, but on older systems I have bluescreened my computers with over-ambitious Mandelbrot renders
Reset the driver: which means you'll lose all OpenGL state, and is essentially unrecoverable from the program's perspective
Abort the operation: Some newer device drivers are clever enough to simply kill the task rather than killing the entire context state. But this can depend on the specific API you're using: my OpenGL/GLSL based Mandelbrot programs tend to crash the driver, whereas my OpenCL programs usually have more elegant failures.
Let it go to completion, without issue: This will only happen if the GPU in question is not used by the Operating System as its display driver. So this is only an option if you have more than one Graphics card in your system and you explicitly ensure that rendering is happening on the Graphics Card not used by the OS, or if the card being used is a Compute Card that probably doesn't have display drivers associated with it. In OpenGL, this is basically a non-starter, but if you were using OpenCL or Vulkan, this might be a potential work-around.
The exact timing varies, but you should generally assume that if a single task takes more than 2 seconds, it'll crash the program.
So how do you fix this problem? Well, if this were an OpenCL-based render, it would be pretty easy:
std::vector<cl_event> events;
for(int32_t x = 0; x < WIDTH; x += KERNEL_SIZE) {
    for(int32_t y = 0; y < HEIGHT; y += KERNEL_SIZE) {
        int32_t render_start[2] = {x, y};
        int32_t render_end[2] = {std::min(WIDTH, x + KERNEL_SIZE), std::min(HEIGHT, y + KERNEL_SIZE)};
        events.emplace_back();
        //I'm abstracting the clEnqueueNDRangeKernel call
        submit_task(queue, kernel, render_start, render_end, &events.back(), /*...*/);
    }
}
clWaitForEvents(events.size(), events.data()); //clWaitForEvents takes (count, event_list)
In OpenGL, you can use the same basic principle, but things are made a bit more complicated because of how absurdly abstract the OpenGL model is. Because drivers want to bundle multiple draw calls together into a single command for the underlying hardware, you need to explicitly make them behave, or else the driver will bundle them all together and you'll get the exact same problem, even though you've specifically written the code to break up the task.
for(int32_t x = 0; x < WIDTH; x += KERNEL_SIZE) {
    for(int32_t y = 0; y < HEIGHT; y += KERNEL_SIZE) {
        int32_t render_start[2] = {x, y};
        int32_t render_end[2] = {std::min(WIDTH, x + KERNEL_SIZE), std::min(HEIGHT, y + KERNEL_SIZE)};
        render_portion_of_image(render_start, render_end);
        //The call to glFinish is the important part: otherwise, even breaking up
        //the task like this, the driver might still try to bundle everything together!
        glFinish();
    }
}
The exact appearance of render_portion_of_image is something you'll need to design yourself, but the basic idea is to specify to the program that only the pixels between render_start and render_end are to be rendered.
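One possible shape for it (just a sketch using the scissor test; draw_fullscreen_quad is a hypothetical stand-in for whatever draw call your program already issues):
#include <cstdint>
#include <GL/glew.h>   // or whatever loader you already use

// Hypothetical: whatever draw call your program already issues for the full image.
void draw_fullscreen_quad();

// Restrict rasterization to one tile; only the scissored pixels run the
// expensive fragment shader, so each submission stays short.
void render_portion_of_image(const int32_t render_start[2], const int32_t render_end[2]) {
    glEnable(GL_SCISSOR_TEST);
    glScissor(render_start[0], render_start[1],
              render_end[0] - render_start[0],
              render_end[1] - render_start[1]);
    draw_fullscreen_quad();
    glDisable(GL_SCISSOR_TEST);
}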
You might be wondering what the value of KERNEL_SIZE should be. That's something you'll have to experiment on your own, as it depends entirely on how powerful your graphics card is. The value should be
Small enough that no single task will ever take more than x quantity of time (I usually go for 50 milliseconds, but as long as you keep it below half a second, it's usually safe)
Large enough that you're not submitting hundreds of thousands of tiny tasks to the GPU. At a certain point, you'll spend more time synchronizing the Host←→GPU interface than actually doing work on the GPU, and since GPU architectures often have hundreds or even thousands of cores, if your tasks are too small, you'll lose speed simply by not saturating all the cores.
In my personal experience, the best way to determine it is to run a bunch of "testing" renders before the program starts, where you render the image at 10,000 iterations of the escape algorithm on a 32x32 image of the central bulb of the Mandelbrot Set (rendered all at once, with no breaking up of the algorithm), and see how long it takes. The algorithm I use essentially looks like this:
int32_t KERNEL_SIZE = 32;
std::chrono::nanoseconds duration{0};
while(KERNEL_SIZE < 2048 && duration < std::chrono::milliseconds(50)) {
    //duration_of is some code I've written to time the task. It's best to use GPU-based
    //profiling, as it'll be more accurate than host-based profiling.
    duration = duration_of([&]{ render_whole_image(KERNEL_SIZE); });
    if(duration < std::chrono::milliseconds(50)) {
        if(is_power_of_2(KERNEL_SIZE)) KERNEL_SIZE += KERNEL_SIZE / 2;
        else KERNEL_SIZE += KERNEL_SIZE / 3;
    }
}
final_kernel_size = KERNEL_SIZE;
The last thing I'd recommend is to use OpenCL for the heavy lifting of rendering the Mandelbrot set itself, and use OpenGL (including the OpenGL←→OpenCL interop API!) to actually display the image on screen. OpenCL is, on a technical level, going to be neither faster nor slower than OpenGL, but it gives you a lot of control over the operations you perform, and it's easier to reason about what the GPU is doing (and what you need to do to alter its behavior) when you're using a more explicit API than OpenGL. You could, if you want to stick to a single API, use Vulkan instead, but since Vulkan is extremely low-level and thus very complicated to use, I don't recommend that unless you're up to the challenge.
EDIT: A few other things:
I'd have multiple versions of the program: one that renders with floats, the other with doubles. In my version of this program, I actually have a variant that uses two float values to simulate a double, as described here (a minimal sketch of the idea follows after this list). On most hardware this can be slower, but on certain architectures (particularly NVidia's Maxwell architecture), because float processing is so much faster, it can actually outperform double by sheer throughput: on some GPU architectures, floats are 32x faster than doubles.
You might be tempted to have an "adaptive" algorithm that dynamically adjusts kernel size on the fly. This is more trouble than it's worth, and the time spent on the host reevaluating the next kernel size will outweigh any slight performance gains you otherwise achieve.
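To illustrate the float-float idea mentioned above, here is a minimal sketch, assuming strict IEEE single-precision arithmetic (no fast-math or FMA contraction); the names are mine, not taken from the linked article:
// A value stored as the unevaluated sum hi + lo of two floats.
struct ffloat { float hi, lo; };

// Split a double into a float-float pair: hi gets the leading bits,
// lo gets the rounding error of that conversion.
ffloat split(double d) {
    float hi = static_cast<float>(d);
    float lo = static_cast<float>(d - static_cast<double>(hi));
    return { hi, lo };
}

// Knuth-style two-sum plus renormalization: adds two float-float values
// with roughly twice the precision of a plain float.
ffloat add(ffloat a, ffloat b) {
    float s = a.hi + b.hi;
    float v = s - a.hi;
    float e = (a.hi - (s - v)) + (b.hi - v);  // exact rounding error of s
    e += a.lo + b.lo;
    float hi = s + e;
    float lo = e - (hi - s);
    return { hi, lo };
}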

glDrawTransformFeedbackStream: what does the stream refer to?

I ported this sample from g-truc to JOGL and it works; everything is fine, everything is nice.
But now I am trying to understand exactly what the stream of glDrawTransformFeedbackStream refers to.
Basically, a vec4 position input gets transformed into the two captured varyings
String[] strings = {"gl_Position", "Block.color"};
gl4.glTransformFeedbackVaryings(transformProgramName, 2, strings, GL_INTERLEAVED_ATTRIBS);
as follows:
void main()
{
    gl_Position = mvp * position;
    outBlock.color = vec4(clamp(vec2(position), 0.0, 1.0), 0.0, 1.0);
}
transform-stream.vert, transform-stream.geom
And then I simply render the transformed objects with glDrawTransformFeedbackStream
feedback-stream.vert, feedback-stream.frag
Now, based on the docs they say:
Specifies the index of the transform feedback stream from which to
retrieve a primitive count.
Cool, so if I bind my feedbackArrayBufferName to 0 here
gl4.glBindTransformFeedback(GL_TRANSFORM_FEEDBACK, feedbackName[0]);
gl4.glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, feedbackArrayBufferName[0]);
gl4.glBindTransformFeedback(GL_TRANSFORM_FEEDBACK, 0);
I guess it should be that.
Also, the geometry shader outputs (only) the color to stream 0. What about the positions? Are they assumed to already be on stream 0? How? From glTransformFeedbackVaryings?
Therefore, I tried to switch all the references to this stream to 1, to check whether they are all consistent and whether they really refer to the same index.
So I modified
gl4.glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 1, feedbackArrayBufferName[0]);
and
gl4.glDrawTransformFeedbackStream(GL_TRIANGLES, feedbackName[0], 1);
and also inside the geometry shader
out Block
{
    layout(stream = 1) vec4 color;
} outBlock;
But if I run it, I get:
Program link failed: 1
Link info
---------
error: Transform feedback can't capture varyings belonging to different vertex streams in a single buffer.
OpenGL Error(GL_INVALID_OPERATION): initProgram
GlDebugOutput.messageSent(): GLDebugEvent[ id 0x502
type Error
severity High: dangerous undefined behavior
source GL API
msg GL_INVALID_OPERATION error generated. <program> object is not successfully linked, or is not a program object.
when 1455183474230
source 4.5 (Core profile, arb, debug, compat[ES2, ES3, ES31, ES32], FBO, hardware) - 4.5.0 NVIDIA 361.43 - hash 0x225c78a9]
GlDebugOutput.messageSent(): GLDebugEvent[ id 0x502
type Error
severity High: dangerous undefined behavior
source GL API
msg GL_INVALID_OPERATION error generated. <program> has not been linked, or is not a program object.
when 1455183474232
source 4.5 (Core profile, arb, debug, compat[ES2, ES3, ES31, ES32], FBO, hardware) - 4.5.0 NVIDIA 361.43 - hash 0x225c78a9]
Trying to understand what's going on, I found this here:
Output variables in the Geometry Shader can be declared to go to a particular stream. This is controlled via an in-shader specification, but there are certain limitations that affect advanced component interleaving.
No two outputs that go to different streams can be captured by the same buffer. Attempting to do so will result in a linker error. So using multiple streams with interleaved writing requires using advanced interleaving to route attributes to different buffers.
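(For reference, the "advanced interleaving" mentioned there is exposed, in GL 4.0 / ARB_transform_feedback3, through the gl_NextBuffer separator passed to glTransformFeedbackVaryings. A hedged sketch in plain C-style GL rather than JOGL; positionBufferName and colorBufferName are hypothetical buffer objects, not from the sample:)
// Capture gl_Position (stream 0) into buffer binding 0 and Block.color
// (stream 1) into buffer binding 1, so no single buffer mixes streams.
const char* varyings[] = { "gl_Position", "gl_NextBuffer", "Block.color" };
glTransformFeedbackVaryings(transformProgramName, 3, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(transformProgramName);

glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, positionBufferName);  // stream 0 varyings
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 1, colorBufferName);     // stream 1 varyings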
Is that what happens to me: position going to stream 0 and color to stream 1?
I'd simply like to know whether my hypothesis is correct. And if it is, I want to prove it by changing the stream index.
Therefore I'd also like to know how I can put the position on stream 1 together with the color after my changes. Shall I modify the output of the geometry shader like this: layout(triangle_strip, max_vertices = 3, xfb_buffer = 1) out;?
Because it complains
Shader status invalid: 0(11) : error C7548: 'layout(xfb_buffer)' requires "#extension GL_ARB_enhanced_layouts : enable" before use
Then I add it and I get
error: Transform feedback can't capture varyings belonging to different vertex streams in a single buffer.
But now they should both be on stream 1, so what am I missing?
Moreover, what is the definition of a stream?

Using gl_SampleMask with multisample texture doesn't get per-sample blend?

I have a problem when using gl_SampleMask with a multisample texture.
To simplify the problem, I give this example.
I draw two triangles to a framebuffer with a 32x multisample texture attached.
The vertices of the triangles are (0,0) (100,0) (100,1) and (0,0) (0,1) (100,1).
In the fragment shader, I have code like this:
#extension GL_NV_sample_mask_override_coverage : require
layout(override_coverage) out int gl_SampleMask[];
...
out_color = vec4(1,0,0,1);
coverage_mask = gen_mask( gl_FragCoord.x / 100.0 * 8.0 );
gl_SampleMask[0] = coverage_mask;
The function int gen_mask(int X) generates an integer with X ones in its binary representation.
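(For reference, a C-style sketch of what such a mask generator presumably does; this is my reconstruction, not the actual shader code:)
// An unsigned integer whose lowest X bits are set, e.g. gen_mask(4) == 0xF.
// X is assumed to be in [0, 31]; shifting by 32 or more would be undefined.
unsigned gen_mask(int X) {
    return (X <= 0) ? 0u : ((1u << X) - 1u);
}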
Hopefully, I'd see 100 pixels filled with solid red.
But actually I got alpha-blended output. The pixel at (50,0) shows (1,0.25,0.25), which looks like two draws of (1,0,0,0.5) on a (1,1,1,1) background.
However, if I break up the coverage_mask, check gl_SampleID in the fragment shader, and write (1,0,0,1) or (0,0,0,0) to the output color according to the gl_SampleID'th bit of coverage_mask,
if ((coverage_mask >> gl_SampleID) & (1 == 1)) {
    out_color = vec4(1,0,0,1);
} else {
    out_color = vec4(0,0,0,0);
}
I got 100 red pixels as expected.
I've checked the OpenGL wiki and documentation but didn't find why the behavior changes here.
And I'm using an Nvidia GTX 980 with driver version 361.43 on Windows 10.
I'll put the test code on GitHub later if necessary.
When the texture has 32 samples, Nvidia's implementation splits one pixel into four small fragments, each having 8 samples. So in each fragment shader invocation only an 8-bit gl_SampleMask is available.
OK, let's assume that's true. How do you suppose NVIDIA implements this?
Well, the OpenGL specification does not allow them to implement this by changing the effective size of gl_SampleMask. It makes it very clear that the size of the sample mask must be large enough to hold the maximum number of samples supported by the implementation. So if GL_MAX_SAMPLES returns 32, then gl_SampleMask must have 32 bits of storage.
So how would they implement it? Well, there's one simple way: the coverage mask. They give each of the 4 fragments a separate 8 bits of coverage mask that they write their outputs to. Which would work perfectly fine...
Until you overrode the coverage mask with override_coverage. This now means all 4 fragment shader invocations can write to the same samples as other FS invocations.
Oops.
I haven't directly tested NVIDIA's implementation to be certain of that, but it is very much consistent with the results you get. Each FS instance in your code will write to, at most, 8 samples. The same 8 samples. 8/32 is 0.25, which is exactly what you get: 0.25 of the color you wrote. Even though 4 FS's may be writing for the same pixel, each one is writing to the same 25% of the coverage mask.
There's no "alpha-blended output"; it's just doing what you asked.
As to why your second code works... well, you fell victim to one of the classic C/C++ (and therefore GLSL) blunders: operator precedence. Allow me to parenthesize your condition to show you what the compiler thinks you wrote:
((coverage_mask >> gl_SampleID) & (1 == 1))
Equality testing has a higher precedence than any bitwise operation. So it gets grouped like this. Now, a conformant GLSL implementation should have failed to compile because of that, since the result of 1 == 1 is a boolean, which cannot be used in a bitwise & operation.
Of course, NVIDIA has always had a tendency to play fast-and-loose with GLSL, so it doesn't surprise me that they allow this nonsense code to compile. Much like C++. I have no idea what this code would actually do; it depends on how a true boolean value gets transformed into an integer. And GLSL doesn't define such an implicit conversion, so it's up to NVIDIA to decide what that means.
The traditional condition for testing a bit is this:
(coverage_mask & (0x1 << gl_SampleID))
It also avoids undefined behavior if coverage_mask isn't an unsigned integer.
Of course, doing the condition correctly should give you... the exact same answer as the first one.

Confusion about maximum output from Geometry Shaders

The OpenGL-Wiki states on the output limitations of geometry shaders:
The first limit, defined by GL_MAX_GEOMETRY_OUTPUT_VERTICES, is the maximum number that can be provided to the max_vertices output layout qualifier.
[...]
The other limit, defined by GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS, is [...] the total number of output values (a component, in GLSL terms, is a component of a vector. So a float is one component; a vec3 is 3 components).
That's what the declarative part of my geometry shader looks like:
layout( triangles ) in;
layout( triangle_strip, max_vertices = 300 ) out;
out vec4 var1;
My vertex format consists only of 4 floats for the position.
So I believe I have the 4 components from the varying var1 plus the 4 components from the position, i.e. 8 in total.
I have queried the following values for the constants mentioned above:
GL_MAX_GEOMETRY_OUTPUT_VERTICES = 36320
GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS = 36321
With max_vertices set to 300, a total of 8*300 = 2400 components would be written. Needless to say, this value is far below 36321, just as the 300 of max_vertices is far below 36320. So everything should be okay, right?
However, when I build the shader, the linking fails:
error C6033: Hardware limitation reached, can only emit 128 vertices of this size
Can somebody explain to me what is going on and why this doesn't work as I expected?
I made a really dumb mistake. For the record, if somebody else is having the same issue: querying the values of GL_MAX_GEOMETRY_OUTPUT_VERTICES and GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS must be done through glGetIntegerv, not by just evaluating those macros (which are only the enum tokens).
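For example (a minimal sketch; the tokens themselves are just enum values, which is why evaluating the macros gave numbers like 36320):
GLint maxVertices = 0, maxTotalComponents = 0;
glGetIntegerv(GL_MAX_GEOMETRY_OUTPUT_VERTICES, &maxVertices);              // actual implementation limit
glGetIntegerv(GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS, &maxTotalComponents);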