The OpenGL Wiki states the following about the output limitations of geometry shaders:
The first limit, defined by GL_MAX_GEOMETRY_OUTPUT_VERTICES, is the maximum number that can be provided to the max_vertices output layout qualifier.
[...]
The other limit, defined by GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS is [...] the total number of output values (a component, in GLSL terms, is a component of a vector. So a float is one component; a vec3 is 3 components).
This is what the declaration part of my geometry shader looks like:
layout( triangles ) in;
layout( triangle_strip, max_vertices = 300 ) out;
out vec4 var1;
My vertex format only consists of 4 floats for position.
So I believe I have the 4 components of the varying var1 plus the 4 components of the position, i.e. 8 in total.
I have queried the following values for the constants mentioned above:
GL_MAX_GEOMETRY_OUTPUT_VERTICES = 36320
GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS = 36321
With max_vertices set to 300, a total of 8*300 = 2400 components would be written. Needless to say, that value is far below 36321, just as the 300 of max_vertices is far below 36320. So everything should be okay, right?
However, when I build the shader, the linking fails:
error C6033: Hardware limitation reached, can only emit 128 vertices of this size
Can somebody explain to me what is going on and why this doesn't work as I expected?
I made a really dumb mistake. For the record, in case somebody else runs into the same issue: querying the values of GL_MAX_GEOMETRY_OUTPUT_VERTICES and GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS must be done through glGetIntegerv and not by just evaluating those macros (the macros are only the enum constants, which is where the 36320 and 36321 above come from).
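For the record, a minimal sketch of the correct query (plain C against the GL API, assuming a current context):
GLint maxVertices = 0, maxTotalComponents = 0;
// The macros are only enum constants (0x8DE0 == 36320, 0x8DE1 == 36321);
// the actual limits come back through glGetIntegerv.
glGetIntegerv(GL_MAX_GEOMETRY_OUTPUT_VERTICES, &maxVertices);
glGetIntegerv(GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS, &maxTotalComponents);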
Given a compute shader where I have set the local size of each dimension to the values x, y and z, is there any way for me to access that information from the C++ code? I.e.:
// Pseudocode (C++)
int size[3];
size = get local sizes from linked compute shader;
print(size);
// GLSL code
layout (local_size_x = a number, local_size_y = a number, local_size_z = a number) in;
After searching around, I found the following on Khronos.org, on its reference page for glGetProgramiv:
https://www.khronos.org/registry/OpenGL-Refpages/es3/html/glGetProgramiv.xhtml
GL_COMPUTE_WORK_GROUP_SIZE
params returns an array of three integers containing the local work group size of the compute program as specified by its input layout qualifier(s). program must be the name of a program object that has been previously linked successfully and contains a binary for the compute shader stage.
This means the line I needed is
glGetProgramiv(ComputeShaderID, GL_COMPUTE_WORK_GROUP_SIZE, localWorkGroupSize);
where localWorkGroupSize is an array of 3 integers.
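Put together, a minimal sketch (plain C here; the JOGL call takes the same arguments), assuming ComputeShaderID names a successfully linked program containing a compute shader:
GLint localWorkGroupSize[3] = {0, 0, 0};
// Fills the array with local_size_x, local_size_y and local_size_z from the layout qualifier.
glGetProgramiv(ComputeShaderID, GL_COMPUTE_WORK_GROUP_SIZE, localWorkGroupSize);
printf("local size: %d x %d x %d\n",
       localWorkGroupSize[0], localWorkGroupSize[1], localWorkGroupSize[2]);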
I ported this sample from g-truc to JOGL and it works; everything fine, everything nice.
But now I am trying to understand exactly what the stream of glDrawTransformFeedbackStream refers to.
Basically, a vec4 position input gets transformed into the two captured varyings
String[] strings = {"gl_Position", "Block.color"};
gl4.glTransformFeedbackVaryings(transformProgramName, 2, strings, GL_INTERLEAVED_ATTRIBS);
as follows:
void main()
{
gl_Position = mvp * position;
outBlock.color = vec4(clamp(vec2(position), 0.0, 1.0), 0.0, 1.0);
}
transform-stream.vert, transform-stream.geom
And then I simply render the transformed objects with glDrawTransformFeedbackStream
feedback-stream.vert, feedback-stream.frag
Now, the docs say:
Specifies the index of the transform feedback stream from which to
retrieve a primitive count.
Cool, so if I bind my feedbackArrayBufferName to 0 here
gl4.glBindTransformFeedback(GL_TRANSFORM_FEEDBACK, feedbackName[0]);
gl4.glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, feedbackArrayBufferName[0]);
gl4.glBindTransformFeedback(GL_TRANSFORM_FEEDBACK, 0);
I guess the stream should refer to that binding index.
Also, the geometry shader outputs (only) the color to index 0. What about the positions? Are they assumed to already be on stream 0? How? From glTransformFeedbackVaryings?
Therefore, I tried to switch all the references to this stream to 1, to check whether they are all consistent and whether they really refer to the same index.
So I modified
gl4.glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 1, feedbackArrayBufferName[0]);
and
gl4.glDrawTransformFeedbackStream(GL_TRIANGLES, feedbackName[0], 1);
and also inside the geometry shader
out Block
{
layout(stream = 1) vec4 color;
} outBlock;
But when I run it, I get:
Program link failed: 1
Link info
---------
error: Transform feedback can't capture varyings belonging to different vertex streams in a single buffer.
OpenGL Error(GL_INVALID_OPERATION): initProgram
GlDebugOutput.messageSent(): GLDebugEvent[ id 0x502
type Error
severity High: dangerous undefined behavior
source GL API
msg GL_INVALID_OPERATION error generated. <program> object is not successfully linked, or is not a program object.
when 1455183474230
source 4.5 (Core profile, arb, debug, compat[ES2, ES3, ES31, ES32], FBO, hardware) - 4.5.0 NVIDIA 361.43 - hash 0x225c78a9]
GlDebugOutput.messageSent(): GLDebugEvent[ id 0x502
type Error
severity High: dangerous undefined behavior
source GL API
msg GL_INVALID_OPERATION error generated. <program> has not been linked, or is not a program object.
when 1455183474232
source 4.5 (Core profile, arb, debug, compat[ES2, ES3, ES31, ES32], FBO, hardware) - 4.5.0 NVIDIA 361.43 - hash 0x225c78a9]
Trying to figure out what's going on, I found this:
Output variables in the Geometry Shader can be declared to go to a particular stream. This is controlled via an in-shader specification, but there are certain limitations that affect advanced component interleaving.
No two outputs that go to different streams can be captured by the same buffer. Attempting to do so will result in a linker error. So using multiple streams with interleaved writing requires using advanced interleaving to route attributes to different buffers.
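In C-API terms, "routing attributes to different buffers" would look roughly like this (a sketch only, not code from the sample: the gl_NextBuffer separator needs OpenGL 4.0 / ARB_transform_feedback3, and the two buffer names are placeholders):
// One varying per stream, separated by gl_NextBuffer so each lands in its own buffer binding.
const char* varyings[] = { "gl_Position",   // stream 0 -> buffer binding 0
                           "gl_NextBuffer", // advance to the next binding point
                           "Block.color" }; // stream 1 -> buffer binding 1
glTransformFeedbackVaryings(transformProgramName, 3, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(transformProgramName);
// Each stream then gets its own buffer object bound at its own index.
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, positionFeedbackBufferName); // placeholder name
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 1, colorFeedbackBufferName);    // placeholder name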
Is that what is happening to me? Position going to stream 0 and color to stream 1?
I'd simply like to know whether my hypothesis is correct, and if so, I want to prove it by changing the stream index.
Therefore I'd also like to know how I can put the position on stream 1 together with the color after my changes. Shall I modify the output of the geometry shader in this way: layout(triangle_strip, max_vertices = 3, xfb_buffer = 1) out;?
Because it complains
Shader status invalid: 0(11) : error C7548: 'layout(xfb_buffer)' requires "#extension GL_ARB_enhanced_layouts : enable" before use
Then I add it and I get
error: Transform feedback can't capture varyings belonging to different vertex streams in a single buffer.
But now they should both be on stream 1, so what am I missing?
Moreover, what is the definition of a stream?
I ran into a problem when using gl_SampleMask with a multisample texture.
To simplify the problem, here is an example.
I draw two triangles into a framebuffer with a 32x multisample texture attached.
The vertices of the triangles are (0,0) (100,0) (100,1) and (0,0) (0,1) (100,1).
In the fragment shader, I have code like this:
#extension GL_NV_sample_mask_override_coverage : require
layout(override_coverage) out int gl_SampleMask[];
...
out_color = vec4(1,0,0,1);
coverage_mask = gen_mask( gl_FragCoord.x / 100.0 * 8.0 );
gl_SampleMask[0] = coverage_mask;
The function int gen_mask(int X) generates an integer with X 1s in its binary representation.
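For reference, a minimal sketch of such a function (my own illustration, assuming 0 <= X <= 31):
// Returns an int with the X lowest bits set, e.g. gen_mask(3) == 7 (binary 111).
int gen_mask(int X)
{
    return (1 << X) - 1;
}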
I had hoped to see 100 pixels filled with solid red.
But I actually get alpha-blended output. The pixel at (50,0) shows (1,0.25,0.25), which looks like two (1,0,0,0.5) fragments drawn onto a (1,1,1,1) background.
However, if I break up the coverage_mask, check gl_SampleID in the fragment shader, and write (1,0,0,1) or (0,0,0,0) to the output color depending on the bit of coverage_mask that corresponds to gl_SampleID,
if ((coverage_mask >> gl_SampleID) & (1 == 1) ) {
out_color = vec4(1,0,0,1);
} else {
out_color = vec4(0,0,0,0);
}
I get the 100 red pixels as expected.
I've checked the OpenGL wiki and documentation but didn't find out why the behavior changes here.
I'm using an NVIDIA GTX 980 with driver version 361.43 on Windows 10.
I can put the test code on GitHub later if necessary.
When the texture has 32 samples, NVIDIA's implementation splits one pixel into four small fragments, each with 8 samples. So in each fragment shader invocation only an 8-bit gl_SampleMask is available.
OK, let's assume that's true. How do you suppose NVIDIA implements this?
Well, the OpenGL specification does not allow them to implement this by changing the effective size of gl_SampleMask. It makes it very clear that the size of the sample mask must be large enough to hold the maximum number of samples supported by the implementation. So if GL_MAX_SAMPLES returns 32, then gl_SampleMask must have 32 bits of storage.
So how would they implement it? Well, there's one simple way: the coverage mask. They give each of the 4 fragments a separate 8 bits of coverage mask that they write their outputs to. Which would work perfectly fine...
Until you overrode the coverage mask with override_coverage. This now means all 4 fragment shader invocations can write to the same samples as other FS invocations.
Oops.
I haven't directly tested NVIDIA's implementation to be certain of that, but it is very much consistent with the results you get. Each FS invocation in your code will write to, at most, 8 samples. The same 8 samples. 8/32 is 0.25, which is exactly what you get: 0.25 of the color you wrote. Even though 4 FS invocations may be writing to the same pixel, each one is writing to the same 25% of the coverage mask.
There's no "alpha-blended output"; it's just doing what you asked.
As to why your second code works... well, you fell victim to one of the classic C/C++ (and therefore GLSL) blunders: operator precedence. Allow me to parenthesize your condition to show you what the compiler thinks you wrote:
((coverage_mask >> gl_SampleID) & (1 == 1))
Equality testing has a higher precedence than any bitwise operation. So it gets grouped like this. Now, a conformant GLSL implementation should have failed to compile because of that, since the result of 1 == 1 is a boolean, which cannot be used in a bitwise & operation.
Of course, NVIDIA has always had a tendency to play fast-and-loose with GLSL, so it doesn't surprise me that they allow this nonsense code to compile. Much like C++. I have no idea what this code would actually do; it depends on how a true boolean value gets transformed into an integer. And GLSL doesn't define such an implicit conversion, so it's up to NVIDIA to decide what that means.
The traditional condition for testing a bit is this:
(coverage_mask & (0x1 << gl_SampleID))
It also avoids undefined behavior if coverage_mask isn't an unsigned integer.
Of course, doing the condition correctly should give you... the exact same answer as the first one.
I have a shader program with a for loop in the geometry shader. The program links (and operates) fine when the for loop length is small enough. If I increase the length then I get a link error (with empty log). The shaders compile fine in both cases. Here is the geometry shader code (with everything I thought relevant):
#version 330
layout (points) in;
layout (triangle_strip, max_vertices = 256) out;
...
void main()
{
...
for(int i = 0 ; i < 22 ; ++i) // <-- Works with 22, not with 23.
{
...
EmitVertex();
...
EmitVertex();
...
EmitVertex();
...
EmitVertex();
EndPrimitive();
}
}
The specs state: "non-terminating loops are allowed. The consequences of very long or non-terminating loops are platform dependent." Could this be a platform dependent situation (GeForce GT 640)? As the shader code evolved, the max length of the for loop changed (more code -> smaller max), leading me to suspect it has something to do with loop unrolling. Can anyone give me any more info on this issue? (Let me know if you need more code/description.)
One possible reason for a failure to link programs containing geometry shaders is the GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS limit. Section 11.3.4.5 "Geometry Shader Outputs" of the OpenGL 4.5 core profile specification states (my emphasis):
There are two implementation-dependent limits on the value of GEOMETRY_VERTICES_OUT; it may not exceed the value of MAX_GEOMETRY_OUTPUT_VERTICES, and the product of the total number of vertices and the sum of all components of all active output variables may not exceed the value of MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS. LinkProgram will fail if it determines that the total component limit would be violated.
The GL guarantees that this total component limit is at least 1024.
You did not paste the full code of your shaders, so it is unclear how many components per vertex you are using, but this might be the reason for the link failure.
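To get a feel for the budget (my own back-of-the-envelope sketch, not code from the question):
GLint maxTotalComponents = 0;
glGetIntegerv(GL_MAX_GEOMETRY_TOTAL_OUTPUT_COMPONENTS, &maxTotalComponents);
// With "layout (triangle_strip, max_vertices = 256) out;" and C active output
// components per vertex (gl_Position alone is 4), linking requires
//     256 * C <= maxTotalComponents.
// At the guaranteed minimum of 1024, that leaves 1024 / 256 = 4 components,
// i.e. room for gl_Position and nothing else.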
If I increase the length then I get a link error (with empty log).
The spec does not require any linker or compiler messages at all. However, Nvidia usually provides quite good log messages. If you can reproduce the "link failure without log message" scenario in the most current driver version, it might be worth filing a bug report.
So what I need is simple: each time we run our shader (meaning for each pixel) I need to calculate a random matrix of 1s and 0s with resolution == originalImageResolution. How do I do such a thing?
For now I have created one for Shadertoy. The random matrix resolution is set to 15 by 15 here, because the GPU makes Chrome crash quite often when I try something like 200 by 200, while what I really need is the full image resolution:
#ifdef GL_ES
precision highp float;
#endif
uniform vec2 resolution;
uniform float time;
uniform sampler2D tex0;
float rand(vec2 co){
return fract(sin(dot(co.xy ,vec2(12.9898,78.233))) * (43758.5453+ time));
}
vec3 getOne(){
vec2 p = gl_FragCoord.xy / resolution.xy;
vec3 one;
for(int i=0;i<15;i++){
for(int j=0;j<15;j++){
if(rand(p)<=0.5)
one = (one.xyz + texture2D(tex0,vec2(j,i)).xyz)/2.0;
}
}
return one;
}
void main(void)
{
gl_FragColor = vec4(getOne(),1.0);
}
And one for Adobe Pixel Bender:
<languageVersion: 1.0;>
kernel random
< namespace : "Random";
vendor : "Kabumbus";
version : 3;
description : "not as random as needed, not as fast as needed"; >
{
input image4 src;
output float4 outputColor;
float rand(float2 co, float2 co2){
return fract(sin(dot(co.xy ,float2(12.9898,78.233))) * (43758.5453 + (co2.x + co2.y )));
}
float4 getOne(){
float4 one;
float2 r = outCoord();
for(int i=0;i<200;i++){
for(int j=0;j<200;j++){
if(rand(r, float2(i,j))>=1.0)
one = (one + sampleLinear(src,float2(j,i)))/2.0;
}
}
return one;
}
void
evaluatePixel()
{
float4 oc = getOne();
outputColor = oc;
}
}
So my real problem is: my shaders make my GPU driver crash. How can I use GLSL for the same purpose as now, but without crashing and, if possible, faster?
Update:
What I want to create is called a single-pixel camera (google Compressive Imaging or Compressive Sensing); I want to create a GPU-based software implementation.
The idea is simple:
we have an image of size NxM.
For each pixel in the image we want the GPU to perform the following operations:
generate an NxM matrix of random values, 0s and 1s;
compute the arithmetic mean of all pixels of the original image whose coordinates correspond to the coordinates of 1s in our random NxM matrix;
output the result of the arithmetic mean as the pixel color.
What I tried to implement in my shaders was a simulation of that very process.
What is really stupid about trying to do this on the GPU: Compressive Sensing does not require us to compute the NxM matrix of such arithmetic mean values; it needs just a part of it (for example 1/3). So I am putting pressure on the GPU that I do not need to. However, testing on more data is not always a bad idea.
Thanks for adding more detail to clarify your question. My comments are getting too long, so I'm converting them into an answer. Moving my comments in here to keep them together:
Sorry to be slow, but I am trying to understand the problem and the goal. In your GLSL sample, I don't see a matrix being generated. I see a single vec3 being generated by summing a random selection (varying over time) of cells from a 15 x 15 texture (matrix). And that vec3 is recomputed for each pixel. Then the vec3 is used as the pixel color.
So I'm not clear whether you really want to create a matrix, or just want to compute a value for every pixel. The latter is in some sense a 'matrix', but computing a simple random value for 200 x 200 pixels would not strain your graphics driver. Also you said you wanted to use the matrix. So I don't think that's what you mean.
I'm trying to understand why you want a matrix - to preserve a consistent random basis for all the pixels? If so, you can either precompute a random texture, or use a consistent pseudorandom function like you have in rand() except not use time. You clearly know about that so I guess I still don't understand the goal. Why are you summing a random selection of cells from the texture, for each pixel?
I believe the reason your shader is crashing is that your main() function is exceeding its time limit - either for a single pixel, or for the whole set of pixels. Calling rand() 40,000 times per pixel (in a 200 * 200 nested loop) could certainly explain that!
If you have 200 x 200 pixels and are calling sin() 40,000 times for each one, that's 1,600,000,000 calls per frame. Poor GPU!
I'm hopeful that if we understand the goal better, we'll be able to recommend a more efficient way to get the effect you want.
Update.
(Deleted this part, since it was mistaken. Even though many cells in the source matrix may each contribute less than a visually detectable amount of color to the result, the total of the many cells can contribute a visually detectable amount of color.)
New update based on updated question.
OK, (thinking "out loud" here so you can check whether I'm understanding correctly...) Since you need each of the random NxM values only once, there is no actual requirement to store them in a matrix; the values can simply be computed on demand and then thrown away. That's why your example code above does not actually generate a matrix.
This means we cannot get away from generating (NxM)^2 random values per frame, that is, NxM random values per pixel, and there are NxM pixels. So for N=M=200, that's 1.6 billion random values per frame.
However, we can still optimize some things.
First, since your random values only need to be one bit each (you only need a boolean answer to decide whether to include each cell from the source texture into the mix), you can probably use a cheaper pseudo random number generator. The one you're using outputs much more random data per call than one bit. For example, you could call the same PRNG function as you're using now, but store the value and extract 32 random bits out of it. Or at least several, depending on how many are random enough. In addition, instead of using a sin() function, if you have extension GL_EXT_gpu_shader4 (for bitwise operators), you could use something like this:
int LFSR_Rand_Gen(in int n)
{
// <<, ^ and & require GL_EXT_gpu_shader4.
n = (n << 13) ^ n;
return (n * (n*n*15731+789221) + 1376312589) & 0x7fffffff;
}
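One possible way to use it (my sketch, same C-style bitwise syntax as above): draw one value per row of the random matrix and peel off individual bits, instead of one full rand() call per cell. "seed" and "i" are illustrative names here:
int bits = LFSR_Rand_Gen(seed + i);  // one PRNG call per row
for (int j = 0; j < 15; j++) {
    if (((bits >> j) & 1) == 1) {
        // include cell (j, i) of the source texture in the running total
    }
}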
Second, you are currently performing one divide operation per included cell (/2.0), which is probably relatively expensive, unless the compiler and GPU are able to optimize it into a bit shift (is that possible for floating point?). This also will not give the arithmetic mean of the input values, as discussed above... it will put much more weight on the later values and very little on the earlier ones. As a solution, keep a count of how many values are being included, and divide by that count once, after the loop is finished.
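A sketch of that accumulate-then-divide structure (C-style pseudocode; include_this_cell and cell_value are hypothetical stand-ins for the random test and the texture fetch):
float sum = 0.0f;
int count = 0;
for (int k = 0; k < cells; k++) {      // "cells" = number of candidate cells
    if (include_this_cell(k)) {        // hypothetical random inclusion test
        sum += cell_value(k);          // hypothetical fetch from the source image
        count++;
    }
}
// A single divide at the end gives the true arithmetic mean of the included cells.
float mean = (count > 0) ? sum / (float)count : 0.0f;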
Whether these optimizations will be enough for your GPU driver to handle 200x200 pixels, each sampling 200x200 cells, per frame, I don't know. They should definitely enable you to increase your resolution substantially.
Those are the ideas that occur to me off the top of my head. I am far from being a GPU expert though. It would be great if someone more qualified can chime in with suggestions.
P.S. In your comment, you jokingly (?) mentioned the option of precomputing N*M NxM random matrices. Maybe that's not a bad idea?? 40,000x40,000 is a big texture (40MB at least), but if you store 32 bits of random data per cell, that comes down to 1250 x 40,000 cells. Too bad vanilla GLSL doesn't help you with bitwise operators to extract the data, but even if you don't have the GL_EXT_gpu_shader4 extension you can still fake it. (Maybe you would also need a special extension then for non-square textures?)