Possible bug in glGetUniformLocation with AMD drivers - c++

Using a Radeon HD 3870 I've run into strange behavior with AMD's drivers (updated yesterday to the newest version available).
First of all I'd like to note that the whole code runs without problems on an NVIDIA GeForce 540M, and that glGetUniformLocation isn't failing all the time.
My problem is that glGetUniformLocation returns strange values for one of the shader programs used in my app, while the other shader program doesn't show this flaw. I switch between the two shaders every frame, so I'm sure the problem isn't temporary and is tied to that particular shader. By strange values I mean something like 17-26, while I only have 9 uniforms present. My shader interface afterwards queries the type of the GLSL variable using the value just obtained, and as a side effect it also queries the variable's name. For all of those 17-26 locations the returned name was empty and the type wasn't set either. I then got the idea to debug into the interface (which is a separate library) and change those values to what I'd expect: 0-8. Using the debugger I changed them, and indeed the proper variable names in that shader were returned, and the types were correct too.
My question is: how can code that always works on NVIDIA, and also works for the other shader on the Radeon, fail for another shader that is treated exactly the same way?
I include the relevant part of the interface below:
// this fails to return the expected value
m_location = glGetUniformLocation(m_program.getGlID(), m_name.c_str());
printGLError();
if (m_location == -1) {
    std::cerr << "ERROR: Uniform " << m_name << " doesn't exist in program" << std::endl;
    return FAILURE;
}
GLsizei charSize = m_name.size() + 1, size = 0, length = 0;
GLenum type = 0;
GLchar* name = new GLchar[charSize];
name[charSize - 1] = '\0';
glGetActiveUniform(m_program.getGlID(), m_location, charSize, &length, &size, &type, name);
delete[] name; name = 0;
if (!TypeResolver::resolve(type, m_type))
    return FAILURE;
m_prepared = true;
m_applied = false;

The index you pass to glGetActiveUniform is not supposed to be a uniform location. Uniform locations are only used with the glUniform* calls; nothing else.
The index you pass to glGetActiveUniform is just an index between 0 and the value returned by glGetProgramiv(program, GL_ACTIVE_UNIFORMS, ...). It is used to ask which uniforms exist and to inspect the properties of those uniforms.
Your code works on NVIDIA only because you got lucky. The OpenGL specification doesn't guarantee that uniform locations are assigned in the same order as the active uniform indices. AMD's drivers don't happen to work that way, so your code fails there.
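If it helps, here is a minimal sketch of enumerating uniforms the intended way, iterating over active-uniform indices and only then asking for locations by name (the variable names are illustrative, and program is assumed to be a successfully linked program object):
GLint uniformCount = 0;
glGetProgramiv(program, GL_ACTIVE_UNIFORMS, &uniformCount);
for (GLint i = 0; i < uniformCount; ++i)
{
    GLchar name[256];
    GLsizei length = 0;
    GLint size = 0;
    GLenum type = GL_ZERO;
    // 'i' is an active-uniform index, not a location
    glGetActiveUniform(program, i, sizeof(name), &length, &size, &type, name);
    // the location is what the glUniform* calls expect; query it by name
    GLint location = glGetUniformLocation(program, name);
    std::cout << name << " -> location " << location << std::endl;
}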


glfwSwapBuffers slow (>3s)

Paul Aner is looking for a canonical answer:
I think the reason for this question is clear: I want the main loop NOT to lock up while a compute shader is processing larger amounts of data. I could try to split the data into smaller chunks, but if the computations were done on the CPU, I would simply start a thread and everything would run smoothly. Although I would of course have to wait until the calculation thread delivers new data before updating the screen, the GUI (ImGui) would not lock up...
I have written a program that does some calculations on a compute shader; the returned data is then displayed. This works perfectly, except that program execution is blocked while the shader is running (see the code below), and depending on the parameters this can take a while:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    GLfloat* mapped = (GLfloat*)(glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY));
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}

int main()
{
    // Initialization stuff
    // ...
    while (glfwWindowShouldClose(Window) == 0)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glfwPollEvents();
        glfwSwapInterval(2); // Doesn't matter what I put here
        CalculateSomething(Result);
        Render(Result);
        glfwSwapBuffers(Window.WindowHandle);
    }
}
To keep the main loop running while the compute shader is calculating, I changed CalculateSomething to something like this:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

bool GPU_busy()
{
    GLint GPU_status;
    if (GPU_sync == NULL)
        return false;
    else
    {
        glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &GPU_status);
        return GPU_status == GL_UNSIGNALED;
    }
}
These two functions are part of a class, and it would get a little messy and complicated if I posted all of it here (if more code is needed, tell me). So on every loop iteration, when the class is told to do the computation, it first checks whether the GPU is busy. If the GPU is done, the result is copied to CPU memory (or a new calculation is started); otherwise it returns to main without doing anything else. This approach works in the sense that it produces the right result, but my main loop is still blocked.
Some timing revealed that CalculateSomething, Render (and everything else) run fast, as I would expect them to. But glfwSwapBuffers now takes >3000 ms (depending on how long the compute shader's calculations take).
Shouldn't it be possible to swap buffers while a compute shader is running? Rendering the result seems to work fine and without delay (as long as the compute shader isn't done yet, the old result should get rendered). Or am I missing something here (do queued OpenGL calls get processed before glfwSwapBuffers does anything)?
Edit:
I'm not sure why this question got closed or what additional information is needed (maybe other than the OS, which is Windows). As for "desired behavior": well, I'd like the glfwSwapBuffers call not to block my main loop. For additional information, please ask...
As pointed out by Erdal Küçük, an implicit call of glFlush might cause latency. I put this call before glfwSwapBuffers for testing purposes and timed it; no latency there...
I'm sure I can't be the only one who has ever run into this problem. Maybe someone could try to reproduce it? Simply put a compute shader in the main loop that takes a few seconds to do its calculations. I have read that similar problems occur especially when calling glMapBuffer. This seems to be an issue with the GPU driver (mine is an integrated Intel GPU). But nowhere have I read about latencies above 200 ms...
I solved a similar issue with a GL_PIXEL_PACK_BUFFER effectively used as an offscreen compute shader. The approach with fences is correct, but you then need a separate function that checks the status of the fence by reading GL_SYNC_STATUS with glGetSynciv. The solution (admittedly in Java) can be found here.
An explanation for why this is necessary can be found in Nick Clark's comment:
Every call in OpenGL is asynchronous, except for the frame buffer swap, which stalls the calling thread until all submitted functions have been executed. Thus, the reason why glfwSwapBuffers seems to take so long.
The relevant portion from the solution is:
public void finishHMRead( int pboIndex ){
    int[] length = new int[1];
    int[] status = new int[1];
    GLES30.glGetSynciv( hmReadFences[ pboIndex ], GLES30.GL_SYNC_STATUS, 1, length, 0, status, 0 );
    int signalStatus = status[0];
    int glSignaled = GLES30.GL_SIGNALED;
    if( signalStatus == glSignaled ){
        // Ready a temporary ByteBuffer for mapping (we'll unmap the pixel buffer and lose this) and a permanent ByteBuffer
        ByteBuffer pixelBuffer;
        texLayerByteBuffers[ pboIndex ] = ByteBuffer.allocate( texWH * texWH );
        // map data to a bytebuffer
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, pbos[ pboIndex ] );
        pixelBuffer = ( ByteBuffer ) GLES30.glMapBufferRange( GLES30.GL_PIXEL_PACK_BUFFER, 0, texWH * texWH * 1, GLES30.GL_MAP_READ_BIT );
        // Copy to the long term ByteBuffer
        pixelBuffer.rewind(); // copy from the beginning
        texLayerByteBuffers[ pboIndex ].put( pixelBuffer );
        // Unmap and unbind the currently bound pixel buffer
        GLES30.glUnmapBuffer( GLES30.GL_PIXEL_PACK_BUFFER );
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, 0 );
        Log.i( "myTag", "Finished copy for pbo data for " + pboIndex + " at: " + (System.currentTimeMillis() - initSphereStart) );
        acknowledgeHMReadComplete();
    } else {
        // If it wasn't done, resubmit for another check in the next render update cycle
        RefMethodwArgs finishHmRead = new RefMethodwArgs( this, "finishHMRead", new Object[]{ pboIndex } );
        UpdateList.getRef().addRenderUpdate( finishHmRead );
    }
}
Basically, fire off the compute shader, then wait until the glGetSynciv check of GL_SYNC_STATUS returns GL_SIGNALED, then rebind the GL_SHADER_STORAGE_BUFFER and perform the glMapBuffer operation.
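Translated back to the question's C++/OpenGL setup, a minimal sketch of that pattern might look like this (GPU_sync, X, Y and the ssbo handle are assumptions carried over from the question's code, not a drop-in implementation):
// Poll the fence without blocking; only map and copy once the GPU has signaled.
bool TryFetchResult(GLfloat* Result)
{
    if (GPU_sync == nullptr)
        return false;                       // nothing in flight

    GLint status = GL_UNSIGNALED;
    glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &status);
    if (status != GL_SIGNALED)
        return false;                       // still busy; keep rendering the old result

    glDeleteSync(GPU_sync);
    GPU_sync = nullptr;

    // The dispatch has finished, so mapping should no longer stall the pipeline.
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    GLfloat* mapped = (GLfloat*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0,
                                                 sizeof(GLfloat) * X * Y, GL_MAP_READ_BIT);
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    return true;
}
The main loop would call this every frame and only use Result when it returns true; otherwise it keeps rendering the previous data.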

Problem testing DTid.x Direct3D ComputeShader HLSL

I'm attempting to write a fairly simple compute shader that computes a simple moving average.
It is my first shader where I had to test DTid.x against certain conditions in the logic.
The shader works and the moving average is calculated as expected, except (ugh) for the case DTid.x = 0, where I get a bad result.
It seems my testing of the value DTid.x is somehow corrupted, or not possible, for the case DTid.x = 0.
I may be missing some fundamental understanding of how compute shaders work, as this piece of code seems super simple but doesn't behave as I'd expect.
Hopefully someone can tell me why this code doesn't work for the case DTid.x = 0.
For example, I simplified the shader to...
[numthreads(1024, 1, 1)]
void CSSimpleMovingAvgDX(uint3 DTid : SV_DispatchThreadID)
{
    // I added below trying to limit the logic.
    // I initially had it check for a range like >50 and <100 and this did work as expected.
    // But I saw that my value at DTid.x = 0 was corrupted and I started to work on solving why. But no luck.
    // It is just the case of DTid.x = 0 where this shader does not work.
    if (DTid.x > 0)
    {
        return;
    }
    nAvgCnt = 1;
    ft0 = asfloat(BufferY0.Load(DTid.x * 4)); // load data at the actual DTid.x location
    if (DTid.x > 0) // to avoid loading a second value for averaging
    {
        // somehow this code is still being called for case DTid.x = 0 ?
        nAvgCnt = nAvgCnt + 1;
        ft1 = asfloat(BufferY0.Load((DTid.x - 1) * 4)); // load data value at the previous DTid.x location
    }
    if (nAvgCnt > 1) // if DTid.x was larger than 0, then we should have loaded ft1 and we can average ft0 and ft1
    {
        result = ((ft0 + ft1) / ((float)nAvgCnt));
    }
    else
    {
        result = ft0;
    }
    // And when I add the code below, which should override the code above, the result is still corrupted?
    if (DTid.x < 2)
        result = ft0;
    llByteOffsetLS = ((DTid.x) * dwStrideSomeBuffer);
    BufferOut0.Store(llByteOffsetLS, asuint(result)); // store result; all good except for case DTid.x = 0
}
I am compiling the shader with FXC. My actual shader was slightly more involved than the above. When I added the /Od option (disable optimizations) the code behaved as expected. Without /Od I tried to refactor the code over and over with no luck; eventually I changed the variable names in every section so the compiler would treat them separately, and that finally worked. So the lesson I learned is: never reuse a variable in any way. Another, worst-case, option would be to disassemble the compiled shader to understand how it was optimized. If you're attempting a large shader with several conditions/branches, I'd start with /Od and only remove it later, and avoid reusing variables, or you may start chasing problems that are not truly problems.
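For reference, a typical FXC command line with optimizations disabled might look like the following (the file names are placeholders; the entry point matches the shader above):
fxc /T cs_5_0 /E CSSimpleMovingAvgDX /Od /Fo MovingAvg.cso MovingAvg.hlsl
Comparing the /Od build against the optimized build is usually the quickest way to tell whether the compiler's optimizer, rather than the shader logic, is responsible.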

glGetProgramIv() doesn't change the pointer sent to it

I have this function:
void GetAllUniforms(unsigned int prog) {
    int total = -1;
    glGetProgramiv(prog, GL_LINK_STATUS, &total);
    printf("total & prog: %d %d\n", total, prog);
    for (int i = 0; i < total; ++i) {
        int name_len = -1, num = -1;
        GLenum type = GL_ZERO;
        char name[100];
        glGetActiveUniform(prog, i, sizeof(name) - 1,
                           &name_len, &num, &type, name);
        name[name_len] = 0;
        unsigned int location = glGetUniformLocation(prog, name);
        printf("Name: %s/Number: %d/Location: %d\n", name, num, location);
    }
}
Its purpose is simple: get all uniforms and print them.
However, let's say I run this function.
It gives me:
Running newpmo/build/out
total & prog: -1 1
It's clear there are no active uniforms in my GLSL shaders, which is ridiculous because I have a WorldMatrix multiplying a position vector to produce gl_Position. Upon further debugging, I found that if I set total to another number, say 5, I get this:
Running newpmo/build/out
total & prog: 5 1
Name: /Number: -1/Location: -1
Name: /Number: -1/Location: -1
Name: /Number: -1/Location: -1
Name: /Number: -1/Location: -1
Name: /Number: -1/Location: -1
This is pretty odd, because it shows that glGetProgramiv is not actually affecting total in any way, even though I passed it the address and the program is compiled and linked properly!
How is that possible?
GL_LINK_STATUS returns whether the program has been successfully linked or not. So even if that line had worked correctly, the result is not a value you should be looping over. Given what you're trying to do, the value you want to query is GL_ACTIVE_UNIFORMS.
However, on to your principal problem. You are getting some kind of OpenGL error from this (BTW: always check for OpenGL errors). I know that because the only reason that function would not set the variable is if it errored out. So go fetch the error.
Since it's not an obvious error (like passing the wrong enum or something), what's probably happening isn't visible from here: you're not passing a valid program object to this function. You could be passing a shader object instead of a program object, or you could be passing a value you didn't get from glCreateProgram.
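A minimal error-checking sketch along those lines (the helper name and label string are just for illustration):
// Drain and print every pending OpenGL error at a given point.
void CheckGLErrors(const char* label)
{
    for (GLenum err = glGetError(); err != GL_NO_ERROR; err = glGetError())
        printf("GL error at %s: 0x%04X\n", label, err);
}

// Usage, e.g. right after switching the query to the intended parameter:
// glGetProgramiv(prog, GL_ACTIVE_UNIFORMS, &total);
// CheckGLErrors("glGetProgramiv");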

Why can't I add an int member to my GLSL shader input/output block?

I'm getting an "invalid operation" error when attempting to call glUseProgram with the program containing the fragment shader below. The error only occurs when I try to add an int member to the block definition. Note that I keep the block definition the same in both the vertex and fragment shaders. I don't even have to access the member! Merely adding that field to the vertex and fragment shader copies of the block definition causes the program to fail.
#version 450
...
in VSOutput // and of course "out" in the vertex shader
{
    vec4 color;
    vec4 normal;
    vec2 texCoord;
    //int foo; // uncommenting this line causes "invalid operation"
} vs_output;
I also get the same issue when using free-standing in/out variables of the same type, though in those cases I only see the error if I access those variables directly; if I ignore them, I assume the compiler optimizes them away and so the error doesn't occur. It's almost as if I'm only allowed to pass around vectors and matrices...
What am I missing here? I haven't been able to find anything in the documentation indicating that this should be a problem.
EDIT: padding the block out with a float[2] to force the int member onto the next 16-byte boundary did not work either.
EDIT: solved, as per the answer below. It turns out I could have figured this out much more quickly if I'd checked the shader program's info log. Here's my code to do that:
bool checkProgramLinkStatus(GLuint programId)
{
    auto log = logger("Shaders");
    GLint status;
    glGetProgramiv(programId, GL_LINK_STATUS, &status);
    if (status == GL_TRUE)
    {
        log << "Program link successful." << endlog;
        return true;
    }
    return false;
}

bool checkProgramInfoLog(GLuint programId)
{
    auto log = logger("Shaders");
    GLint infoLogLength;
    glGetProgramiv(programId, GL_INFO_LOG_LENGTH, &infoLogLength);
    GLchar* strInfoLog = new GLchar[infoLogLength + 1];
    glGetProgramInfoLog(programId, infoLogLength, NULL, strInfoLog);
    if (infoLogLength == 0)
    {
        log << "No error message was provided" << endlog;
    }
    else
    {
        log << "Program link error: " << std::string(strInfoLog) << endlog;
    }
    delete[] strInfoLog; // free the info log buffer
    return false;
}
(As already pointed out in the comments): The GL will never interpolate integer types. To quote the GLSL spec (version 4.50), section 4.3.4 "Input Variables":
Fragment shader inputs that are signed or unsigned integers, integer vectors, or any double-precision floating-point type must be qualified with the interpolation qualifier flat.
This of course also applies to the corresponding outputs in the previous stage.
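Applied to the block from the question, that means flat-qualifying the integer member; a minimal sketch with the rest of the block unchanged:
in VSOutput // and "out" in the vertex shader
{
    vec4 color;
    vec4 normal;
    vec2 texCoord;
    flat int foo; // integer members must use flat interpolation on the fragment input
} vs_output;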

OpenCL struct values correct on CPU but not on GPU

I have a struct in a file which is included by both the host code and the kernel:
typedef struct {
    float x, y, z,
          dir_x, dir_y, dir_z;
    int radius;
} WorklistStruct;
I'm building this struct in my C++ host code and passing it via a buffer to the OpenCL kernel.
If I choose a CPU device for the computation I get the following result:
printf ( "item:[%f,%f,%f][%f,%f,%f]%d,%d\n", item.x, item.y, item.z, item.dir_x, item.dir_y,
item.dir_z , item.radius ,sizeof(float));
Host:
item:[20.169043,7.000000,34.933712][0.000000,-3.000000,0.000000]1,4
Device (CPU):
item:[20.169043,7.000000,34.933712][0.000000,-3.000000,0.000000]1,4
And if I choose a GPU device (AMD) for the computation, weird things happen:
Host:
item:[58.406261,57.786015,58.137501][2.000000,2.000000,2.000000]2,4
Device (GPU):
item:[58.406261,2.000000,0.000000][0.000000,0.000000,0.000000]0,0
Notably, the sizeof(float) is garbage on the GPU.
I assume there is a problem with the layout of the floats on the different devices.
Note: the struct is contained in an array of structs of this type, and every struct in this array is garbage on the GPU.
Does anyone have an idea why this is the case and how I can predict it?
EDIT: I added a %d at the end and passed a literal 1 for it; the result is: 1065353216
EDIT: here are the two structs I'm using:
typedef struct {
    float x, y, z,              // base coordinates
          dir_x, dir_y, dir_z;  // direction
    int radius;                 // radius
} WorklistStruct;

typedef struct {
    float base_x, base_y, base_z; // base point
    float radius;                 // radius
    float dir_x, dir_y, dir_z;    // initial direction
} ReturnStruct;
I tested some other things; it looks like a problem with printf. The values seem to be right: I wrote them into the return struct, read them back, and those values were correct.
I don't want to post all of the related code; that would be a few hundred lines. If no one has an idea, I'll try to compress it a bit.
Ah, and for printing I'm using #pragma OPENCL EXTENSION cl_amd_printf : enable.
Edit:
It really looks like a problem with printf. I simply don't use it anymore.
There is a simple method to check what happens:
1 - Create host-side data & initialize it:
int num_points = 128;
std::vector<WorklistStruct> works(num_points);
std::vector<ReturnStruct> returns(num_points);
for (WorklistStruct &work : works) {
    work = InitializeItSomehow();
    std::cout << work.x << " " << work.y << " " << work.z << std::endl;
    std::cout << work.radius << std::endl;
}
// Same stuff with returns
...
2 - Create Device-side buffers using COPY_HOST_PTR flag, map it & check data consistency:
cl::Buffer dev_works(..., COPY_HOST_PTR, (void*)&works[0]);
cl::Buffer dev_rets(..., COPY_HOST_PTR, (void*)&returns[0]);
// Then map it to check data
WorklistStruct *mapped_works = dev_works.Map(...);
ReturnStruct *mapped_rets = dev_rets.Map(...);
// Output values & unmap buffers
...
3 - Check data consistency on Device side as you did previously.
Also, make sure that the code (presumably a header) which is included by both the kernel and the host-side code is pure OpenCL C (the AMD compiler can sometimes "swallow" errors), and that you've added the include directory for header searching when building the OpenCL kernel (the "-I" flag at the clBuildProgram stage).
Edited:
At every step, please collect return codes (or catch exceptions). Besides that, the "-Werror" flag at the clBuildProgram stage can also be helpful.
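On the layout side, one way to rule out host/device disagreements is a single shared header in which the host uses the fixed-width cl_* scalar types. A minimal sketch, assuming the header is included by both the kernel and the C++ host code (the explicit _pad member is optional and purely illustrative):
#ifdef __OPENCL_C_VERSION__            /* compiled as OpenCL C (kernel side) */
typedef struct {
    float x, y, z;
    float dir_x, dir_y, dir_z;
    int   radius;
    int   _pad;                        /* optional: keeps the total size identical on both sides */
} WorklistStruct;
#else                                  /* compiled as C/C++ (host side) */
#include <CL/cl_platform.h>
typedef struct {
    cl_float x, y, z;
    cl_float dir_x, dir_y, dir_z;
    cl_int   radius;
    cl_int   _pad;
} WorklistStruct;
#endif
With matching 32-bit scalars in the same order, both compilers produce the same member offsets, so copying the array through a buffer preserves every field.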
It looks like I used the wrong OpenCL headers for compiling. If I run the code on the Intel platform (OpenCL 1.2) everything is fine, but on my AMD platform (OpenCL 1.1) I get weird values.
I will try other headers.