I have this function:
void GetAllUniforms(unsigned int prog) {
    int total = -1;
    glGetProgramiv(prog, GL_LINK_STATUS, &total);
    printf("total & prog: %d %d\n", total, prog);
    for (int i = 0; i < total; ++i) {
        int name_len = -1, num = -1;
        GLenum type = GL_ZERO;
        char name[100];
        glGetActiveUniform(prog, i, sizeof(name) - 1,
                           &name_len, &num, &type, name);
        name[name_len] = 0;
        unsigned int location = glGetUniformLocation(prog, name);
        printf("Name: %s/Number: %d/Location: %d\n", name, num, location);
    }
}
Its purpose is simple: get all uniforms and print them. However, when I run this function, it gives me:
Running newpmo/build/out
total & prog: -1 1
Apparently there are no active uniforms in my GLSL shaders, which is ridiculous because I have a WorldMatrix multiplying a position vector to get gl_Position. Upon further debugging, I found that if I set total to another number, say 5, I get this:
Running newpmo/build/out
total & prog: 5 1
Name: /Number: -1/Location: -1
Name: /Number: -1/Location: -1
Name: /Number: -1/Location: -1
Name: /Number: -1/Location: -1
Name: /Number: -1/Location: -1
This is pretty odd, because it shows that glGetProgramiv is not actually modifying total at all, even though I passed it the variable's address and the program is compiled and linked properly!
How is that possible?
GL_LINK_STATUS returns whether the program has been successfully linked or not. So even if that line had worked correctly, the result is not a value you should be looping over. Given what you're trying to do, the value you want to query is GL_ACTIVE_UNIFORMS.
However, on to your principal problem. You are getting some kind of OpenGL error from this (by the way: always check for OpenGL errors). I know that because the only reason that function would not set the variable is if it errored out. So go fetch the error.
Since it's not an obvious error (like passing the wrong enum), the cause is probably something not visible from here: you're not passing a valid program object to this function. You could be passing a shader object instead of a program object, or a value you didn't get from glCreateProgram at all.
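For reference, a corrected sketch of the enumeration loop might look like the following (this is an illustration, not the asker's code; it assumes an OpenGL loader/header is already included, a context is current, and `prog` really came from glCreateProgram):

```cpp
#include <cstdio>

// Sketch: enumerate active uniforms by querying GL_ACTIVE_UNIFORMS
// (a count) rather than GL_LINK_STATUS (a boolean), checking for
// errors along the way.
void PrintActiveUniforms(GLuint prog) {
    GLint total = 0;
    glGetProgramiv(prog, GL_ACTIVE_UNIFORMS, &total);  // count of uniforms, not link status
    if (glGetError() != GL_NO_ERROR) {
        std::fprintf(stderr, "glGetProgramiv failed -- is prog a valid program object?\n");
        return;
    }
    for (GLint i = 0; i < total; ++i) {
        char name[100];
        GLsizei name_len = 0;
        GLint size = 0;
        GLenum type = GL_ZERO;
        // bufSize includes room for the null terminator GL writes for us
        glGetActiveUniform(prog, i, sizeof(name), &name_len, &size, &type, name);
        GLint location = glGetUniformLocation(prog, name);  // may differ from i
        std::printf("Name: %s / Size: %d / Location: %d\n", name, size, location);
    }
}
```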
Paul Aner is looking for a canonical answer:
I think the reason for this question is clear: I want the main loop to NOT lock up while a compute shader is processing larger amounts of data. I could try to separate the data into smaller chunks, but if the computations were done on the CPU, I would simply start a thread and everything would run nice and smoothly. Although I would of course have to wait until the calculation thread delivers new data to update the screen, the GUI (ImGui) would not lock up...
I have written a program that does some calculations on a compute shader, and the returned data is then displayed. This works perfectly, except that program execution is blocked while the shader is running (see code below), and depending on the parameters this can take a while:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    GLfloat* mapped = (GLfloat*)glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}
int main()
{
    // Initialization stuff
    // ...
    while (glfwWindowShouldClose(Window) == 0)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glfwPollEvents();
        glfwSwapInterval(2); // Doesn't matter what I put here
        CalculateSomething(Result);
        Render(Result);
        glfwSwapBuffers(Window.WindowHandle);
    }
}
To keep the main loop running while the compute shader is calculating, I changed CalculateSomething to something like this:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
bool GPU_busy()
{
    GLint GPU_status;
    if (GPU_sync == NULL)
        return false;
    else
    {
        glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &GPU_status);
        return GPU_status == GL_UNSIGNALED;
    }
}
These two functions are part of a class, and it would get messy and complicated if I posted all of it here (if more code is needed, tell me). So every loop iteration, when the class is told to do the computation, it first checks whether the GPU is busy. If the GPU is done, the result is copied to CPU memory (or a new calculation is started); otherwise it returns to main without doing anything else. Anyway, this approach works in that it produces the right result, but my main loop is still blocked.
Some timing revealed that CalculateSomething, Render (and everything else) run fast, as I would expect them to. But glfwSwapBuffers now takes >3000 ms (depending on how long the compute shader's calculations take).
Shouldn't it be possible to swap buffers while a compute shader is running? Rendering the result seems to work fine and without delay (as long as the compute shader isn't done yet, the old result should be rendered). Or am I missing something here (do queued OpenGL calls get processed before glfwSwapBuffers does anything)?
Edit:
I'm not sure why this question got closed and what additional information is needed (maybe other than the OS, which would be Windows). As for "desired behavior": Well - I'd like the glfwSwapBuffers-call not to block my main loop. For additional information, please ask...
As pointed out by Erdal Küçük, an implicit call of glFlush might cause latency. I put this call before glfwSwapBuffers for testing purposes and timed it - no latency here...
I'm sure I can't be the only one who ever ran into this problem. Maybe someone could try to reproduce it? Simply put a compute shader in the main loop that takes a few seconds to do its calculations. I have read somewhere that similar problems occur especially when calling glMapBuffer. This seems to be an issue with the GPU driver (mine is an integrated Intel GPU), but nowhere have I read about latencies above 200 ms...
I solved a similar issue with a GL_PIXEL_PACK_BUFFER, effectively used as an offscreen compute shader. The approach with fences is correct, but you then need a separate function that checks the status of the fence by using glGetSynciv to read GL_SYNC_STATUS. The solution (admittedly in Java) can be found here.
An explanation for why this is necessary can be found in Nick Clark's comment:
Every call in OpenGL is asynchronous, except for the frame buffer swap, which stalls the calling thread until all submitted functions have been executed. Thus, the reason why glfwSwapBuffers seems to take so long.
The relevant portion from the solution is:
public void finishHMRead( int pboIndex ){
    int[] length = new int[1];
    int[] status = new int[1];
    GLES30.glGetSynciv( hmReadFences[ pboIndex ], GLES30.GL_SYNC_STATUS, 1, length, 0, status, 0 );
    int signalStatus = status[0];
    int glSignaled = GLES30.GL_SIGNALED;
    if( signalStatus == glSignaled ){
        // Ready a temporary ByteBuffer for mapping (we'll unmap the pixel buffer and lose this) and a permanent ByteBuffer
        ByteBuffer pixelBuffer;
        texLayerByteBuffers[ pboIndex ] = ByteBuffer.allocate( texWH * texWH );
        // map data to a bytebuffer
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, pbos[ pboIndex ] );
        pixelBuffer = ( ByteBuffer ) GLES30.glMapBufferRange( GLES30.GL_PIXEL_PACK_BUFFER, 0, texWH * texWH * 1, GLES30.GL_MAP_READ_BIT );
        // Copy to the long term ByteBuffer
        pixelBuffer.rewind(); // copy from the beginning
        texLayerByteBuffers[ pboIndex ].put( pixelBuffer );
        // Unmap and unbind the currently bound pixel buffer
        GLES30.glUnmapBuffer( GLES30.GL_PIXEL_PACK_BUFFER );
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, 0 );
        Log.i( "myTag", "Finished copy for pbo data for " + pboIndex + " at: " + (System.currentTimeMillis() - initSphereStart) );
        acknowledgeHMReadComplete();
    } else {
        // If it wasn't done, resubmit for another check in the next render update cycle
        RefMethodwArgs finishHmRead = new RefMethodwArgs( this, "finishHMRead", new Object[]{ pboIndex } );
        UpdateList.getRef().addRenderUpdate( finishHmRead );
    }
}
Basically, fire off the compute shader, then wait for the glGetSynciv check of GL_SYNC_STATUS to equal GL_SIGNALED, then rebind the GL_SHADER_STORAGE_BUFFER and perform the glMapBuffer operation.
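The same polling pattern can be sketched for desktop OpenGL in C++ (names here are hypothetical, not from the question's class; this assumes a current GL context, an SSBO bound to GL_SHADER_STORAGE_BUFFER, and X/Y dispatch dimensions defined elsewhere):

```cpp
// Sketch: non-blocking compute-shader readback via a fence object.
GLsync fence = 0;

void StartCalculation() {
    // load uniforms ...
    glDispatchCompute(X, Y, 1);
    fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();  // ensure the fence (and the dispatch) are actually submitted
}

// Returns true and fills `result` only once the GPU has finished;
// otherwise returns immediately so the main loop keeps running.
bool TryFetchResult(GLfloat* result, size_t count) {
    if (!fence) return false;
    GLint status = GL_UNSIGNALED;
    glGetSynciv(fence, GL_SYNC_STATUS, 1, nullptr, &status);
    if (status != GL_SIGNALED)
        return false;  // still busy; try again next frame
    GLfloat* mapped = (GLfloat*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0,
                                                 count * sizeof(GLfloat),
                                                 GL_MAP_READ_BIT);
    memcpy(result, mapped, count * sizeof(GLfloat));
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    glDeleteSync(fence);
    fence = 0;
    return true;
}
```

Calling TryFetchResult once per frame keeps the swap from waiting on the compute work, because glMapBufferRange is only issued after the fence has signaled.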
I need an implementation of upper_bound as described in the STL for my metal compute kernel. Not having anything in the metal standard library, I essentially copied it from <algorithm> into my shader file like so:
static device float* upper_bound( device float* first, device float* last, float val)
{
    ptrdiff_t count = last - first;
    while( count > 0){
        device float* it = first;
        ptrdiff_t step = count/2;
        it += step;
        if( !(val < *it)){
            first = ++it;
            count -= step + 1;
        } else {
            count = step;
        }
    }
    return first;
}
I created a simple kernel to test it like so:
kernel void upper_bound_test(
    device float* input [[buffer(0)]],
    device uint* output [[buffer(1)]]
)
{
    device float* where = upper_bound( input, input + 5, 3.1);
    output[0] = where - input;
}
For this test, the input size and search value are hardcoded. I also hardcoded a 5-element input buffer on the framework side, as you'll see below. I expect this kernel to return the index of the first input element greater than 3.1.
It doesn't work. In fact output[0] is never written--as I preloaded the buffer with a magic number to see if it gets over-written. It doesn't. In fact after waitUntilCompleted, commandBuffer.error looks like this:
Error Domain = MTLCommandBufferErrorDomain
Code = 1
NSLocalizedDescription = "IOAcceleratorFamily returned error code 3"
What does error code 3 mean? Did my kernel get killed before it had a chance to finish?
Further, I tried just a linear search version of upper_bound like so:
static device float* upper_bound2( device float* first, device float* last, float val)
{
    while( first < last && *first <= val)
        ++first;
    return first;
}
This one works (sort of). I have the same problem with a binary-search lower_bound from <algorithm> - yet a naive linear version works (sort of). BTW, I tested my STL-copied versions in straight C code (with device removed, obviously) and they work fine outside of shader-land. Please tell me I'm doing something wrong and this is not a Metal compiler bug.
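To illustrate that last point, a host-side copy of the binary-search routine (a sketch with the `device` qualifiers removed and a hypothetical name) is perfectly well-behaved plain C++:

```cpp
#include <cstddef>

// Host-side copy of the shader's binary-search upper_bound: returns a
// pointer to the first element greater than val in [first, last).
static const float* upper_bound_host(const float* first, const float* last, float val)
{
    std::ptrdiff_t count = last - first;
    while (count > 0) {
        const float* it = first;
        std::ptrdiff_t step = count / 2;
        it += step;
        if (!(val < *it)) {       // *it <= val: answer lies strictly to the right
            first = ++it;
            count -= step + 1;
        } else {                  // val < *it: answer is at it or to its left
            count = step;
        }
    }
    return first;
}
```

On the sorted input {1, 2, 3, 4, 5} with val = 3.1 it returns the element at index 3, matching what the kernel is expected to produce.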
Now about that "sort of" above: the linear search versions work on a 5s and a mini 2 (A7s), returning index 3 in the example above, but on a 6+ (A8) they give the right answer + 2^31. What the heck! Same exact code. Note that on the framework side I use uint32_t and on the shader side I use uint - which are the same thing. Note also that every pointer subtraction (ptrdiff_t is a signed 8-byte type) yields a small non-negative value. Why is the 6+ setting that high-order bit? And of course, why don't my real binary-search versions work?
Here is the framework side stuff:
id<MTLFunction> upperBoundTestKernel = [_library newFunctionWithName: @"upper_bound_test"];
id<MTLComputePipelineState> upperBoundTestPipelineState = [_device
    newComputePipelineStateWithFunction: upperBoundTestKernel
    error: &err];

float sortedNumbers[] = {1., 2., 3., 4., 5.};
id<MTLBuffer> testInputBuffer = [_device
    newBufferWithBytes: (const void *)sortedNumbers
    length: sizeof(sortedNumbers)
    options: MTLResourceCPUCacheModeDefaultCache];
id<MTLBuffer> testOutputBuffer = [_device
    newBufferWithLength: sizeof(uint32_t)
    options: MTLResourceCPUCacheModeDefaultCache];
*(uint32_t*)testOutputBuffer.contents = 42; // magic number better get clobbered

id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
id<MTLComputeCommandEncoder> commandEncoder = [commandBuffer computeCommandEncoder];
[commandEncoder setComputePipelineState: upperBoundTestPipelineState];
[commandEncoder setBuffer: testInputBuffer offset: 0 atIndex: 0];
[commandEncoder setBuffer: testOutputBuffer offset: 0 atIndex: 1];
[commandEncoder
    dispatchThreadgroups: MTLSizeMake( 1, 1, 1)
    threadsPerThreadgroup: MTLSizeMake( 1, 1, 1)];
[commandEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];

uint32_t answer = *(uint32_t*)testOutputBuffer.contents;
Well, I've found a solution/work-around. I guessed it was a pointer-aliasing problem since first and last pointed into the same buffer. So I changed them to offsets from a single pointer variable. Here's a re-written upper_bound2:
static uint upper_bound2( device float* input, uint first, uint last, float val)
{
    while( first < last && input[first] <= val)
        ++first;
    return first;
}
And a re-written test kernel:
kernel void upper_bound_test(
    device float* input [[buffer(0)]],
    device uint* output [[buffer(1)]]
)
{
    output[0] = upper_bound2( input, 0, 5, 3.1);
}
This worked--completely. That is, not only did it fix the "sort-of" problem for the linear search, but a similarly re-written binary search worked too. I don't want to believe this, though. The Metal shading language is supposed to be a subset of C++, yet standard pointer semantics don't work? Can I really not compare or subtract pointers?
Anyway, I don't recall seeing any docs saying there can be no pointer aliasing or what declaration incantation would help me here. Any more help?
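For completeness, the offset-based routine can be sanity-checked on the host in plain C++ (a sketch; the host name and standard types replacing Metal's `device`/`uint` are illustrative):

```cpp
// Host-side copy of the offset-based linear search: returns the index of
// the first element greater than val in input[first, last).
static unsigned int upper_bound2_host(const float* input, unsigned int first,
                                      unsigned int last, float val)
{
    while (first < last && input[first] <= val)
        ++first;
    return first;
}
```

On {1, 2, 3, 4, 5} with val = 3.1 this returns 3, the same index the shader version produces once the aliasing problem is avoided.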
[UPDATE]
For the record, as pointed out by "slime" on Apple's dev forum:
https://developer.apple.com/library/ios/documentation/Metal/Reference/MetalShadingLanguageGuide/func-var-qual/func-var-qual.html#//apple_ref/doc/uid/TP40014364-CH4-SW3
"Buffers (device and constant) specified as argument values to a graphics or kernel function cannot be aliased—that is, a buffer passed as an argument value cannot overlap another buffer passed to a separate argument of the same graphics or kernel function."
But it's also worth noting that upper_bound() is not a kernel function and upper_bound_test() is not passed aliased arguments. What upper_bound_test() does do is create a local temporary that points into the same buffer as one of its arguments. Perhaps the docs should say what it means, something like: "No pointer aliasing to device and constant memory in any function is allowed including rvalues." I don't actually know if this is too strong.
I am writing some code, and recently I found an error. The simplified version is shown below.
#include <stdio.h>
#include <cuda.h>

#define DEBUG 1

inline void check_cuda_errors(const char *filename, const int line_number)
{
#ifdef DEBUG
    cudaThreadSynchronize();
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess)
    {
        printf("CUDA error at %s:%i: %s\n", filename, line_number, cudaGetErrorString(error));
        exit(-1);
    }
#endif
}

__global__ void make_input_matrix_zp()
{
    unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
    printf("col: %d (%d*%d+%d) row: %d (%d*%d+%d) \n", col, blockIdx.x, blockDim.x, threadIdx.x, row, blockIdx.y, blockDim.y, threadIdx.y);
}

int main()
{
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(6, 6, 1);
    make_input_matrix_zp<<<gridDim, blockDim>>>();
    //check_cuda_errors(__FILE__, __LINE__);
    return 0;
}
The first, inline function checks for CUDA errors. The second, the kernel function, simply calculates the current thread's indices, stores them in 'row' and 'col', and prints those values. I assume there is no problem with the inline function, since it comes from a reliable source.
The problem is that when I run the program, the kernel function does not seem to execute even though it is called in the main function. However, if I delete the comment marker '//' in front of check_cuda_errors, the program does seem to enter the kernel function and shows some values printed by printf. But it does not show the full combination of 'col' and 'row' indices. In detail, the y block index hardly varies: it only shows values of 4 and 5, but never 0, 1, 2, or 3.
The first thing I do not understand:
As far as I know, 'gridDim' gives the dimensions of the grid, in blocks. That means the block indices should run through the combinations (0,0)(0,1)(0,2)(0,3)(0,4)(0,5)(1,0)(1,1)(1,2)(1,3)... and so on, and each block is 16 by 16 threads. However, when I run this program, it does not show the full set of combinations; it just shows several combinations and then ends.
The second thing I do not understand:
Why is the kernel function dependent on the function named 'check_cuda_errors'? When this function is present, the program at least runs, although imperfectly. But when the error-checking function is commented out, the kernel function does not print any values.
This is very simple code, but I couldn't find the problem for several days. Is there anything I missed? Or do I understand something wrong?
My working environment is like this.
"GeForce GT 630"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 2.1
Ubuntu 14.04
The CUDA GPU printf subsystem relies on a FIFO buffer to store printed output. If your output exceeds the size of the buffer, some or all of the previous content of the FIFO buffer will be overwritten by subsequent output. This is what will be happening in this case.
You can query and change the size of the buffer using the runtime API with cudaDeviceGetLimit and cudaDeviceSetLimit. If your device has the resources available to expand the limit, you should be able to see all the output your code emits.
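A hedged sketch of that query-and-resize step (the factor of 8 is arbitrary and illustrative; this assumes the CUDA runtime API headers are available):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: inspect the current device-side printf FIFO size and enlarge it
// before launching kernels that print heavily.
void enlarge_printf_fifo()
{
    size_t fifo_size = 0;
    cudaDeviceGetLimit(&fifo_size, cudaLimitPrintfFifoSize);
    std::printf("printf FIFO is currently %zu bytes\n", fifo_size);
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, fifo_size * 8);
}
```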
As an aside, relying on the kernel printf feature for anything other than simple diagnostics or lightweight debugging is a terrible idea, and you have probably just proven to yourself that you should be looking at other methods of verifying the correctness of your code.
Regarding your second question, the printf buffer is flushed to output only when the host synchronizes with the device. For example, with a call to cudaDeviceSynchronize, cudaThreadSynchronize, cudaMemcpy, and others (see B.17.2 Limitations of the Formatted Output appendix).
When check_cuda_errors is uncommented, calling cudaThreadSynchronize is what triggers the buffer to be printed. When it is commented, the main thread simply terminates before the kernel gets to run to completion, and nothing else happens.
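A minimal sketch of the implied fix, reusing the question's kernel and launch configuration (illustrative, not a drop-in file on its own):

```cpp
#include <cuda_runtime.h>

// Sketch: synchronize before the host exits, so the kernel actually runs
// to completion and its printf FIFO is flushed to stdout.
int main()
{
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(6, 6, 1);
    make_input_matrix_zp<<<gridDim, blockDim>>>();  // kernel from the question
    cudaDeviceSynchronize();  // waits for the kernel and flushes device printf
    return 0;
}
```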
I appear to have coded a class that travels backwards in time. Allow me to explain:
I have a function, OrthogonalCamera::project(), that sets a matrix to a certain value. I then print out the value of that matrix, as such.
cam.project();
std::cout << "My Projection Matrix: " << std::endl << ProjectionMatrix::getMatrix() << std::endl;
cam.project() pushes a matrix onto ProjectionMatrix's stack (I am using the std::stack container), and ProjectionMatrix::getMatrix() just returns the stack's top element. If I run just this code, I get the following output:
2 0 0 0
0 7.7957 0 0
0 0 -0.001 0
-1 -1 -0.998 1
But if I run the code with these two lines after the std::cout call
float *foo = new float[16];
Mat4 fooMatrix = foo;
Then I get this output:
2 0 0 0
0 -2 0 0
0 0 -0.001 0
-1 1 -0.998 1
My question is the following: what could I possibly be doing such that code executed after I print a value changes the value being printed?
Some of the functions I'm using:
static void load(Mat4 &set)
{
    if (ProjectionMatrix::matrices.size() > 0)
        ProjectionMatrix::matrices.pop();
    ProjectionMatrix::matrices.push(set);
}

static Mat4 &getMatrix()
{
    return ProjectionMatrix::matrices.top();
}
and
void OrthogonalCamera::project()
{
    Mat4 orthProjection = { { 2.0f / (this->r - this->l), 0, 0, -1 * ((this->r + this->l) / (this->r - this->l)) },
                            { 0, 2.0f / (this->t - this->b), 0, -1 * ((this->t + this->b) / (this->t - this->b)) },
                            { 0, 0, -2.0f / (this->farClip - this->nearClip), -1 * ((this->farClip + this->nearClip) / (this->farClip - this->nearClip)) },
                            { 0, 0, 0, 1 } }; // this is apparently the projection matrix for an orthographic projection
    orthProjection = orthProjection.transpose();
    ProjectionMatrix::load(orthProjection);
}
EDIT: whoever formatted my code, thank you. I'm not really too good with the formatting here, and it looks much nicer now :)
FURTHER EDIT: I have verified that the initialization of fooMatrix is running after I call std::cout.
UPTEENTH EDIT: Here is the function that initializes fooMatrix:
typedef Matrix<float, 4, 4> Mat4;

template<typename T, unsigned int rows, unsigned int cols>
Matrix<T, rows, cols>::Matrix(T *set)
{
    this->matrixData = new T*[rows];
    for (unsigned int i = 0; i < rows; i++)
    {
        this->matrixData[i] = new T[cols];
    }

    unsigned int counter = 0; // because I was too lazy to use set[(i * cols) + j]
    for (unsigned int i = 0; i < rows; i++)
    {
        for (unsigned int j = 0; j < cols; j++)
        {
            this->matrixData[i][j] = set[counter];
            counter++;
        }
    }
}
g64th EDIT: This isn't just an output problem. I actually use the value of the matrix elsewhere, and its value aligns with the described behaviours (whether or not I print it).
TREE 3rd EDIT: Running it through the debugger gave me a yet again different value:
-7.559 0 0 0
0 -2 0 0
0 0 -0.001 0
1 1 -0.998 1
a(g64, g64)th EDIT: the problem does not exist compiling on linux. Just on Windows with MinGW. Could it be a compiler bug? That would make me sad.
FINAL EDIT: It works now. I don't know what I did, but it works. I've made sure I was using an up-to-date build that didn't have the code that ensures causality still functions, and it works. Thank you for helping me figure this out, Stack Overflow community. As always, you've been helpful and tolerant of my slowness. I'll be hypervigilant for any undefined behaviour or pointer screw-ups that can cause this unpredictability.
You're not writing your program instruction by instruction. You are describing its behavior to a C++ compiler, which then tries to express the same in machine code.
The compiler is allowed to reorder your code, as long as the observable behavior does not change.
In other words, the compiler is almost certainly reordering your code. So why does the observable behavior change?
Because your code exhibits undefined behavior.
Again, you are writing C++ code. C++ is a standard, a specification saying what the "meaning" of your code is. You're working under a contract that "As long as I, the programmer, write code that can be interpreted according to the C++ standard, then you, the compiler, will generate an executable whose behavior matches that of my source code".
If your code does anything not specified in this standard, then you have violated this contract. You have fed the compiler code whose behavior can not be interpreted according to the C++ standard. And then all bets are off. The compiler trusted you. It believed that you would fulfill the contract. It analyzed your code and generated an executable based on the assumption that you would write code that had a well-defined meaning. You did not, so the compiler was working under a false assumption. And then anything it builds on top of that assumption is also invalid.
Garbage in, garbage out. :)
Sadly, there's no easy way to pinpoint the error. You can carefully study every piece of your code, or you can try stepping through the offending code in the debugger. Or break into the debugger at the point where the "wrong" value is seen, and study the disassembly and how you got there.
It's a pain, but that's undefined behavior for you. :)
Analysis tools may help: Valgrind on Linux and, depending on your version of Visual Studio, the /analyze switch; Clang has a similar option built in.
What is your compiler? If you are compiling with gcc, try turning on thorough and verbose warnings. If you are using Visual Studio, set your warnings to /W4 and treat all warnings as errors.
Once you have done that and the code still compiles and the bug still exists, run the program through Valgrind. It is likely that at some earlier point in your program you read past the end of some array and then write something; that write is clobbering what you're trying to print. When you put more things on the stack, reading past the end of the array lands you in a completely different location in memory, so you overwrite something else instead. Valgrind was made to catch exactly that kind of thing.
Using a Radeon 3870HD, I've run into strange behavior with AMD's drivers (updated yesterday to the newest version available).
First of all, I'd like to note that the whole code runs without problems on an NVIDIA GeForce 540M, and that glGetUniformLocation isn't failing all the time.
My problem is that glGetUniformLocation returns strange values for one of the shader programs used in my app, while the other shader program doesn't have this flaw. I switch between these shaders every frame, so I'm sure the behavior isn't temporary and is tied to that one shader. By strange values I mean something like 17-26, while I have only 9 uniforms present. My shader interface then queries the type of the GLSL variable using the value just obtained, and as a side effect also queries its name. For all of those 17-26 locations, the returned name was empty and so was the type. I then had the idea to debug into the interface (a separate library) and change those values to what I'd expect: 0-8. Using the debugger, I changed them, and indeed they returned the proper variable names in that shader, and the types were correct as well.
My question is: how can code that always works on NVIDIA, and works for the other shader on the Radeon, fail for another shader that is treated exactly the same way?
I include related part of interface for this:
// this fails to return the correct value
m_location = glGetUniformLocation(m_program.getGlID(), m_name.c_str());
printGLError();
if (m_location == -1) {
    std::cerr << "ERROR: Uniform " << m_name << " doesn't exist in program" << std::endl;
    return FAILURE;
}

GLsizei charSize = m_name.size() + 1, size = 0, length = 0;
GLenum type = 0;
GLchar* name = new GLchar[charSize];
name[charSize - 1] = '\0';
glGetActiveUniform(m_program.getGlID(), m_location, charSize, &length, &size, &type, name);
delete[] name; name = 0;

if (!TypeResolver::resolve(type, m_type))
    return FAILURE;

m_prepared = true;
m_applied = false;
The index you pass to glGetActiveUniform is not supposed to be a uniform location. Uniform locations are only used with glUniform calls; nothing else.
The index you pass to glGetActiveUniform is just an index between 0 and the value returned by glGetProgramiv with GL_ACTIVE_UNIFORMS. It is used to ask which uniforms exist and to inspect the properties of those uniforms.
Your code works on NVIDIA only because you got lucky. The OpenGL specification doesn't guarantee that the order of uniform locations is the same as the order of active uniform indices. AMD's drivers don't work that way, so your code doesn't work.
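To make the distinction concrete, here is a hedged sketch of the intended pattern (variable names are illustrative; it assumes a valid, linked `program` and a current context):

```cpp
// Sketch: enumerate uniforms by *index*, then look up each *location* by name.
GLint count = 0;
glGetProgramiv(program, GL_ACTIVE_UNIFORMS, &count);
for (GLint i = 0; i < count; ++i) {            // i is an active-uniform index
    GLchar name[256];
    GLsizei length = 0;
    GLint size = 0;
    GLenum type = 0;
    glGetActiveUniform(program, i, sizeof(name), &length, &size, &type, name);
    GLint location = glGetUniformLocation(program, name);  // use with glUniform*
    // location may differ from i; never assume the two orderings match.
}
```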