The bounty expires in 7 days. Answers to this question are eligible for a +50 reputation bounty.
Paul Aner is looking for a canonical answer:
I think the reason for this question is clear: I want the main-loop to NOT lock while a compute shader is processing larger amounts of data. I could try and seperate the data into smaller snippets, but if the computations were done on CPU, I would simply start a thread and everything would run nice and smoothly. Altough I of course would have to wait until the calculation-thread delivers new data to update the screen - the GUI (ImGUI) would not lock up...
I have written a program that does some calculations on a compute shader and the returned data is then being displayed. This works perfectly, except that the program execution is blocked while the shader is running (see code below) and depending on the parameters, this can take a while:
void CalculateSomething(GLfloat* Result)
{
// load some uniform variables
glDispatchCompute(X, Y, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
GLfloat* mapped = (GLfloat*)(glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY));
memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}
void main
{
// Initialization stuff
// ...
while (glfwWindowShouldClose(Window) == 0)
{
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
glfwPollEvents();
glfwSwapInterval(2); // Doesn't matter what I put here
CalculatateSomething(Result);
Render(Result);
glfwSwapBuffers(Window.WindowHandle);
}
}
To keep the main loop running while the compute shader is calculating, I changed CalculateSomething to something like this:
void CalculateSomething(GLfloat* Result)
{
// load some uniform variables
glDispatchCompute(X, Y, 1);
GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
bool GPU_busy()
{
GLint GPU_status;
if (GPU_sync == NULL)
return false;
else
{
glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &GPU_status);
return GPU_status == GL_UNSIGNALED;
}
}
These two functions are part of a class and it would get a little messy and complicated if I had to post all that here (if more code is needed, tell me). So every loop when the class is told to do the computation, it first checks, if the GPU is busy. If it's done, the result is copied to CPU-memory (or a calculation is started), else it returns to main without doing anything else. Anyway, this approach works in that it produces the right result. But my main loop is still blocked.
Doing some timing revealed that CalculateSomething, Render (and everything else) runs fast (as I would expect them to do). But now glfwSwapBuffers takes >3000ms (depending on how long the calculations of the compute shader take).
Shouldn't it be possible to switch buffers while a compute shader is running? Rendering the result seems to work fine and without delay (as long as the compute shader is not done yet, the old result should get rendered). Or am I missing something here (queued OpenGL calls get processed before glfwSwapBuffers does something?)?
Edit:
I'm not sure why this question got closed and what additional information is needed (maybe other than the OS, which would be Windows). As for "desired behavior": Well - I'd like the glfwSwapBuffers-call not to block my main loop. For additional information, please ask...
As pointed out by Erdal Küçük an implicit call of glFlush might cause latency. I did put this call before glfwSwapBuffer for testing purposes and timed it - no latency here...
I'm sure, I can't be the only one who ever ran into this problem. Maybe someone could try and reproduce it? Simply put a compute shader in the main-loop that takes a few seconds to do it's calculations. I have read somewhere that similar problems occur escpecially when calling glMapBuffer. This seems to be an issue with the GPU-driver (mine would be an integrated Intel-GPU). But nowhere have I read about latencies above 200ms...
Solved a similar issue with GL_PIXEL_PACK_BUFFER effectively used as an offscreen compute shader. The approach with fences is correct, but you then need to have a separate function that checks the status of the fence using glGetSynciv to read the GL_SYNC_STATUS. The solution (admittedly in Java) can be found here.
An explanation for why this is necessary can be found in: in #Nick Clark's comment answer:
Every call in OpenGL is asynchronous, except for the frame buffer swap, which stalls the calling thread until all submitted functions have been executed. Thus, the reason why glfwSwapBuffers seems to take so long.
The relevant portion from the solution is:
public void finishHMRead( int pboIndex ){
int[] length = new int[1];
int[] status = new int[1];
GLES30.glGetSynciv( hmReadFences[ pboIndex ], GLES30.GL_SYNC_STATUS, 1, length, 0, status, 0 );
int signalStatus = status[0];
int glSignaled = GLES30.GL_SIGNALED;
if( signalStatus == glSignaled ){
// Ready a temporary ByteBuffer for mapping (we'll unmap the pixel buffer and lose this) and a permanent ByteBuffer
ByteBuffer pixelBuffer;
texLayerByteBuffers[ pboIndex ] = ByteBuffer.allocate( texWH * texWH );
// map data to a bytebuffer
GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, pbos[ pboIndex ] );
pixelBuffer = ( ByteBuffer ) GLES30.glMapBufferRange( GLES30.GL_PIXEL_PACK_BUFFER, 0, texWH * texWH * 1, GLES30.GL_MAP_READ_BIT );
// Copy to the long term ByteBuffer
pixelBuffer.rewind(); //copy from the beginning
texLayerByteBuffers[ pboIndex ].put( pixelBuffer );
// Unmap and unbind the currently bound pixel buffer
GLES30.glUnmapBuffer( GLES30.GL_PIXEL_PACK_BUFFER );
GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, 0 );
Log.i( "myTag", "Finished copy for pbo data for " + pboIndex + " at: " + (System.currentTimeMillis() - initSphereStart) );
acknowledgeHMReadComplete();
} else {
// If it wasn't done, resubmit for another check in the next render update cycle
RefMethodwArgs finishHmRead = new RefMethodwArgs( this, "finishHMRead", new Object[]{ pboIndex } );
UpdateList.getRef().addRenderUpdate( finishHmRead );
}
}
Basically, fire off the computer shader, then wait for the glGetSynciv check of GL_SYNC_STATUS to equal GL_SIGNALED, then rebind the GL_SHADER_STORAGE_BUFFER and perform the glMapBuffer operation.
Related
I’m attempting to write a slightly simple compute shader that does a simple moving average.
It is my first shader where I had to test DTid.x for certain conditions related to logic.
The shader works, the moving average is calculated as expected, except (ugh), for the case of DTid.x = 0 where I get a bad result.
It seems my testing of value DTid.x is somehow corrupted or not possible for case DTid.x = 0
I may be missing some fundamental understanding how compute shaders work as this piece of code seems super simple but it doesn't work as I'd expect it to.
Hopefully someone can tell me why this code doesn't work for case DTid.x = 0
For example, I simplified the shader to...
[numthreads(1024, 1, 1)]
void CSSimpleMovingAvgDX(uint3 DTid : SV_DispatchThreadID)
{
// I added below trying to limit the logic?
// I initially had it check for a range like >50 and <100 and this did work as expected.
// But I saw that my value at DTid.x = 0 was corrupted and I started to work on solving why. But no luck.
// It is just the case of DTid.x = 0 where this shader does not work.
if (DTid.x > 0)
{
return;
}
nAvgCnt = 1;
ft0 = asfloat(BufferY0.Load(DTid.x * 4)); // load data at actual DTid.x location
if (DTid.x > 0) // to avoid loading a second value for averaging
{
// somehow this code is still being called for case DTid.x = 0 ?
nAvgCnt = nAvgCnt + 1;
ft1 = asfloat(BufferY0.Load((DTid.x - 1) * 4)); // load data value at previous DTid.x location
}
if (nAvgCnt > 1) // If DTid.X was larger than 0, then we should have loaded ft1 and we can avereage ft0 and ft1
{
result = ((ft0 + ft1) / ((float)nAvgCnt));
}
else
{
result = ft0;
}
// And when I add code below, which should override above code, the result is still corrupted? //
if (DTid.x < 2)
result = ft0;
llByteOffsetLS = ((DTid.x) * dwStrideSomeBuffer);
BufferOut0.Store(llByteOffsetLS, asuint(result)); // store result, where all good except for case DTid.x = 0
}
I am compiling the shader with FXC. My shader was slightly more involved than above, I added the /Od option and the code behaved as expected. Without the /Od option I tried to refactor the code over and over with no luck but eventually I changed variable names for every possible section to make sure the compiler would treat them separately and eventually success. So, the lesson I learned is never reuse a variable in any way. Another solution, worse case, would be to decompile the compiled shader to understand how it was optimized. If attempting a large shader with several conditions/branches, I'd start with /Od and then eventually remove, and do not reuse variables, else you may start chasing problems that are not truly problems.
I'm working on an OpenFX plugin to process images in grading/post-production software.
All my processing is done in a series of Metal kernel functions. The image is sent to the GPU as buffers (float array), one for the input and one for the output.
The output is then used by the OpenFX framework for display inside the host application, so up till then I didn't have to take care of it.
I now need to be able to read the output values once the GPU has processed the commands. I have tried to use the "contents" method applied on the buffer but my plugin keeps crashing (in the worst case), or gives me very weird values when it "works" (I'm not supposed to have anything over 1 and under 0, but I get very large numbers, 0 or negative 0, nan... So I assume I have a memory access issue of sorts).
At first I thought it was an issue with Private/Shared memory, so I tried to modify the buffer to be shared. But I'm still struggling!
Full disclosure: I have no specific training in MSL, I'm learning as I go with this project so I might be doing and-or saying very stupid things. I have looked around for hours before deciding to ask for help. Thanks to all who will help out in any way!
Below is the code (without everything that doesn't concern my current issue). If it is lacking anything of interest please let me know.
id < MTLBuffer > srcDeviceBuf = reinterpret_cast<id<MTLBuffer> >(const_cast<float*>(p_Input)) ;
//Below is the destination Image buffer creation the way it used to be done before my edits
//id < MTLBuffer > dstDeviceBuf = reinterpret_cast<id<MTLBuffer> >(p_Output);
//My attempt at creating a Shared memory buffer
MTLResourceOptions bufferOptions = MTLResourceStorageModeShared;
int bufferLength = sizeof(float)*1920*1080*4;
id <MTLBuffer> dstDeviceBuf = [device newBufferWithBytes:p_Output length:bufferLength options:bufferOptions];
id<MTLCommandBuffer> commandBuffer = [queue commandBuffer];
commandBuffer.label = [NSString stringWithFormat:#"RunMetalKernel"];
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
//First method to be computed
[computeEncoder setComputePipelineState:_initModule];
int exeWidth = [_initModule threadExecutionWidth];
MTLSize threadGroupCount = MTLSizeMake(exeWidth, 1, 1);
MTLSize threadGroups = MTLSizeMake((p_Width + exeWidth - 1) / exeWidth,
p_Height, 1);
[computeEncoder setBuffer:srcDeviceBuf offset: 0 atIndex: 0];
[computeEncoder setBuffer:dstDeviceBuf offset: 0 atIndex: 8];
//encodes first module to be executed
[computeEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup: threadGroupCount];
//Modules encoding
if (p_lutexport_on) {
//Fills the image with patch values for the LUT computation
[computeEncoder setComputePipelineState:_LUTExportModule];
[computeEncoder dispatchThreadgroups:threadGroups threadsPerThreadgroup: threadGroupCount];
}
[computeEncoder endEncoding];
[commandBuffer commit];
if (p_lutexport_on) {
//Here is where I try to read the buffer values (and inserts them into a custom object "p_lut_exp_lut"
float* result = static_cast<float*>([dstDeviceBuf contents]);
//Retrieve the output values and populate the LUT with them
int lutLine = 0;
float3 out;
for (int index(0); index < 35937 * 4; index += 4) {
out.x = result[index];
out.y = result[index + 1];
out.z = result[index + 2];
p_lutexp_lut->setValuesAtLine(lutLine, out);
lutLine++;
}
p_lutexp_lut->toFile();
}
If a command buffer includes write or read operations on a given MTLBuffer, you must ensure that these operations complete before reading the buffers contents. You can use the addCompletedHandler: method, waitUntilCompleted method, or custom semaphores to signal that a command buffer has completed execution.
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
/* read or write buffer here */
}];
[commandBuffer commit];
My issue is that my function does the job so quickly that i don't see the progress in the screen, here is my display function:
void showMaze(const Maze::Maze &m){
glPushMatrix();
glTranslatef(-1,1, 0);
for(int i=0;i<m.num_Y;i++)
{
for(int j=0;j<m.num_X;j++)
{
char c = m.maze[i][j];
if(c=='1'){ glColor3f(255,255,0);
glRectf(0.05, -0.05, -0.05,0.05);
}
if(c=='0'){ glColor3f(20,60,60);
glRectf(0.05, -0.05, -0.05,0.05);
}
glTranslatef(0.1,0, 0);
}
glTranslatef(-(m.num_X*0.1),-0.1, 0);
}
glPopMatrix();
}
The recursive function:
bool findPath(Maze* M, char m){
showMaze(*M);
glutPostRedisplay();
if(M->isFinish())
return true;
if (m!='s' && M->north()){
update(M, 'n');
if(isNew(M) && findPath(M, 'n') && M->isFinish()){
return true;
}
else{
M->south();
}
}
// .....
else{
return false;
}
}
void render()
{
glClear( GL_COLOR_BUFFER_BIT );
findPath(N,'z');
glutSwapBuffers();
}
In main:
glutDisplayFunc( render );
so my question is how do i get to wait few seconds ( so that i can see the progress ) whenever findpath is called, i've tried glutimeelapsed, sleep,gettickcount but none of do the job correctly.
Edit1:
when i put something like sleep() right after calling showMaze(), nothing is displayed for few seconds, then the final screen shows, am not an expert in c++, but i suppose showMaze() should be executed first, then the sleep() function, or c++ wait to execute the whole function to display results?
Edit2:
i found a solution for the problem, i took X and Y o the maze whenever they change, and stored the in two vectors, and in my drawing function i display fisrt the empty maze, then i slowly add the changed X and Y.
Nothing will be visible on the screen unless you show what you are rendering by swapping buffers (unless you're rendering to the front buffer, but that's iffy). So you can sleep for however long you like in the middle of your recursive function, but you're not going to see anything until you exit that callstack. And by then, you've drawn over everything.
So instead of merely sleeping, you need to use glutSwapBuffers in your recursive call stack when you want something to be visible.
However, this is a bad idea. The render call should not do things like sleep. The system needs your render to actually render, because the screen may need to be updated for a variety of reasons (another window revealing more of your window, etc). You should not have your render loop suspend itself like this.
Instead, what you ought to be doing is executing one segment per render loop execution, relying on glutPostRedisplay to make sure the loop keeps getting called. And you should either base your animation on the time delta between loop executions, or you should use a sleep timer to make sure that at least X time always passes between cycles.
So you need to unroll your recursive call into an iterative process.
Alternatively, if you have access to Boost.Context or Boost.Coroutine, you could use that to handle things. That way, you could keep your recursive structure. When you have rendered everything you want to display, you simply suspending your coroutine back to render, who will swap buffers and return.
I am trying to implement a Binary Search in a compute shader with HLSL. It's not a classic Binary Search as the search key as well as the array values are float. If there is no matching array value to the search key, the search is supposed to return the last index (minIdx and maxIdx match at this point). This is the worst case for classic Binary Search as it takes the maximum number of operations, I am aware of this.
So here's my problem:
My implementation looks like this:
uint BinarySearch (Texture2D<float> InputTexture, float key, uint minIdx, uint maxIdx)
{
uint midIdx = 0;
while (minIdx <= maxIdx)
{
midIdx = minIdx + ((maxIdx + 1 - minIdx) / 2);
if (InputTexture[uint2(midIdx, 0)] == key)
{
// this might be a very rare case
return midIdx;
}
// determine which subarray to search next
else if (InputTexture[uint2(midIdx, 0)] < key)
{
// as we have a decreasingly sorted array, we need to change the
// max index here instead of the min
maxIdx = midIdx - 1;
}
else if (InputTexture[uint2(midIdx, 0)] > key)
{
minIdx = midIdx;
}
}
return minIdx;
}
This leads to my video driver crashing on program execution. I don't get a compile error.
However, if I use an if instead of the while I can execute it and the first iteration works as expected.
I already did a couple of searches and I suspect this might have to do something with dynamic looping in a compute shader. But I have no prior experience with compute shaders and only little experience with HLSL as well, which is why I feel kind of lost.
I am compiling this with cs_5_0.
Could anyone please explain what I am doing wrong or at least hint me to some documentation/explanation? Anything that can get me started on solving and understanding this would be super-appreciated!
DirectCompute shaders are still subject to the Timeout Detection & Recovery (TDR) behavior in the drivers. This basically means if your shader takes more than 2 seconds, the driver assumes the GPU has hung and resets it. This can be challenging with DirectCompute where you intentionally want the shader to run a long while (much longer than rendering usually would). In this case it may be a bug, but it's something to be aware of.
With Windows 8.0 or later, you can allow long-running shaders by using D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT when you create the device. This will, however, apply to all shaders not just DirectCompute so you should be careful about using this generally.
For special-purpose systems, you can also use registry keys to disable TDRs.
A common usage of glMapBuffer is
previousPBO.render();
bindNextPBO();
GLubyte* src = (GLubyte*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if(src) {
doSomeWork(src);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
There are 2 PBO working in alternative way.
However, sometimes, the doSomeWork(.) may be in another thread. If the code above applied,
current thread must wait for doSomeWork() to finish. Another solution is:
previousPBO.render();
bindNextPBO();
if(currentPBO.mapped) {
currentPBO.mapped = false;
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
GLubyte* src = (GLubyte*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if(src) {
currentPBO.mapped = true;
doSomeWork(src);
}
In this case, the map->unmap procedure of the same PBO spans two frames.
Does the unmapped state of PBO stall the rendering of GPU?
Any bad influence on performance?
The second code snippet is just WRONG. What if doSomeWork didn't finish in time? Now it is accessing memory that you unmapped, and will cause an access violation (or worse).
A better approach is to wait for completion of the other thread and unmap the buffer at the end of the frame just before SwapBuffers.