While loop in compute shader is crashing my video card driver


I am trying to implement a Binary Search in a compute shader with HLSL. It's not a classic Binary Search as the search key as well as the array values are float. If there is no matching array value to the search key, the search is supposed to return the last index (minIdx and maxIdx match at this point). This is the worst case for classic Binary Search as it takes the maximum number of operations, I am aware of this.
So here's my problem:
My implementation looks like this:
uint BinarySearch (Texture2D<float> InputTexture, float key, uint minIdx, uint maxIdx)
{
    uint midIdx = 0;
    while (minIdx <= maxIdx)
    {
        midIdx = minIdx + ((maxIdx + 1 - minIdx) / 2);
        if (InputTexture[uint2(midIdx, 0)] == key)
        {
            // this might be a very rare case
            return midIdx;
        }
        // determine which subarray to search next
        else if (InputTexture[uint2(midIdx, 0)] < key)
        {
            // as we have a decreasingly sorted array, we need to change the
            // max index here instead of the min
            maxIdx = midIdx - 1;
        }
        else if (InputTexture[uint2(midIdx, 0)] > key)
        {
            minIdx = midIdx;
        }
    }
    return minIdx;
}
This leads to my video driver crashing on program execution. I don't get a compile error.
However, if I use an if instead of the while I can execute it and the first iteration works as expected.
I already did a couple of searches and I suspect this might have to do something with dynamic looping in a compute shader. But I have no prior experience with compute shaders and only little experience with HLSL as well, which is why I feel kind of lost.
I am compiling this with cs_5_0.
Could anyone please explain what I am doing wrong or at least hint me to some documentation/explanation? Anything that can get me started on solving and understanding this would be super-appreciated!

DirectCompute shaders are still subject to the Timeout Detection & Recovery (TDR) behavior in the drivers. This basically means that if your shader takes more than 2 seconds (the default timeout), the driver assumes the GPU has hung and resets it. This can be challenging with DirectCompute, where you may intentionally want the shader to run for a long while (much longer than rendering usually would). In this case it may be a bug, but it's something to be aware of.
With Windows 8.0 or later, you can allow long-running shaders by using D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT when you create the device. This will, however, apply to all shaders, not just DirectCompute, so you should be careful about using this generally.
For special-purpose systems, you can also use registry keys to disable TDRs.
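As for the possible bug: one reading of the posted loop is that it cannot terminate once minIdx == maxIdx and the texel is greater than the key, because midIdx then equals minIdx and `minIdx = midIdx` changes nothing. A never-ending loop would trip the TDR exactly as described above. Separately, `maxIdx = midIdx - 1` wraps around when midIdx is 0. The following is a minimal CPU-side sketch in C++ of a variant that always shrinks the range. The function name and the use of std::vector in place of the texture are assumptions made for illustration, not the asker's code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU-side model of the search: `values` is sorted in decreasing order.
// Returns the index of an exact match, or the index at which the search
// range collapses if there is no exact match.
std::uint32_t binarySearchDescending(const std::vector<float>& values, float key)
{
    assert(!values.empty());
    std::uint32_t minIdx = 0;
    std::uint32_t maxIdx = static_cast<std::uint32_t>(values.size() - 1);

    // Strict '<': once minIdx == maxIdx there is nothing left to split.
    while (minIdx < maxIdx)
    {
        // Upper middle, so midIdx > minIdx whenever minIdx < maxIdx.
        const std::uint32_t midIdx = minIdx + (maxIdx + 1 - minIdx) / 2;
        const float value = values[midIdx];

        if (value == key)
            return midIdx;
        if (value < key)
            maxIdx = midIdx - 1;  // midIdx >= 1 here, so no unsigned wrap-around
        else
            minIdx = midIdx;      // range still shrinks because midIdx > minIdx
    }
    return minIdx;
}
```

The same two changes (a strict `<` in the loop condition, with the upper-middle index kept as is) should carry over to the HLSL version, since every iteration then either returns or narrows the range.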

Related

glfwSwapBuffers slow (>3s)

Paul Aner is looking for a canonical answer:
I think the reason for this question is clear: I want the main loop NOT to lock up while a compute shader is processing larger amounts of data. I could try to separate the data into smaller chunks, but if the computations were done on the CPU, I would simply start a thread and everything would run nice and smoothly. Although I would of course have to wait until the calculation thread delivers new data to update the screen, the GUI (ImGUI) would not lock up...
I have written a program that does some calculations on a compute shader and the returned data is then being displayed. This works perfectly, except that the program execution is blocked while the shader is running (see code below) and depending on the parameters, this can take a while:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    GLfloat* mapped = (GLfloat*)(glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY));
    memcpy(Result, mapped, sizeof(GLfloat) * X * Y);
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}
int main()
{
    // Initialization stuff
    // ...
    while (glfwWindowShouldClose(Window) == 0)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glfwPollEvents();
        glfwSwapInterval(2); // Doesn't matter what I put here
        CalculateSomething(Result);
        Render(Result);
        glfwSwapBuffers(Window.WindowHandle);
    }
}
To keep the main loop running while the compute shader is calculating, I changed CalculateSomething to something like this:
void CalculateSomething(GLfloat* Result)
{
    // load some uniform variables
    glDispatchCompute(X, Y, 1);
    GPU_sync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
bool GPU_busy()
{
    GLint GPU_status;
    if (GPU_sync == NULL)
        return false;
    else
    {
        glGetSynciv(GPU_sync, GL_SYNC_STATUS, 1, nullptr, &GPU_status);
        return GPU_status == GL_UNSIGNALED;
    }
}
These two functions are part of a class and it would get a little messy and complicated if I had to post all that here (if more code is needed, tell me). So every loop when the class is told to do the computation, it first checks, if the GPU is busy. If it's done, the result is copied to CPU-memory (or a calculation is started), else it returns to main without doing anything else. Anyway, this approach works in that it produces the right result. But my main loop is still blocked.
Doing some timing revealed that CalculateSomething, Render (and everything else) runs fast (as I would expect them to do). But now glfwSwapBuffers takes >3000ms (depending on how long the calculations of the compute shader take).
Shouldn't it be possible to switch buffers while a compute shader is running? Rendering the result seems to work fine and without delay (as long as the compute shader is not done yet, the old result should get rendered). Or am I missing something here (queued OpenGL calls get processed before glfwSwapBuffers does something?)?
Edit:
I'm not sure why this question got closed and what additional information is needed (maybe other than the OS, which would be Windows). As for "desired behavior": Well - I'd like the glfwSwapBuffers-call not to block my main loop. For additional information, please ask...
As pointed out by Erdal Küçük, an implicit call of glFlush might cause latency. I put this call before glfwSwapBuffers for testing purposes and timed it: no latency there...
I'm sure I can't be the only one who has ever run into this problem. Maybe someone could try to reproduce it? Simply put a compute shader in the main loop that takes a few seconds to do its calculations. I have read somewhere that similar problems occur especially when calling glMapBuffer. This seems to be an issue with the GPU driver (mine would be an integrated Intel GPU). But nowhere have I read about latencies above 200ms...

Solved a similar issue with GL_PIXEL_PACK_BUFFER effectively used as an offscreen compute shader. The approach with fences is correct, but you then need to have a separate function that checks the status of the fence using glGetSynciv to read the GL_SYNC_STATUS. The solution (admittedly in Java) can be found here.
An explanation for why this is necessary can be found in @Nick Clark's comment:
Every call in OpenGL is asynchronous, except for the frame buffer swap, which stalls the calling thread until all submitted functions have been executed. Thus, the reason why glfwSwapBuffers seems to take so long.
The relevant portion from the solution is:
public void finishHMRead( int pboIndex ){
    int[] length = new int[1];
    int[] status = new int[1];
    GLES30.glGetSynciv( hmReadFences[ pboIndex ], GLES30.GL_SYNC_STATUS, 1, length, 0, status, 0 );
    int signalStatus = status[0];
    int glSignaled = GLES30.GL_SIGNALED;
    if( signalStatus == glSignaled ){
        // Ready a temporary ByteBuffer for mapping (we'll unmap the pixel buffer and lose this) and a permanent ByteBuffer
        ByteBuffer pixelBuffer;
        texLayerByteBuffers[ pboIndex ] = ByteBuffer.allocate( texWH * texWH );
        // map data to a bytebuffer
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, pbos[ pboIndex ] );
        pixelBuffer = ( ByteBuffer ) GLES30.glMapBufferRange( GLES30.GL_PIXEL_PACK_BUFFER, 0, texWH * texWH * 1, GLES30.GL_MAP_READ_BIT );
        // Copy to the long term ByteBuffer
        pixelBuffer.rewind(); //copy from the beginning
        texLayerByteBuffers[ pboIndex ].put( pixelBuffer );
        // Unmap and unbind the currently bound pixel buffer
        GLES30.glUnmapBuffer( GLES30.GL_PIXEL_PACK_BUFFER );
        GLES30.glBindBuffer( GLES30.GL_PIXEL_PACK_BUFFER, 0 );
        Log.i( "myTag", "Finished copy for pbo data for " + pboIndex + " at: " + (System.currentTimeMillis() - initSphereStart) );
        acknowledgeHMReadComplete();
    } else {
        // If it wasn't done, resubmit for another check in the next render update cycle
        RefMethodwArgs finishHmRead = new RefMethodwArgs( this, "finishHMRead", new Object[]{ pboIndex } );
        UpdateList.getRef().addRenderUpdate( finishHmRead );
    }
}
Basically, fire off the compute shader, then wait for the glGetSynciv check of GL_SYNC_STATUS to equal GL_SIGNALED, then rebind the GL_SHADER_STORAGE_BUFFER and perform the glMapBuffer operation.
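The control flow of that approach can be sketched in C++ without any GL dependency. This is only an illustration of the "poll once per frame, never wait" pattern: the class name AsyncComputeJob and its three callbacks are hypothetical stand-ins. In a real program they would presumably wrap glDispatchCompute plus glFenceSync, the glGetSynciv status check, and the buffer mapping and copy.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: the three callbacks stand in for the real GL calls.
class AsyncComputeJob
{
public:
    AsyncComputeJob(std::function<void()> dispatch,
                    std::function<bool()> isSignaled,
                    std::function<std::vector<float>()> readBack)
        : dispatch_(std::move(dispatch)),
          isSignaled_(std::move(isSignaled)),
          readBack_(std::move(readBack))
    {
        assert(dispatch_ && isSignaled_ && readBack_);
    }

    // Call once per frame. Never waits: returns true only on the frame
    // in which a finished result was copied into `result`.
    bool update(std::vector<float>& result)
    {
        if (!running_)
        {
            dispatch_();          // e.g. glDispatchCompute + glFenceSync
            running_ = true;
            return false;
        }
        if (!isSignaled_())       // e.g. glGetSynciv with GL_SYNC_STATUS
            return false;         // still busy: keep rendering the old result
        result = readBack_();     // e.g. bind, map, copy, unmap
        running_ = false;
        return true;
    }

private:
    std::function<void()> dispatch_;
    std::function<bool()> isSignaled_;
    std::function<std::vector<float>()> readBack_;
    bool running_ = false;
};
```

The main loop would call update() every frame and render whatever result it currently holds. Whether the buffer swap itself still stalls while the GPU is busy is a separate matter that this sketch does not address.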

Problem testing DTid.x Direct3D ComputeShader HLSL

I’m attempting to write a fairly simple compute shader that computes a simple moving average.
It is my first shader where I had to test DTid.x for certain conditions related to logic.
The shader works, the moving average is calculated as expected, except (ugh), for the case of DTid.x = 0 where I get a bad result.
It seems my testing of value DTid.x is somehow corrupted or not possible for case DTid.x = 0
I may be missing some fundamental understanding how compute shaders work as this piece of code seems super simple but it doesn't work as I'd expect it to.
Hopefully someone can tell me why this code doesn't work for case DTid.x = 0
For example, I simplified the shader to...
[numthreads(1024, 1, 1)]
void CSSimpleMovingAvgDX(uint3 DTid : SV_DispatchThreadID)
{
    // I added below trying to limit the logic?
    // I initially had it check for a range like >50 and <100 and this did work as expected.
    // But I saw that my value at DTid.x = 0 was corrupted and I started to work on solving why. But no luck.
    // It is just the case of DTid.x = 0 where this shader does not work.
    if (DTid.x > 0)
    {
        return;
    }
    nAvgCnt = 1;
    ft0 = asfloat(BufferY0.Load(DTid.x * 4)); // load data at actual DTid.x location
    if (DTid.x > 0) // to avoid loading a second value for averaging
    {
        // somehow this code is still being called for case DTid.x = 0 ?
        nAvgCnt = nAvgCnt + 1;
        ft1 = asfloat(BufferY0.Load((DTid.x - 1) * 4)); // load data value at previous DTid.x location
    }
    if (nAvgCnt > 1) // If DTid.x was larger than 0, then we should have loaded ft1 and we can average ft0 and ft1
    {
        result = ((ft0 + ft1) / ((float)nAvgCnt));
    }
    else
    {
        result = ft0;
    }
    // And when I add code below, which should override above code, the result is still corrupted? //
    if (DTid.x < 2)
        result = ft0;
    llByteOffsetLS = ((DTid.x) * dwStrideSomeBuffer);
    BufferOut0.Store(llByteOffsetLS, asuint(result)); // store result, where all good except for case DTid.x = 0
}

I am compiling the shader with FXC. My shader was slightly more involved than the one above. When I added the /Od option (which disables optimizations), the code behaved as expected. Without /Od I refactored the code over and over with no luck, but eventually I changed the variable names in every section to make sure the compiler would treat them separately, and that finally worked. So the lesson I learned is: never reuse a variable in any way. Another solution, in the worst case, would be to decompile the compiled shader to understand how it was optimized. If attempting a large shader with several conditions/branches, I'd start with /Od and remove it later, and I would not reuse variables; otherwise you may start chasing problems that are not truly problems.

Communicating an array of bvec2 between host->shader and/or shader->shader

I need to communicate two boolean values per entry of an array in a compute shader. For now, I'm getting them from the cpu, but later I will want to generate these values from another compute shader that runs before that one. I got this working as follows:
Using glm::bvec2 I can place the booleans relatively packed into memory (the bvec stores one bool per byte. could be nicer, but will do for now, I can always manually pack this). Then, I use vkMapMemory to place the data into a Vulkan buffer (I then copy it to a device local buffer, but that's probably irrelevant here).
GLSL's bvec2 is not equivalent to that, unfortunately (or at least it won't give me the expected values if I use it, maybe I'm doing it wrong? Using bvec2 changes[] yields wrong results in the following code. I suspect an alignment mismatch). Because of that, the compute shader accesses this array as follows:
layout (binding = 2, scalar) buffer Changes
{
    uint changes[];
};
void main() {
    //uint is 4 byte, glm::bvec2 is 2 byte
    uint changeIndex = gl_GlobalInvocationID.x / 2;
    //order seems to be reversed in memory:
    //v1(x1, y1) followed by v2(x2, y2) is stored as: 0000000(y2) 0000000(x2) 0000000(y1) 0000000(x1)
    uint changeOffset = (gl_GlobalInvocationID.x % 2) * 16;
    uint maskx = 1 << (changeOffset + 0);
    uint masky = 1 << (changeOffset + 8);
    uint uchange = changes[changeIndex];
    bvec2 change = bvec2(uchange & maskx, uchange & masky);
}
This works. Took a bit of trial and error but there we go. I have two questions now:
Is there a more elegant way to do this?
When generating the values via compute shaders, I would not be using glm::bvec2. Should I perhaps just manually pack the booleans - one per bit - into uints, or is there a better way?
Performance is pretty important to me in this application, as I'm trying to benchmark things. Memory usage optimizations are secondary, but also worth considering. Being relatively inexperienced with optimizing GLSL, I'm happy about any advice you can give me.

Since glm::bvec2 stores its two booleans as one byte each (two bytes in total), perhaps the explicitly 8-bit unsigned integer vector type u8vec2 provided by the GL_EXT_shader_8bit_storage extension would be more convenient here? I don't know if the Vulkan driver you're using will support the necessary feature (I assume it's storageBuffer8BitAccess), though.
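The "reversed" order noted in the question's shader comment is consistent with plain little-endian byte order: the first byte in memory becomes the least significant byte of the uint. A GLSL bool inside a buffer block is also typically laid out as a 32-bit value, which would explain why declaring the array as bvec2 did not match the two-byte host layout. The C++ sketch below models this under two assumptions: each glm::bvec2 is two one-byte bools with no padding (as the question describes), and the load is little-endian. The names wordFromBools, Change and extractChange are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Builds the 32-bit word that a little-endian load sees for four
// consecutive one-byte bools in memory: v1.x, v1.y, v2.x, v2.y.
std::uint32_t wordFromBools(bool x1, bool y1, bool x2, bool y2)
{
    return  static_cast<std::uint32_t>(x1)
         | (static_cast<std::uint32_t>(y1) << 8)
         | (static_cast<std::uint32_t>(x2) << 16)
         | (static_cast<std::uint32_t>(y2) << 24);
}

struct Change
{
    bool x;
    bool y;
};

// Mirrors the shader's masks: `index` (0 or 1) selects which of the two
// bool pairs stored in one word is extracted.
Change extractChange(std::uint32_t word, std::uint32_t index)
{
    assert(index < 2);
    const std::uint32_t offset = index * 16;
    return { (word & (1u << offset)) != 0,
             (word & (1u << (offset + 8))) != 0 };
}
```

For example, the pairs (true, false) and (false, true) produce the word 0x01000001, and extractChange recovers both pairs with the same offsets the shader uses.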

The comment by Andrea mentions a useful extension: GL_EXT_shader_8bit_storage.
As the writing access to my booleans is done in parallel, the only option for tight packing is atomics. I've chosen to trade memory efficiency for performance by storing two booleans in one byte, "wasting" 6 bits. The code is as follows:
#extension GL_EXT_shader_8bit_storage : enable
void getBools(in uint data, out bool split, out bool merge) {
    split = (data & 1) > 0;
    merge = (data & 2) > 0;
}
uint setBools(in bool split, in bool merge) {
    uint result = 0;
    if (split) result = result | 1;
    if (merge) result = result | 2;
    return result;
}
//usage:
layout (binding = 4, scalar) buffer ChangesBuffer
{
    uint8_t changes[];
};
//[...]
bool split, merge;
getBools(uint(changes[invocationIdx]), split, merge);
//[...]
changes[idx] = uint8_t(setBools(split, merge));
Note the constructors: the data types provided by the extension do not support any arithmetic operations and must be converted before use.
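If the host ever needs to write or read the same buffer, it has to use the same bit assignment. The following is a minimal C++ sketch of a host-side counterpart, assuming one byte per entry with bit 0 for split and bit 1 for merge as in the shader helpers above. The names packBools, unpackBools, kSplitBit and kMergeBit are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Same bit assignment as the shader helpers: bit 0 = split, bit 1 = merge.
constexpr std::uint8_t kSplitBit = 1u << 0;
constexpr std::uint8_t kMergeBit = 1u << 1;

// Packs both flags into one byte; the upper six bits stay zero.
std::uint8_t packBools(bool split, bool merge)
{
    std::uint8_t result = 0;
    if (split) result |= kSplitBit;
    if (merge) result |= kMergeBit;
    return result;
}

// Inverse of packBools; any bits other than 0 and 1 are ignored.
void unpackBools(std::uint8_t data, bool& split, bool& merge)
{
    split = (data & kSplitBit) != 0;
    merge = (data & kMergeBit) != 0;
}
```

A std::vector<std::uint8_t> filled with packBools results could then be copied into the buffer byte for byte, since each entry is a single byte and no reordering is involved.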

How to optimise large loops for debug mode

I have implemented a pixel mask class used for pixel-perfect collision checks. I am using SFML, so the implementation is fairly straightforward:
Loop through each pixel of the image and decide whether it's true or false based on its transparency value. Here is the code I have used:
// Create an Image from the given texture
sf::Image image(texture.copyToImage());
// measure the time this function takes
sf::Clock clock;
sf::Time time = sf::Time::Zero;
clock.restart();
// Reserve memory for the pixelMask vector to avoid repeated allocation
pixelMask.reserve(image.getSize().x);
// Loop through every pixel of the texture
for (unsigned int i = 0; i < image.getSize().x; i++)
{
    // Create the mask for one line
    std::vector<bool> tempMask;
    // Reserve memory for the tempMask vector to avoid repeated allocation
    tempMask.reserve(image.getSize().y);
    for (unsigned int j = 0; j < image.getSize().y; j++)
    {
        // If the pixel is not transparent
        if (image.getPixel(i, j).a > 0)
            // Some part of the texture is there --> push back true
            tempMask.push_back(true);
        else
            // The user can't see this part of the texture --> push back false
            tempMask.push_back(false);
    }
    pixelMask.push_back(tempMask);
}
time = clock.restart();
std::cout << std::endl << "The creation of the pixel mask took: " << time.asMicroseconds() << " microseconds (" << time.asSeconds() << ")";
I have used an instance of sf::Clock to measure the time.
My problem is that this function takes ages (e.g. 15 seconds) for larger images (e.g. 1280x720). Interestingly, this happens only in debug mode. When compiling the release version, the same texture/image takes only 0.1 seconds or less.
I have tried to reduce memory allocations by using the resize() method, but it didn't change much. I know that looping through almost 1 million pixels is slow, but it should not be 15 seconds slow, should it?
Since I want to test my code in debug mode (for obvious reasons) and I don't want to wait 5 minutes until all the pixel masks have been created, what I am looking for is basically a way to:
Either optimise the code (have I missed something obvious?)
Or get something similar to the release performance in debug mode
Thanks for your help!

Optimizing For Debug
Optimizing for debug builds is generally a very counter-productive idea. It could even have you optimize for debug in a way that not only makes maintaining code more difficult, but may even slow down release builds. Debug builds in general are going to be much slower to run. Even with the flattest kind of C code I write which doesn't pose much for an optimizer to do beyond reasonable register allocation and instruction selection, it's normal for the debug build to take 20 times longer to finish an operation. That's just something to accept rather than change too much.
That said, I can understand the temptation to do so at times. Sometimes you want to debug a certain part of the code, only for the other operations in the software to take ages, requiring you to wait a long time before you can even get to the code you are interested in tracing through. I find in those cases that it's helpful, if you can, to separate debug mode input sizes from release mode (e.g. having the debug mode only work with an input that is 1/10th of the original size). That does cause discrepancies between release and debug as a negative, but the positives sometimes outweigh the negatives from a productivity standpoint. Another strategy is to build parts of your code in release and just debug the parts you're interested in, like building a plugin in debug against a host application in release.
Approach at Your Own Peril
With that aside, if you really want to make your debug builds run faster and accept all the associated risks, then the main way is to give your compiler less work to optimize away. That typically means flatter code with more plain old data types, fewer function calls, and so forth.
First and foremost, you might be spending a lot of time on debug mode assertions for safety. See things like checked iterators and how to disable them:
https://msdn.microsoft.com/en-us/library/aa985965.aspx
For your case, you can easily flatten your nested loop into a single loop. There's no need to create these pixel masks with separate containers per scanline, since you can always get at your scanline data with some basic arithmetic (y*image_width or y*image_stride). So initially I'd flatten the loop. That might even help modestly for release mode. I don't know the SFML API so I'll illustrate with pseudocode.
const int num_pixels = image.w * image.h;
vector<bool> pixelMask(num_pixels);
for (int j=0; j < num_pixels; ++j)
    pixelMask[j] = image.pixelAlpha(j) > 0;
Just that already might help a lot. Hopefully SFML lets you access pixels with a single index without having to specify column and row (x and y). If you want to go even further, it might help to grab the pointer to the array of pixels from SFML (also hopefully possible) and use that:
vector<bool> pixelMask(image.w * image.h);
const unsigned int* pixels = image.getPixels();
for (int j=0; j < num_pixels; ++j)
{
    // Assuming 32-bit pixels (should probably use uint32_t).
    // Note that no right shift is necessary when you just want
    // to check for non-zero values.
    const unsigned int alpha = pixels[j] & 0xff000000;
    pixelMask[j] = alpha > 0;
}
Also vector<bool> stores each boolean as a single bit. That saves memory but translates to some more instructions for random-access. Sometimes you can get a speed up even in release by just using more memory. I'd test both release and debug and time carefully, but you can try this:
vector<char> pixelMask(image.w * image.h);
const unsigned int* pixels = image.getPixels();
char* pixelUsed = &pixelMask[0];
for (int j=0; j < num_pixels; ++j)
{
    const unsigned int alpha = pixels[j] & 0xff000000;
    pixelUsed[j] = alpha > 0;
}
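The pseudocode above can be turned into a small self-contained C++ sketch that works on a raw byte array instead of an SFML type. It assumes tightly packed 8-bit RGBA pixels, which is the layout I would expect from sf::Image::getPixelsPtr() in SFML 2, but check that against the SFML version you use. The function names buildAlphaMask and maskAt are made up for this example, and the mask is stored row-major (y * width + x) rather than as pixelMask[x][y].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// `rgba` holds width*height pixels, 4 bytes each in R, G, B, A order.
// Returns one char per pixel: 1 if the pixel has any opacity, else 0.
std::vector<char> buildAlphaMask(const std::uint8_t* rgba,
                                 std::size_t width, std::size_t height)
{
    const std::size_t numPixels = width * height;
    assert(rgba != nullptr || numPixels == 0);

    std::vector<char> mask(numPixels);
    for (std::size_t i = 0; i < numPixels; ++i)
        mask[i] = rgba[i * 4 + 3] > 0;  // alpha is the 4th byte of each pixel
    return mask;
}

// Row-major lookup that replaces the nested per-line vectors.
inline bool maskAt(const std::vector<char>& mask, std::size_t width,
                   std::size_t x, std::size_t y)
{
    return mask[y * width + x] != 0;
}
```

Reading the alpha byte directly avoids the byte-order assumption behind the 0xff000000 mask, at the cost of one multiplication per pixel. As with the variants above, time both debug and release builds before settling on one.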

Loops are faster when they work with constants:
1. In for (unsigned int i = 0; i < image.getSize().x; i++), call image.getSize() once before the loop and store the result.
2. Move the mask for one line (std::vector<bool> tempMask;) out of the loop and reuse it. I assume all lines have the same length.
This should speed things up a bit.
Note that a debug build produces very different machine code from a release build.

How to run a graph algorithm concurrently in Java using multi-core parallelism

I want to run an algorithm on large graphs concurrently, using multi-core parallelism. I have been working on it for a while, but haven't been able to come up with a good solution.
This is the naive algorithm:
W - a very large number
double weight = 0
while(weight < W)
- v : get_random_node_from(Graph)
- weight += calculate(v)
I looked into fork-and-join, but can't figure out a way to divide this problem into smaller subproblems.
Then I tried using Java 8 streams, for which I need to create a lambda expression. When I tried doing something like this:
double weight = 0
Callable<Object> task = () -> {
can not update weight here, as it needs to be final
}
My question is, is it possible to update a variable like weight in a lambda method? Or is there a better way in which this problem can be solved?
The closest I have got is by using ExecutorService, but run into the problems of synchronization.
------------EDIT--------------
Here is the detailed algorithm:
In a nutshell, what I am trying to do, is traverse a massive graph, perform an operation on randomly selected nodes(as long as weight < W) and update a global structure Index.
This is taking too long as it doesn't utilize the full power of the CPU.
Ideally, all threads/processes on multiple cores would perform the operations on the randomly selected nodes, and update the shared weight and Index.
Note: It doesn't matter if different threads pick up the same node, as it's random without replacement.
Algorithm:
function Serial () {
    List<List<Integer>> I (shared data structure which I want to update)
    double weight
    //// Task which I want to parallelize
    while(weight < W) {
        v : get_random_node_from(Graph)
        bfs(v, affected_nodes) ...// this will fill up affected_nodes by v
        foreach(affected_node in affected_nodes) {
            // update I related to affected_node
            // and do other computation
        }
        weight += affected_nodes.size()
    }
    ///////// Parallelization ends here
    use_index(I) // I is passed now to some other method(not important) to get further results
}
The important thing is, all threads update the same I and weight.
Thanks.

Well, you could wrap that weight into an array of a single element. It's a known trick for this kind of thing, one that is even used internally by Java:
weight[0] = weight[0] + calculate(v);
But there are problems with this, since you are going to run it in parallel. You will not get the result you want, because updating weight[0] this way is not thread-safe. You could use some sort of synchronization, but Java already has a great solution for that: DoubleAdder, which scales far better in contended environments (and on multiple CPUs).
A trivial and small example:
DoubleAdder weight = new DoubleAdder();
private static int calculate(int v) {
    return v + 1;
}
Stream.of(1, 2, 3, 4, 5, 6, 7, 8, 9)
    .parallel()
    .forEach(x -> {
        int y = calculate(x);
        weight.add(y);
    });
System.out.println(weight); // 54.0
Then there is the problem of the randomizer that you are going to choose for this: get_random_node_from(Graph). You need to get a random Node indeed, but at the same time you need to get all of them exactly once.
But you might not need it if you can flatten all the nodes into a single List, let's say.
The problem here is that graphs are usually traversed in a recursive way, so you don't know the exact size up front:
while(parent.hasChildren) {
    traverse children and so on...
}
This will parallelize badly under Streams; you can look at Spliterators#spliteratorUnknownSize yourself. Its batch size grows arithmetically from 1024. That's why I suggest flattening the Nodes into a single List with a known size, which will parallelize much better.
