GLSL SpinLock only Mostly Works

I have implemented a depth peeling algorithm using a GLSL spinlock (inspired by this). In the following visualization, notice how the depth peeling algorithm mostly functions correctly (first layer top left, second layer top right, third layer bottom left, fourth layer bottom right). The four depth layers are stored in a single RGBA texture.
Unfortunately, the spinlock sometimes fails to prevent errors: you can see little white speckles, particularly in the fourth layer, and there is also one on the wing of the spaceship in the second layer. These speckles vary each frame.
In my GLSL spinlock, when a fragment is to be drawn, the fragment program atomically reads and writes a locking value in a separate locking texture, waiting until a 0 shows up, indicating that the lock is open. In practice, I found that the whole operation must live inside the loop: if two threads in the same warp land on the same pixel, one cannot simply spin while the other proceeds, because all threads in a GPU warp execute in lockstep.
My fragment program looks like this (comments and spacing added):
#version 420 core

//locking texture
layout(r32ui) coherent uniform uimage2D img2D_0;
//data texture, also render target
layout(rgba32f) coherent uniform image2D img2D_1;

//Inserts "new_data" into "data", a sorted list
vec4 insert(vec4 data, float new_data) {
    if      (new_data < data.x) return vec4(new_data, data.xyz);
    else if (new_data < data.y) return vec4(data.x, new_data, data.yz);
    else if (new_data < data.z) return vec4(data.xy, new_data, data.z);
    else if (new_data < data.w) return vec4(data.xyz, new_data);
    else return data;
}

void main() {
    ivec2 coord = ivec2(gl_FragCoord.xy);
    //The idea here is to keep looping over a pixel until a value is written.
    //By looping over the entire logic, threads in the same warp aren't stalled
    //by other waiting threads. The first imageAtomicExchange call sets the
    //locking value to 1. If the locking value was already 1, then someone
    //else has the lock, and can_write is false. If the locking value was 0,
    //then the lock is free, and can_write is true. The depth is then read,
    //the new value inserted, but only written if can_write is true (the
    //locking texture was free). The second imageAtomicExchange call resets
    //the lock back to 0.
    bool have_written = false;
    while (!have_written) {
        bool can_write = (imageAtomicExchange(img2D_0, coord, 1u) != 1u);
        memoryBarrier();
        vec4 depths = imageLoad(img2D_1, coord);
        depths = insert(depths, gl_FragCoord.z);
        if (can_write) {
            imageStore(img2D_1, coord, depths);
            have_written = true;
        }
        memoryBarrier();
        imageAtomicExchange(img2D_0, coord, 0u);
        memoryBarrier();
    }
    discard; //Already wrote to render target with imageStore
}
My question is: why does this speckling occur? I want the spinlock to work 100% of the time! Could it relate to my placement of memoryBarrier()?

For reference, here is locking code that has been tested to work on NVIDIA drivers 314.22 and 320.18 on a GTX 670. Note that existing compiler optimization bugs are triggered if the code is reordered or rewritten into logically equivalent code (see the comments below). Also note that the code below uses bindless image references.
// sem is initialized to zero
coherent uniform layout(size1x32) uimage2D sem;

void main(void)
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    bool done = false;
    uint locked = 0;
    while (!done)
    {
        // locked = imageAtomicCompSwap(sem, coord, 0u, 1u); will NOT work
        locked = imageAtomicExchange(sem, coord, 1u);
        if (locked == 0)
        {
            performYourCriticalSection();
            memoryBarrier();
            imageAtomicExchange(sem, coord, 0u);
            // replacing this with a break will NOT work
            done = true;
        }
    }
    discard;
}

The "imageAtomicExchange(img2D_0,coord,0);" needs to be inside the if statement: as written, it resets the lock variable even for threads that never acquired it! Moving it inside the if fixes the speckling.
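For clarity, here is a minimal sketch of the corrected loop, reusing the names from the question's shader; the only change is that the unlock has moved inside the if, so only the thread that actually took the lock releases it:

```glsl
bool have_written = false;
while (!have_written) {
    // Attempt to take the lock; can_write is true only if it was free (0).
    bool can_write = (imageAtomicExchange(img2D_0, coord, 1u) != 1u);
    memoryBarrier();
    vec4 depths = imageLoad(img2D_1, coord);
    depths = insert(depths, gl_FragCoord.z);
    if (can_write) {
        imageStore(img2D_1, coord, depths);
        memoryBarrier();                            // make the store visible first
        imageAtomicExchange(img2D_0, coord, 0u);    // unlock only if we held the lock
        have_written = true;
    }
}
```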

Related

GLSL while loop performance is independent from work done inside of it

I'm currently trying to implement a path tracer inside a fragment shader, which leverages a very simple BVH.
The code for the BVH intersection is based on the following idea:
bool BVHintersects(Ray ray) {
    Object closestObject;
    vec2 toVisit[100]; // using a stack to keep track of which node should be tested against the current ray
    int stackPointer = 1;
    toVisit[0] = vec2(0.0, 0.0); // coordinates of the root node in the BVH hierarchy
    while (stackPointer > 0) {
        stackPointer--; // pop the BVH node to examine
        if (!leaf) {
            // examine the BVH node and eventually update stackPointer and toVisit
        }
        if (leaf) {
            // examine the leaf and eventually update the closestObject entry
        }
    }
}
The problem with the above code is that on the second light bounce something very strange starts to happen, assuming I'm calculating light bounces this way:
vec3 color = vec3(0.0);
vec3 normal = vec3(0.0);
// first light bounce
bool intersects = BVHintersect(ro, rd, color, normal);
vec3 lightPos = vec3(5, 15, 0);
// updating ray origin & direction
ro = ro + rd * (t - 0.01);
rd = normalize(lightPos - ro);
// second light bounce used only to calculate shadows
bool shadowIntersects = BVHintersect(ro, rd, color, normal);
The second call to BVHintersect runs indefinitely because the while loop never exits. But from many tests I've done on that second call, I'm sure the stackPointer does eventually go back to 0; in fact, if I place the following code just inside the while loop:
int iterationsMade = 0;
while (stackPointer > 0) {
    iterationsMade++;
    if (iterationsMade > 100) {
        break;
    }
    // the rest of the loop
}
// after the function ends it also returns "iterationsMade"
the variable "iterationsMade" always stays under 100, so the while loop is not running infinitely. But performance-wise it is as if I did 100 iterations, even though "iterationsMade" never exceeds, say, 10 or 20. Increasing the hardcoded 100 to a bigger value degrades performance linearly.
What could be the possible causes for this behaviour? Why would that second call to BVHintersect appear stuck inside the while loop if it never does more than 10-20 iterations?
Source for the BVHintersect function:
https://pastebin.com/60SYRQAZ
So, there's a funny thing about loops in shaders (or most SIMD circumstances):
The entire wave takes at least as long to execute as its slowest thread, so if one thread needs ~100 iterations, they ALL take 100 iterations. Depending on your platform and compiler, the loop may be unrolled to 100 iterations (or whatever upper bound you choose). Anything after the break won't affect the final output, but the rest of the unrolled loop still has to be processed. Early-out isn't always possible.
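As a sketch of why the compile-time bound, not the dynamic iteration count, is what you pay for (the constant MAX_ITER here is hypothetical, standing in for the hardcoded 100):

```glsl
const int MAX_ITER = 100;  // the compiler may fully unroll the loop to this bound
int iterationsMade = 0;
while (stackPointer > 0) {
    if (++iterationsMade > MAX_ITER) break;  // dynamic early-out
    // ... node/leaf tests ...
}
// Even if the loop breaks after 10 iterations at runtime, an unrolled shader
// still contains all MAX_ITER copies of the body, and a wave pays for the
// slowest lane in every one of them.
```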
There are a number of ways around this, but perhaps the most straightforward is to do this in multiple passes with a lower max iterations value.
I would also run your shader through a compiler and look at the generated code. Compare different versions with different max iterations and look at things like the length of compiled shader.
See this answer for a little more information.

Compute shader does not write to buffer?

I am trying to do culling on a compute shader.
My problem is that my atomic counter does not seem to get written by the shader, or it does but then gets zeroed?
RenderDoc says it has no data, but there are values in InstancesOut (see the picture at the bottom).
This is my compute shader:
#version 450
#extension GL_ARB_separate_shader_objects : enable
#extension GL_ARB_shading_language_420pack : enable

struct Indirect
{
    uint indexCount;
    uint instanceCount;
    uint firstIndex;
    uint vertexOffset;
    uint firstInstance;
};

struct Instance
{
    vec4 position;
};

layout (binding = 0, std430) buffer IndirectDraws
{
    Indirect indirects[];
};

layout (binding = 1) uniform UBO
{
    vec4 frustum[6];
} ubo;

layout (binding = 2, std140) readonly buffer Instances
{
    Instance instances[];
};

layout (binding = 3, std140) writeonly buffer InstancesOut
{
    Instance instancesOut[];
};

layout (binding = 4) buffer Counter
{
    uint counter;
};

bool checkFrustrum(vec4 position, float radius)
{
    for (uint i = 0; i < 6; i++)
        if (dot(position, ubo.frustum[i]) + radius < 0.0)
            return false;
    return true;
}

layout (local_size_x = 1) in;

void main()
{
    uint i = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * gl_NumWorkGroups.x * gl_WorkGroupSize.x;
    uint instanceCount = 0;
    if (i == 0)
        atomicExchange(counter, 0);
    for (uint x = 0; x < indirects[i].instanceCount; x++)
    {
        vec4 position = instances[indirects[i].firstInstance + x].position;
        //if (checkFrustrum(position, 1.0))
        //{
            instancesOut[atomicAdd(counter, 1)].position = position;
            instanceCount++;
        //}
    }
    //indirects[i].instanceCount = instanceCount;
    indirects[i].instanceCount = i; // testing
}
Picture of buffers in RenderDoc
Thanks for your help!
There's a lot you seem to be misunderstanding about how synchronization and workgroups work.
Within a compute shader, atomics let you synchronize across workgroups. However, there is no guarantee on the order in which workgroups execute, so atomicExchange(counter, 0); is not guaranteed to happen before other workgroups start incrementing the counter. Error #1?
A workgroup size of 1 is a tremendous waste of resources, particularly if you're going through the expense of synchronizing across workgroups. Synchronization within a workgroup will always be fastest, and it lets you actually use your GPU's resources: most GPUs are organized into modules of SIMD processors that execute only one workgroup at a time, so with size-1 workgroups, 31/32 or 63/64 of those lanes sit idle (caveat: most of those processors can hold multiple workgroups in memory simultaneously, but execution happens on only one at any given moment). Further, within a workgroup you can order operations with barriers. Error #2?
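As a minimal sketch of that idea (the bindings for Instances, InstancesOut, and Counter reuse the question's layout; the visible() placeholder stands in for the frustum test, and the counter must still be zeroed from the host before the dispatch, since no cross-workgroup ordering exists): a 64-wide workgroup can compact its results through shared memory and pay only one global atomic per workgroup.

```glsl
#version 450
layout (local_size_x = 64) in;

struct Instance { vec4 position; };
layout (binding = 2, std140) readonly buffer Instances { Instance instances[]; };
layout (binding = 3, std140) writeonly buffer InstancesOut { Instance instancesOut[]; };
layout (binding = 4) buffer Counter { uint counter; };

shared uint localCount; // number of survivors in this workgroup
shared uint globalBase; // this workgroup's slice of the output buffer

bool visible(vec4 position) { return true; } // placeholder for the frustum test

void main()
{
    if (gl_LocalInvocationID.x == 0u) localCount = 0u;
    barrier();

    uint i = gl_GlobalInvocationID.x;
    vec4 p = instances[i].position;
    bool keep = visible(p);
    uint slot = 0u;
    if (keep) slot = atomicAdd(localCount, 1u); // cheap shared-memory atomic
    barrier();

    // one global atomic per workgroup instead of one per instance
    if (gl_LocalInvocationID.x == 0u) globalBase = atomicAdd(counter, localCount);
    barrier();

    if (keep) instancesOut[globalBase + slot].position = p;
}
```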
atomicCounterIncrement is probably a better instruction if you're only ever adding one.
In your particular application, why is the result in InstancesOut wrong? It actually looks right to me: every input ended up in the output, just without a guaranteed order (workgroups are not guaranteed to execute in any particular order; that's how parallel execution works). If you want them in order, compute the output index from the invocation IDs.
As for why RenderDoc doesn't show a value in counter, I don't know; it should have a value if it's mapped correctly.

GLSL Compute Shader Setting buffer with lookup table results in no data written, setting the same buffer with other data works

I am attempting to implement a slightly modified version of this standard marching cubes algorithm in a compute shader.
I have reached the stage at which triTable is used to insert the correct vertex indices into a buffer, and I have flattened the table to one dimension (const int triTable[4096]={-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,8,3...}).
The following code shows the error that I am experiencing (this does not implement the algorithm however it demonstrates the current issue fully):
layout(binding = 1) buffer Grid
{
    float GridData[]; //contains 512*512*512 data volume previously generated, unused in this test case
};

uniform uint marchableCount;
uniform uint pointCount;

layout(std430, binding = 4) buffer X { uvec4 marchableList[]; }; //format is x,y,z,cubeIndex
layout(std430, binding = 5) buffer v { vec4 vertices[]; };
layout(std430, binding = 6) buffer n { vec4 normals[]; };
layout(binding = 7) uniform atomic_uint triCount;

void main()
{
    uvec3 gid = marchableList[gl_GlobalInvocationID.x].xyz; //xyz of grid cell
    int E = int(edgeTable[marchableList[gl_GlobalInvocationID.x].w]);
    if (E != 0)
    {
        uint cubeIndex = marchableList[gl_GlobalInvocationID.x].w;
        uint index = atomicCounterIncrement(triCount);
        int tCount = 0; //unused in this test, used for iteration in actual algorithm
        int tGet = tCount + 16*int(cubeIndex); //correction from converting 2d array to 1d array
        vertices[index] = vec4(tGet);
    }
}
This code produces expected values: the vertices buffer is filled with data and the atomic counter increments
changing this line:
vertices[index] = vec4(tGet);
to
vertices[index] = vec4(triTable[tGet]);
or
vertices[index] = vec4(triTable[tGet]+1);
(demonstrating that triTable is not coincidentally returning zeros)
results in what appears to be a complete failure of the shader: the buffer is filled with zeros and the atomic counter does not increment. No error messages are output when the shader is compiled. tGet is less than 4096.
The following test cases also produce the correct output:
vertices[index] = vec4(triTable[3]); //-1
vertices[index] = vec4(triTable[4095]); //also -1
showing that triTable is in fact implemented correctly
What causes the shader to have issues in these very specific cases?
I'm more surprised that "const int triTable[4096] = {...};" compiles at all. That array, if it is actually materialized, is 16KB in size. That's a lot for a shader, even if the array lives in shared memory.
What is most likely happening is that, whenever the compiler detects a usage of this array that it can't optimize down to a simple value (triTable[3] is a compile-time constant, so the compiler doesn't need to store the whole table for that), the compilation either fails or produces a non-functional shader.
It would be best to make this table a uniform buffer. An SSBO might work too, but some hardware implements uniform blocks through specialized memory rather than a global memory fetch.
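A minimal sketch of the uniform-buffer approach (the block name TriTable and the binding index are made up for illustration): because std140 pads array elements to 16 bytes, packing the 4096 ints as 1024 ivec4s avoids wasting three quarters of the buffer, and 16KB fits within the guaranteed minimum uniform block size.

```glsl
layout(std140, binding = 8) uniform TriTable
{
    ivec4 triTablePacked[1024]; // 4096 ints, four per ivec4, filled from the CPU
};

int triTableLookup(int i)
{
    return triTablePacked[i >> 2][i & 3]; // i / 4 selects the ivec4, i % 4 the component
}
```

Inside main(), the failing line would then become vertices[index] = vec4(triTableLookup(tGet));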

Compare two states in the shader

How can I compare, inside the shader, two states from different frames by means of a global variable? I need to compare the mouse position between two frames and, if it didn't change, do {bla bla bla}.
For example:
vec2 focusNew = vec2(0.0);
float x;
float y;

void main()
{
    vec2 focus = vec2(x, y - 1.0);
    if (length(focusNew - focus) <= 0.00001) // i.e. focusNew == focus
        {bla bla bla}
    focusNew = focus;
}
But focusNew doesn't save current state.
You can't. Or at least not that way. Remember: shaders run thousands of times per frame.
I would explain how you might actually do that, but it's abundantly clear that this comparison doesn't belong in the shader at all. The mouse state changes from frame to frame, but that all happens on the CPU, and it happens once per frame, not once per shader invocation. Every invocation would therefore compute the same value.
So there's no point in making the shader do it. Do the comparison on the CPU, then provide a uniform that tells the shader(s) whether or not to do the {bla bla bla}.
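A minimal sketch of the shader side of that setup (the uniform name uMouseUnchanged is made up for illustration; the application compares the current and previous mouse positions each frame and sets the uniform accordingly):

```glsl
uniform bool uMouseUnchanged; // set from the CPU: true if the mouse did not move

void main()
{
    if (uMouseUnchanged)
    {
        // {bla bla bla}
    }
}
```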

GLSL Channel Selection

I have a GLSL shader that reads from one of the channels (e.g. R) of an input texture and then writes to the same channel in an output texture. This channel has to be selected by the user.
What I can think of right now is to just use an int uniform and tons of if-statements:
uniform sampler2D uTexture;
uniform int uChannelId;
varying vec2 vUv;

void main() {
    //read in data from texture
    vec4 t = texture2D(uTexture, vUv);
    float data;
    if (uChannelId == 0) {
        data = t.r;
    } else if (uChannelId == 1) {
        data = t.g;
    } else if (uChannelId == 2) {
        data = t.b;
    } else {
        data = t.a;
    }
    //process the data...
    float result = data * 2.0; //for example
    //write out
    if (uChannelId == 0) {
        gl_FragColor = vec4(result, t.g, t.b, t.a);
    } else if (uChannelId == 1) {
        gl_FragColor = vec4(t.r, result, t.b, t.a);
    } else if (uChannelId == 2) {
        gl_FragColor = vec4(t.r, t.g, result, t.a);
    } else {
        gl_FragColor = vec4(t.r, t.g, t.b, result);
    }
}
Is there any way of doing something like a dictionary access such as t[uChannelId]?
Or perhaps I should have 4 different versions of the same shader, each of which processes a different channel, so that I can avoid all the if-statements?
What is the best way to do this?
EDIT: To be more specific, I am using WebGL (Three.js)
There is such a way, and it is as simple as what you wrote in the question: just use t[uChannelId]. To quote the GLSL spec (this is from version 3.30, section 5.5, but it applies to other versions as well):
Array subscripting syntax can also be applied to vectors to provide numeric indexing. So in
vec4 pos;
pos[2] refers to the third element of pos and is equivalent to pos.z. This allows variable indexing into a
vector, as well as a generic way of accessing components. Any integer expression can be used as the
subscript. The first component is at index zero. Reading from or writing to a vector using a constant
integral expression with a value that is negative or greater than or equal to the size of the vector is illegal.
When indexing with non-constant expressions, behavior is undefined if the index is negative, or greater
than or equal to the size of the vector.
Note that for the first part of your code, you use this to access a specific channel of a texture. You could also use the ARB_texture_swizzle functionality. In that case, you would use a fixed channel, say r, for access in the shader, and then swizzle the actual texture channels so that whatever channel you want to access becomes r.
Update: since the target platform turned out to be WebGL, these suggestions are not available. However, a simple solution is to use a vec4 uniform in place of uChannelId that is 1.0 for the selected component and 0.0 for all others. Say this variable is called uChannelSel. You can then use data = dot(t, uChannelSel) for the first part and gl_FragColor = (vec4(1.0) - uChannelSel) * t + uChannelSel * result for the second part.
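Putting that suggestion together as a complete WebGL-style fragment shader (reusing the question's names; mix() computes the same blend as the expression above, and data * 2.0 stands in for the real processing):

```glsl
uniform sampler2D uTexture;
uniform vec4 uChannelSel; // 1.0 in the selected channel, 0.0 elsewhere
varying vec2 vUv;

void main() {
    vec4 t = texture2D(uTexture, vUv);
    float data = dot(t, uChannelSel);   // pick out the selected channel, no branches
    float result = data * 2.0;          // process the data (example)
    gl_FragColor = mix(t, vec4(result), uChannelSel); // write back only that channel
}
```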
As I'm sure you know, branching can be expensive in shaders. However, it sounds like it will always be the same channel within a pass (yes?), so you might maintain enough coherence to see good performance.
It's been a good while since I've used GLSL, but if you're using a newer version, maybe you could do some bitwise-shifting (<< or >>) magic: read the texture into an int instead of a vec4, then shift it by a number of bits depending on which channel you want to read.