Compute Shader - gl_GlobalInvocationID and local_size - opengl

While trying to implement a naive compute shader that assigns affecting lights to a cluster, I encountered behavior that was unexpected (well, for a noob like me):
I invoke this shader with glDispatchCompute(32, 32, 32);, and it is supposed to write a [light counter + 8 indices] record for each invocation into the "indices" buffer. But while debugging, I found that my writes into that buffer overlap between invocations, even though I use a unique clusterId. I can detect it because indices[outIndexStart] climbs above 8 and there is visual flickering.
According to the documentation, gl_GlobalInvocationID is gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID. But if I set all local sizes to 1, the write issues go away. Why does local_size affect this code in such a way? And how can I reason about choosing its value here?
#version 430
layout (local_size_x = 4, local_size_y = 4, local_size_z = 4) in;

uniform int lightCount;

const uint clusterSize = 32;
const uint clusterSquared = clusterSize * clusterSize;

struct LightInfo {
    vec4 color;
    vec3 position;
    float radius;
};

layout(std430, binding = 0) buffer occupancyGrid {
    int exists[];
};

layout(std430, binding = 2) buffer lightInfos {
    LightInfo lights[];
};

layout(std430, binding = 1) buffer outputList {
    int indices[];
};

void main(){
    uint clusterId = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * clusterSize + gl_GlobalInvocationID.z * clusterSquared;
    if(exists[clusterId] == 0)
        return;

    //... not so relevant calculations (wordSpace is computed here)

    uint outIndexStart = clusterId * 9;
    uint outOffset = 1;
    for(int i = 0; i < lightCount && outOffset < 9; i++){
        if(distance(lights[i].position, wordSpace.xyz) < lights[i].radius) {
            indices[outIndexStart + outOffset] = i;
            indices[outIndexStart]++;
            outOffset++;
        }
    }
}

Let's look at two declarations:
layout (local_size_x = 4, local_size_y = 4, local_size_z = 4) in;
and
const uint clusterSize = 32;
These say different things. The local_size declaration says that each work group is a 4x4x4 block of invocations, i.e. 64 per group. Combined with glDispatchCompute(32, 32, 32), the global grid is (32*4)^3 = 128x128x128 invocations. Your clusterSize, however, linearizes gl_GlobalInvocationID as if that grid were only 32 invocations wide. Distinct invocations therefore collapse onto the same clusterId: for example, global IDs (32, 0, 0) and (0, 1, 0) both produce clusterId 32, which is exactly the overlapping write you observed. With all local sizes set to 1 the grid really is 32x32x32, so the problem disappears.
If you want to fix this problem, make the linearization match the grid you actually dispatch. The simplest fix is to keep clusterSize = 32 and call glDispatchCompute(8, 8, 8), since 8 groups x 4 invocations cover 32 clusters per axis. More robustly, derive the stride from the system-provided values; gl_NumWorkGroups is only known at dispatch time, so this belongs in main() rather than in a const initializer:
uint clusterSize = gl_NumWorkGroups.x * gl_WorkGroupSize.x;
uint clusterSquared = clusterSize * clusterSize;
And you can even do this (note two GLSL details: uvec3 uses constructor syntax, not braces, and dot() is defined only for floating-point vectors, so the multiply-add is spelled out):
uvec3 linearizeInvocation = uvec3(1u, clusterSize, clusterSize * clusterSize);
...
uvec3 terms = gl_GlobalInvocationID * linearizeInvocation;
uint clusterId = terms.x + terms.y + terms.z;
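Putting that together, a minimal sketch of the corrected indexing (assuming glDispatchCompute(8, 8, 8), so the global grid is exactly 32x32x32; the omitted light-assignment code is unchanged):

#version 430
layout (local_size_x = 4, local_size_y = 4, local_size_z = 4) in;

void main(){
    // gl_NumWorkGroups is only known at dispatch time, so derive the grid size here
    uvec3 gridSize = gl_NumWorkGroups * gl_WorkGroupSize; // 8 * 4 = 32 per axis
    uint clusterId = gl_GlobalInvocationID.x
                   + gl_GlobalInvocationID.y * gridSize.x
                   + gl_GlobalInvocationID.z * gridSize.x * gridSize.y;
    // ... light assignment exactly as before, keyed on clusterId
}

With this formulation the linearization can never disagree with the dispatch, which also answers the question of how to reason about local_size: pick it for hardware efficiency (a total of 32 or 64 invocations per group is a typical starting point) and adjust the dispatch counts so that the product of the two still covers the cluster grid.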

Related

What's the GLSL equivalent of [[vk::binding(0, 0)]] RWStructuredBuffer<int> in a compute shader

I have this HLSL shader and I want to write the equivalent in GLSL. If it's of any use, I'm trying to run this example: https://github.com/mcleary/VulkanHpp-Compute-Sample
[[vk::binding(0, 0)]] RWStructuredBuffer<int> InBuffer;
[[vk::binding(1, 0)]] RWStructuredBuffer<int> OutBuffer;

[numthreads(1, 1, 1)]
void Main(uint3 DTid : SV_DispatchThreadID)
{
    OutBuffer[DTid.x] = InBuffer[DTid.x] * InBuffer[DTid.x];
}
The GLSL equivalent should look something like this:
layout(std430, binding = 0) buffer InBuffer {
    int inBuffer[];
};
layout(std430, binding = 1) buffer OutBuffer {
    int outBuffer[];
};

layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

void main()
{
    outBuffer[gl_GlobalInvocationID.x] = inBuffer[gl_GlobalInvocationID.x] * inBuffer[gl_GlobalInvocationID.x];
}
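One more detail: the two arguments of [[vk::binding(0, 0)]] are the binding and the descriptor set. If you are compiling the GLSL for Vulkan (e.g. with glslangValidator), you can spell the set out explicitly; a binding without a set qualifier lands in set 0:

layout(std430, set = 0, binding = 0) buffer InBuffer {
    int inBuffer[];
};
layout(std430, set = 0, binding = 1) buffer OutBuffer {
    int outBuffer[];
};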

I'm experiencing very slow OpenGL compute shader compilation (10+ minutes) when using larger work groups, is there anything I can do to speed it up?

So, I'm encountering a really bizarre (at least to me, as a compute shader noob) phenomenon when I compile my compute shader and check the result via glGetShaderiv(m_shaderID, GL_COMPILE_STATUS, &status). Inexplicably, my compute shader takes much longer to compile when I increase the size of my work groups! When I have one-dimensional work groups, it compiles in less than a second, but when I increase the work group size to 4x9x4 (dispatched as 4x1x6 groups over the 16x9x24 cluster grid), the compute shader takes 10+ minutes to compile! How strange.
For background, I'm trying to implement a light clustering algorithm (essentially the one shown here: http://www.aortiz.me/2018/12/21/CG.html#tiled-shading--forward), and my compute shader is this monster:
#version 430

// TODO: Figure out optimal tile size, currently using a 16x9x24 subdivision
#define FLT_MAX 3.402823466e+38
#define FLT_MIN 1.175494351e-38
#define DBL_MAX 1.7976931348623158e+308
#define DBL_MIN 2.2250738585072014e-308

layout(local_size_x = 4, local_size_y = 9, local_size_z = 4) in;

// TODO: Change to reflect my light structure
// struct PointLight{
//     vec4 position;
//     vec4 color;
//     uint enabled;
//     float intensity;
//     float range;
// };

// TODO: Pack this more efficiently
struct Light {
    vec4 position;
    vec4 direction;
    vec4 ambientColor;
    vec4 diffuseColor;
    vec4 specularColor;
    vec4 attributes;
    vec4 intensity;
    ivec4 typeIndexAndFlags;
    // uint flags;
};

// Array containing offset and number of lights in a cluster
struct LightGrid{
    uint offset;
    uint count;
};

struct VolumeTileAABB{
    vec4 minPoint;
    vec4 maxPoint;
};

layout(std430, binding = 0) readonly buffer LightBuffer {
    Light data[];
} lightBuffer;

layout (std430, binding = 1) buffer clusterAABB{
    VolumeTileAABB cluster[];
};

layout (std430, binding = 2) buffer screenToView{
    mat4 inverseProjection;
    uvec4 tileSizes;
    uvec2 screenDimensions;
};

// layout (std430, binding = 3) buffer lightSSBO{
//     PointLight pointLight[];
// };

// SSBO of active light indices
layout (std430, binding = 4) buffer lightIndexSSBO{
    uint globalLightIndexList[];
};

layout (std430, binding = 5) buffer lightGridSSBO{
    LightGrid lightGrid[];
};

layout (std430, binding = 6) buffer globalIndexCountSSBO{
    uint globalIndexCount;
};

// Shared variables, shared between all invocations WITHIN A WORK GROUP
// TODO: See if I can use gl_WorkGroupSize for this, gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z
// A group-shared array which contains all the lights being evaluated; size is thread count
shared Light sharedLights[4*9*4];

uniform mat4 viewMatrix;

bool testSphereAABB(uint light, uint tile);
float sqDistPointAABB(vec3 point, uint tile);
bool testConeAABB(uint light, uint tile);
float getLightRange(uint lightIndex);
bool isEnabled(uint lightIndex);

// Runs in batches of multiple Z slices at once
// In this implementation, 6 batches, since each thread group contains four z slices (24/4=6)
// We begin by each thread representing a cluster
// Then in the light traversal loop they change to representing lights
// Then change again near the end to represent clusters
// NOTE: Tiles actually mean clusters, it's just a legacy name from tiled shading
void main(){
    // Reset every frame
    globalIndexCount = 0; // How many lights are active in this scene
    uint threadCount = gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z; // Number of threads in a group, same as local_size_x * local_size_y * local_size_z
    uint lightCount = lightBuffer.data.length(); // Number of total lights in the scene
    uint numBatches = uint((lightCount + threadCount - 1) / threadCount); // Number of groups of lights that will be completed, i.e., number of passes

    uint tileIndex = gl_LocalInvocationIndex + gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z * gl_WorkGroupID.z;
    // uint tileIndex = gl_GlobalInvocationID; // doesn't work, is uvec3

    // Local thread variables
    uint visibleLightCount = 0;
    uint visibleLightIndices[100]; // local light index list, to be transferred to global list

    // Every light is being checked against every cluster in the view frustum
    // TODO: Perform active cluster determination
    // Each individual thread will be responsible for loading a light and writing it to shared memory so other threads can read it
    for( uint batch = 0; batch < numBatches; ++batch){
        uint lightIndex = batch * threadCount + gl_LocalInvocationIndex;

        // Prevent overflow by clamping to last light which is always null
        lightIndex = min(lightIndex, lightCount);

        // Populating shared light array
        // NOTE: It is VERY important that lightBuffer.data not be referenced after this point,
        // since that is not thread-safe
        sharedLights[gl_LocalInvocationIndex] = lightBuffer.data[lightIndex];
        barrier(); // Synchronize read/writes between invocations within a work group

        // Iterating within the current batch of lights
        for( uint light = 0; light < threadCount; ++light){
            if( isEnabled(light)){
                uint lightType = uint(sharedLights[light].typeIndexAndFlags[0]);
                if(lightType == 0){
                    // Point light
                    if( testSphereAABB(light, tileIndex) ){
                        visibleLightIndices[visibleLightCount] = batch * threadCount + light;
                        visibleLightCount += 1;
                    }
                }
                else if(lightType == 1){
                    // Directional light
                    visibleLightIndices[visibleLightCount] = batch * threadCount + light;
                    visibleLightCount += 1;
                }
                else if(lightType == 2){
                    // Spot light
                    if( testConeAABB(light, tileIndex) ){
                        visibleLightIndices[visibleLightCount] = batch * threadCount + light;
                        visibleLightCount += 1;
                    }
                }
            }
        }
    }

    // We want all thread groups to have completed the light tests before continuing
    barrier();

    // Back to every thread representing a cluster
    // Adding the light indices to the cluster light index list
    uint offset = atomicAdd(globalIndexCount, visibleLightCount);
    for(uint i = 0; i < visibleLightCount; ++i){
        globalLightIndexList[offset + i] = visibleLightIndices[i];
    }

    // Updating the light grid for each cluster
    lightGrid[tileIndex].offset = offset;
    lightGrid[tileIndex].count = visibleLightCount;
}

// Return whether or not the specified light intersects with the specified tile (cluster)
bool testSphereAABB(uint light, uint tile){
    float radius = getLightRange(light);
    vec3 center = vec3(viewMatrix * sharedLights[light].position);
    float squaredDistance = sqDistPointAABB(center, tile);
    return squaredDistance <= (radius * radius);
}

// TODO: Different test for spot-lights
// Has been done by using several AABBs for spot-light cone, this could be a good approach, or even just use one to start.
bool testConeAABB(uint light, uint tile){
    // Light light = lightBuffer.data[lightIndex];
    // float innerAngleCos = light.attributes[0];
    // float outerAngleCos = light.attributes[1];
    // float innerAngle = acos(innerAngleCos);
    // float outerAngle = acos(outerAngleCos);
    // FIXME: Actually do something clever here
    return true;
}

// Get range of light given the specified light index
float getLightRange(uint lightIndex){
    int lightType = sharedLights[lightIndex].typeIndexAndFlags[0];
    float range;
    if(lightType == 0){
        // Point light
        float brightness = 0.01; // cutoff for end of range
        float c = sharedLights[lightIndex].attributes.x;
        float lin = sharedLights[lightIndex].attributes.y;
        float quad = sharedLights[lightIndex].attributes.z;
        range = (-lin + sqrt(lin*lin - 4.0 * c * quad + (4.0/brightness) * quad)) / (2.0 * quad);
    }
    else if(lightType == 1){
        // Directional light
        range = FLT_MAX;
    }
    else{
        // Spot light
        range = FLT_MAX;
    }
    return range;
}

// Whether the light at the specified index is enabled
bool isEnabled(uint lightIndex){
    uint flags = sharedLights[lightIndex].typeIndexAndFlags[2];
    return (flags & 1) != 0; // test the flag bit; `flags | 1` would always be non-zero
}

// Get squared distance from a point to the AABB of the specified tile (cluster)
float sqDistPointAABB(vec3 point, uint tile){
    float sqDist = 0.0;
    VolumeTileAABB currentCell = cluster[tile];
    cluster[tile].maxPoint[3] = tile;
    for(int i = 0; i < 3; ++i){
        float v = point[i];
        if(v < currentCell.minPoint[i]){
            sqDist += (currentCell.minPoint[i] - v) * (currentCell.minPoint[i] - v);
        }
        if(v > currentCell.maxPoint[i]){
            sqDist += (v - currentCell.maxPoint[i]) * (v - currentCell.maxPoint[i]);
        }
    }
    return sqDist;
}
Edit: Whoops, lost the bottom part of this!
What I don't understand is why changing the size of the work groups affects compilation time at all. It sort of defeats the point of the algorithm if my work group sizes are too small for the compute shader to run efficiently, so I'm hoping there's something that I'm missing.
As a last note, I'd like to avoid using glGetProgramBinary as a solution. Not only because it merely circumvents the issue instead of solving it, but because pre-compiling shaders will not play nicely with the engine's current architecture.
So, I'm figuring that this must be a bug in the compiler, since I've replaced the loop in my sqDistPointAABB function with:
vec3 minPoint = currentCell.minPoint.xyz;
vec3 maxPoint = currentCell.maxPoint.xyz;
vec3 t1 = vec3(lessThan(point, minPoint));
vec3 t2 = vec3(greaterThan(point, maxPoint));
vec3 sqDist = t1 * (minPoint - point) * (minPoint - point) + t2 * (maxPoint - point) * (maxPoint - point);
return sqDist.x + sqDist.y + sqDist.z;
And it compiles just fine now, in less than a second! So strange

OpenGL buffer problem when adding >= 2^16 numbers

I'm facing some strange difficulties with an OpenGL buffer. I tried to shrink the problem down to a minimal source, so I created a program that increments each number in a FloatBuffer on every iteration. When I add fewer than 2^16 float numbers to the FloatBuffer, everything works just fine, but when I add >= 2^16 numbers, the numbers are not incremented and stay the same in each iteration.
Renderer:
public class Renderer extends AbstractRenderer {

    int computeShaderProgram;
    int[] locBuffer = new int[2];
    FloatBuffer data;
    int numbersCount = 65_536, round = 0; // 65_535 - OK, 65_536 - wrong

    @Override
    public void init() {
        computeShaderProgram = ShaderUtils.loadProgram(null, null, null, null, null,
                "/main/computeBuffer");

        glGenBuffers(locBuffer);

        // dataSizeInBytes = count of numbers to sort * (float=4B + padding=3*4B)
        int dataSizeInBytes = numbersCount * (1 + 3) * 4;
        data = ByteBuffer.allocateDirect(dataSizeInBytes)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        initBuffer();
        printBuffer(data);

        glBindBuffer(GL_SHADER_STORAGE_BUFFER, locBuffer[0]);
        glBufferData(GL_SHADER_STORAGE_BUFFER, data, GL_DYNAMIC_DRAW);

        glShaderStorageBlockBinding(computeShaderProgram, 0, 0);

        glViewport(0, 0, width, height);
    }

    private void initBuffer() {
        data.rewind();
        Random r = new Random();
        for (int i = 0; i < numbersCount; i++) {
            data.put(i * 4, r.nextFloat());
        }
    }

    @Override
    public void display() {
        if (round < 5) {
            glUseProgram(computeShaderProgram);
            glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, locBuffer[0]);
            glDispatchCompute(numbersCount, 1, 1);
            glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
            glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, data);
            printBuffer(data);
            round++;
        }
    }

    ...
}
Compute shader (/main/computeBuffer):
#version 430
#extension GL_ARB_compute_shader : enable
#extension GL_ARB_shader_storage_buffer_object : enable

layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

layout(binding = 0) buffer Input {
    float elements[];
} input_data;

void main () {
    input_data.elements[gl_WorkGroupID.x] = input_data.elements[gl_WorkGroupID.x] + 1;
}
glDispatchCompute(numbersCount, 1, 1);
You must not dispatch a compute workload whose work group count exceeds the corresponding GL_MAX_COMPUTE_WORK_GROUP_COUNT limit for each dimension. The spec only guarantees that limit to be at least 65535 per dimension, so it is very likely that you are simply exceeding the limit of your implementation; you can query the actual value with glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 0, ...) for the x dimension. Actually, you should be getting a GL_INVALID_VALUE error for that call, and you should really consider using a debug context with a debug message callback so that such obvious errors are spotted easily during development.
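As a sketch of the usual workaround (assuming numbersCount is a multiple of 64): move the parallelism into the local size and index by gl_GlobalInvocationID, so the work-group count drops by that factor and stays under the limit:

#version 430

layout (local_size_x = 64, local_size_y = 1, local_size_z = 1) in;

layout(binding = 0) buffer Input {
    float elements[];
} input_data;

void main () {
    // Dispatched as glDispatchCompute(numbersCount / 64, 1, 1),
    // so 2^16 elements need only 1024 work groups
    input_data.elements[gl_GlobalInvocationID.x] += 1.0;
}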

Find the maximum float in the array

I have a compute shader program which looks for the maximum value in a float array. It uses reduction (compare two values and save the bigger one to the output buffer).
Now I am not quite sure how to run this program from the Java code (using JogAmp). In the display() method I run the program once (each time with the halved array in the input SSBO, i.e., the result of the previous iteration) and finish when the result array has only one item - the maximum.
Is this the correct method? Creating and binding the input and output SSBOs, running the shader program and then checking how many items were returned, every time in the display() method?
Java code:
FloatBuffer inBuffer = Buffers.newDirectFloatBuffer(array);
gl.glBindBuffer(GL3ES3.GL_SHADER_STORAGE_BUFFER, buffersNames.get(1));
gl.glBufferData(GL3ES3.GL_SHADER_STORAGE_BUFFER, itemsCount * Buffers.SIZEOF_FLOAT, inBuffer,
        GL3ES3.GL_STREAM_DRAW);
gl.glBindBufferBase(GL3ES3.GL_SHADER_STORAGE_BUFFER, 1, buffersNames.get(1));
gl.glDispatchComputeGroupSizeARB(groupsCount, 1, 1, groupSize, 1, 1);
gl.glMemoryBarrier(GL3ES3.GL_SHADER_STORAGE_BARRIER_BIT);
ByteBuffer output = gl.glMapNamedBuffer(buffersNames.get(1), GL3ES3.GL_READ_ONLY);
Shader code:
#version 430
#extension GL_ARB_compute_variable_group_size : enable

layout (local_size_variable) in;

layout(std430, binding = 1) buffer MyData {
    vec4 elements[];
} data;

void main() {
    uint index = gl_GlobalInvocationID.x;
    float n1 = data.elements[index].x;
    float n2 = data.elements[index].y;
    float n3 = data.elements[index].z;
    float n4 = data.elements[index].w;
    data.elements[index].x = max(max(n1, n2), max(n3, n4));
}
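For reference, the usual way to cut down the number of passes is a work-group-local reduction in shared memory, so each dispatch shrinks the array by a whole work group's worth rather than by 4. A minimal sketch, assuming a fixed local size of 128, an input length padded to a multiple of 128, and a hypothetical second SSBO at binding 2 for the output:

#version 430

layout (local_size_x = 128) in;

layout(std430, binding = 1) buffer InData {
    float elementsIn[];
};
layout(std430, binding = 2) buffer OutData { // hypothetical output buffer; swap bindings between passes
    float elementsOut[];
};

shared float localMax[128];

void main() {
    // Each invocation loads one element into shared memory
    localMax[gl_LocalInvocationIndex] = elementsIn[gl_GlobalInvocationID.x];
    barrier();

    // Tree reduction: halve the number of active invocations each step
    for (uint stride = gl_WorkGroupSize.x / 2u; stride > 0u; stride /= 2u) {
        if (gl_LocalInvocationIndex < stride) {
            localMax[gl_LocalInvocationIndex] = max(localMax[gl_LocalInvocationIndex],
                                                    localMax[gl_LocalInvocationIndex + stride]);
        }
        barrier();
    }

    // One partial maximum per work group; the next dispatch reduces those
    if (gl_LocalInvocationIndex == 0u) {
        elementsOut[gl_WorkGroupID.x] = localMax[0];
    }
}

Each pass then reduces by a factor of 128 instead of 4, so the display() loop needs far fewer iterations; keep the glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) between passes and swap the input/output bindings each time.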

Convert GLSL to C or C++

As an exercise, I am trying to convert GLSL shaders into plain C/C++ that can be executed on the CPU instead of the GPU, regardless of this being much less efficient and slower.
Given that the data in C/C++ will be stored in an unsigned int array of pixels, how can I convert the next lines to something that will perform the same operation in plain C?
// GLSL
vec2 test = vec2(0.5, 0.2);
vec2 coord = vec2(0.5, 0.5);
vec3 output_color = texture2D(u_texture, coord - test).rgb;
I could only get up to this point:
// C/C++
short vec2_test_x = 127; // Equivalent to 0.5
short vec2_test_y = 51; // Equivalent to 0.2
short vec2_coord_x = 127; // Equivalent to 0.5
short vec2_coord_y = 127; // Equivalent to 0.5
short color_r, color_g, color_b;
int output_color = pixels[.... No idea how to continue....]
......
What you are asking about is a memory-mapping function. Note that the GLSL line samples at coord - test, so subtract the vectors first; assuming a 256x256 texture with GL_REPEAT-style wrap-around addressing, the following will do the trick:
short sample_x = (short) ((vec2_coord_x - vec2_test_x) & 255); // 0.5 - 0.5 = 0.0 -> 0
short sample_y = (short) ((vec2_coord_y - vec2_test_y) & 255); // 0.5 - 0.2 = 0.3 -> 76
int output_color = pixels[sample_y * 256 + sample_x];
// assuming output_color is stored in format XXBBGGRR:
color_r = (output_color & 0x000000FF);
color_g = (output_color & 0x0000FF00) >> 8;
color_b = (output_color & 0x00FF0000) >> 16;