Find the maximum float in the array - opengl

I have a compute shader program which looks for the maximum value in the float array. it uses reduction (compare two values and save the bigger one to the output buffer).
Now I am not quite sure how to run this program from the Java code (using jogamp). In the display() method I run the program once (every time with the halved array in the input SSBO = result from previous iteration) and finish this when the array with results has only one item - the maximum.
Is this the correct method? Every time in the display() method creating and binding input and output SSBO, running the shader program and then check how many items was returned?
Java code:
FloatBuffer inBuffer = Buffers.newDirectFloatBuffer(array);
gl.glBindBuffer(GL3ES3.GL_SHADER_STORAGE_BUFFER, buffersNames.get(1));
gl.glBufferData(GL3ES3.GL_SHADER_STORAGE_BUFFER, itemsCount * Buffers.SIZEOF_FLOAT, inBuffer,
GL3ES3.GL_STREAM_DRAW);
gl.glBindBufferBase(GL3ES3.GL_SHADER_STORAGE_BUFFER, 1, buffersNames.get(1));
gl.glDispatchComputeGroupSizeARB(groupsCount, 1, 1, groupSize, 1, 1);
gl.glMemoryBarrier(GL3ES3.GL_SHADER_STORAGE_BARRIER_BIT);
ByteBuffer output = gl.glMapNamedBuffer(buffersNames.get(1), GL3ES3.GL_READ_ONLY);
Shader code:
#version 430
#extension GL_ARB_compute_variable_group_size : enable
layout (local_size_variable) in;
layout(std430, binding = 1) buffer MyData {
vec4 elements[];
} data;
void main() {
uint index = gl_GlobalInvocationID.x;
float n1 = data.elements[index].x;
float n2 = data.elements[index].y;
float n3 = data.elements[index].z;
float n4 = data.elements[index].w;
data.elements[index].x = max(max(n1, n2), max(n3, n4));
}

Related

OpenGL how to get offset of array element in shared layout uniform block?

I have a shared layout uniform block in shader:
layout(shared) uniform TestBlock
{
int test[5];
};
How to get offset of test[3]?
When I try to use glGetUniformIndices to get index of test[3], it will return the same number of test[0]'s index.
So I cannot use glGetActiveUniformsiv to get offset of index of test[3].
Then, how to get offset of test[3]?
(Note that I don't want to use layout std140.)
Arrays of basic types like ints are treated as a single value. You can't get the offset of an individual element in the array. You can however query the array stride, the number of bytes from one element in the array to the next. Then you can just do the multiplication.
Using the new program introspection API:
auto ix = glGetProgramResourceIndex(prog, GL_UNIFORM, "TestBlock.test");
GLenum props[] = {GL_ARRAY_STRIDE, GL_OFFSET};
GLint values[2] = {};
glGetProgramResourceiv(prog, GL_UNIFORM, ix, 2, &props, 2, NULL, &values);
auto byteOffset = values[1] + (3 * values[0]);

I'm experiencing very slow OpenGL compute shader compilation (10+ minutes) when using larger work groups, is there anything I can do to speed it up?

So, I'm encountering a really bizarre (at least to me as a compute shader noob) phenomenon when I compile my compute shader using glGetShaderiv(m_shaderID, GL_COMPILE_STATUS, &status). Inexplicably, my compute shader takes much longer to compile when I increase the size of my work groups! When I have one-dimensional work groups, it compiles in less than a second, but when I increase the size of my work groups to 4x1x6, the compute shader takes 10+ minutes to compile! How strange.
For background, I'm trying to implement a light clustering algorithm (essentially the one shown here: http://www.aortiz.me/2018/12/21/CG.html#tiled-shading--forward), and my compute shader is this monster:
// TODO: Figure out optimal tile size, currently using a 16x9x24 subdivision
#define FLT_MAX 3.402823466e+38
#define FLT_MIN 1.175494351e-38
#define DBL_MAX 1.7976931348623158e+308
#define DBL_MIN 2.2250738585072014e-308
layout(local_size_x = 4, local_size_y = 9, local_size_z = 4) in;
// TODO: Change to reflect my light structure
// struct PointLight{
// vec4 position;
// vec4 color;
// uint enabled;
// float intensity;
// float range;
// };
// TODO: Pack this more efficiently
struct Light {
vec4 position;
vec4 direction;
vec4 ambientColor;
vec4 diffuseColor;
vec4 specularColor;
vec4 attributes;
vec4 intensity;
ivec4 typeIndexAndFlags;
// uint flags;
};
// Array containing offset and number of lights in a cluster
struct LightGrid{
uint offset;
uint count;
};
struct VolumeTileAABB{
vec4 minPoint;
vec4 maxPoint;
};
layout(std430, binding = 0) readonly buffer LightBuffer {
Light data[];
} lightBuffer;
layout (std430, binding = 1) buffer clusterAABB{
VolumeTileAABB cluster[ ];
};
layout (std430, binding = 2) buffer screenToView{
mat4 inverseProjection;
uvec4 tileSizes;
uvec2 screenDimensions;
};
// layout (std430, binding = 3) buffer lightSSBO{
// PointLight pointLight[];
// };
// SSBO of active light indices
layout (std430, binding = 4) buffer lightIndexSSBO{
uint globalLightIndexList[];
};
layout (std430, binding = 5) buffer lightGridSSBO{
LightGrid lightGrid[];
};
layout (std430, binding = 6) buffer globalIndexCountSSBO{
uint globalIndexCount;
};
// Shared variables, shared between all invocations WITHIN A WORK GROUP
// TODO: See if I can use gl_WorkGroupSize for this, gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z
// A grouped-shared array which contains all the lights being evaluated
shared Light sharedLights[4*9*4]; // A grouped-shared array which contains all the lights being evaluated, size is thread-count
uniform mat4 viewMatrix;
bool testSphereAABB(uint light, uint tile);
float sqDistPointAABB(vec3 point, uint tile);
bool testConeAABB(uint light, uint tile);
float getLightRange(uint lightIndex);
bool isEnabled(uint lightIndex);
// Runs in batches of multiple Z slices at once
// In this implementation, 6 batches, since each thread group contains four z slices (24/4=6)
// We begin by each thread representing a cluster
// Then in the light traversal loop they change to representing lights
// Then change again near the end to represent clusters
// NOTE: Tiles actually mean clusters, it's just a legacy name from tiled shading
void main(){
// Reset every frame
globalIndexCount = 0; // How many lights are active in t his scene
uint threadCount = gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z; // Number of threads in a group, same as local_size_x, local_size_y, local_size_z
uint lightCount = lightBuffer.data.length(); // Number of total lights in the scene
uint numBatches = uint((lightCount + threadCount -1) / threadCount); // Number of groups of lights that will be completed, i.e., number of passes
uint tileIndex = gl_LocalInvocationIndex + gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z * gl_WorkGroupID.z;
// uint tileIndex = gl_GlobalInvocationID; // doesn't wortk, is uvec3
// Local thread variables
uint visibleLightCount = 0;
uint visibleLightIndices[100]; // local light index list, to be transferred to global list
// Every light is being checked against every cluster in the view frustum
// TODO: Perform active cluster determination
// Each individual thread will be responsible for loading a light and writing it to shared memory so other threads can read it
for( uint batch = 0; batch < numBatches; ++batch){
uint lightIndex = batch * threadCount + gl_LocalInvocationIndex;
//Prevent overflow by clamping to last light which is always null
lightIndex = min(lightIndex, lightCount);
//Populating shared light array
// NOTE: It is VERY important that lightBuffer.data not be referenced after this point,
// since that is not thread-safe
sharedLights[gl_LocalInvocationIndex] = lightBuffer.data[lightIndex];
barrier(); // Synchronize read/writes between invocations within a work group
//Iterating within the current batch of lights
for( uint light = 0; light < threadCount; ++light){
if( isEnabled(light)){
uint lightType = uint(sharedLights[light].typeIndexAndFlags[0]);
if(lightType == 0){
// Point light
if( testSphereAABB(light, tileIndex) ){
visibleLightIndices[visibleLightCount] = batch * threadCount + light;
visibleLightCount += 1;
}
}
else if(lightType == 1){
// Directional light
visibleLightIndices[visibleLightCount] = batch * threadCount + light;
visibleLightCount += 1;
}
else if(lightType == 2){
// Spot light
if( testConeAABB(light, tileIndex) ){
visibleLightIndices[visibleLightCount] = batch * threadCount + light;
visibleLightCount += 1;
}
}
}
}
}
// We want all thread groups to have completed the light tests before continuing
barrier();
// Back to every thread representing a cluster
// Adding the light indices to the cluster light index list
uint offset = atomicAdd(globalIndexCount, visibleLightCount);
for(uint i = 0; i < visibleLightCount; ++i){
globalLightIndexList[offset + i] = visibleLightIndices[i];
}
// Updating the light grid for each cluster
lightGrid[tileIndex].offset = offset;
lightGrid[tileIndex].count = visibleLightCount;
}
// Return whether or not the specified light intersects with the specified tile (cluster)
bool testSphereAABB(uint light, uint tile){
float radius = getLightRange(light);
vec3 center = vec3(viewMatrix * sharedLights[light].position);
float squaredDistance = sqDistPointAABB(center, tile);
return squaredDistance <= (radius * radius);
}
// TODO: Different test for spot-lights
// Has been done by using several AABBs for spot-light cone, this could be a good approach, or even just use one to start.
bool testConeAABB(uint light, uint tile){
// Light light = lightBuffer.data[lightIndex];
// float innerAngleCos = light.attributes[0];
// float outerAngleCos = light.attributes[1];
// float innerAngle = acos(innerAngleCos);
// float outerAngle = acos(outerAngleCos);
// FIXME: Actually do something clever here
return true;
}
// Get range of light given the specified light index
float getLightRange(uint lightIndex){
int lightType = sharedLights[lightIndex].typeIndexAndFlags[0];
float range;
if(lightType == 0){
// Point light
float brightness = 0.01; // cutoff for end of range
float c = sharedLights[lightIndex].attributes.x;
float lin = sharedLights[lightIndex].attributes.y;
float quad = sharedLights[lightIndex].attributes.z;
range = (-lin + sqrt(lin*lin - 4.0 * c * quad + (4.0/brightness)* quad)) / (2.0 * quad);
}
else if(lightType == 1){
// Directional light
range = FLT_MAX;
}
else{
// Spot light
range = FLT_MAX;
}
return range;
}
// Whether the light at the specified index is enabled
bool isEnabled(uint lightIndex){
uint flags = sharedLights[lightIndex].typeIndexAndFlags[2];
return (flags | 1) != 0;
}
// Get squared distance from a point to the AABB of the specified tile (cluster)
float sqDistPointAABB(vec3 point, uint tile){
float sqDist = 0.0;
VolumeTileAABB currentCell = cluster[tile];
cluster[tile].maxPoint[3] = tile;
for(int i = 0; i < 3; ++i){
float v = point[i];
if(v < currentCell.minPoint[i]){
sqDist += (currentCell.minPoint[i] - v) * (currentCell.minPoint[i] - v);
}
if(v > currentCell.maxPoint[i]){
sqDist += (v - currentCell.maxPoint[i]) * (v - currentCell.maxPoint[i]);
}
}
return sqDist;
}
Edit: Whoops, lost the bottom part of this!
What I don't understand is why changing the size of the work groups affects compilation time at all? It sort of defeats the point of the algorithm if my work group sizes are too small for the compute shader to run efficiently, so I'm hoping there's something that I'm missing.
As a last note, I'd like to avoid using glGetProgramBinary as a solution. Not only because it merely circumvents the issue instead of solving it, but because pre-compiling shaders will not play nicely with the engine's current architecture.
So, I'm figuring that this must be a bug in the compiler, since I've replaced the loop in my sqDistPointAABB function with:
vec3 minPoint = currentCell.minPoint.xyz;
vec3 maxPoint = currentCell.maxPoint.xyz;
vec3 t1 = vec3(lessThan(point, minPoint));
vec3 t2 = vec3(greaterThan(point, maxPoint));
vec3 sqDist = t1 * (minPoint - point) * (minPoint - point) + t2 * (maxPoint - point) * (maxPoint - point);
return sqDist.x + sqDist.y + sqDist.z;
And it compiles just fine now, in less than a second! So strange

Shader storage block name issue

Something weird is happening with my shader storage blocks.
I have 2 SSBs:
#version 450 core
out vec4 out_color;
layout (binding = 0, std430) buffer A_SSB
{
float a_data[];
};
layout (binding = 1, std430) buffer B_SSB
{
float b_data[];
};
void main()
{
a_data[0] = 0.0f;
a_data[1] = 1.0f;
a_data[2] = 2.0f;
a_data[3] = 3.0f;
b_data[0] = 90.0f;
b_data[1] = 81.0f;
b_data[2] = 72.0f;
b_data[3] = 63.0f;
out_color = vec4(0.0f, 0.8f, 1.0f, 1.0f);
}
This is working well, but if i swap the SSB names like that:
layout (binding = 0, std430) buffer B_SSB
{
float a_data[];
};
layout (binding = 1, std430) buffer A_SSB
{
float b_data[];
};
the SSB indexes are swapped although they are hardcoded and data which should be written to a_data is written to b_data and vice versa.
Both SSBs are 250MB large, the max size is more than 2GB. It seems that the indexes are allocated alphabetical but this shouldn't happen. I'm binding the buffers like that:
glCreateBuffers(1, &a_ssb);
glNamedBufferStorage(a_ssb, 7187400 * 9 * sizeof(float), nullptr, GL_MAP_READ_BIT);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, a_ssb);
glShaderStorageBlockBinding(test_prog, 0, 0);
glCreateBuffers(1, &b_ssb);
glNamedBufferStorage(b_ssb, 7187400 * 9 * sizeof(float), nullptr, GL_MAP_READ_BIT);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, b_ssb);
glShaderStorageBlockBinding(test_prog, 1, 1);
Is this a bug or my fault? Also i would like to ask why i'm getting the error "lvalue in array access too complex or possible array index out of bounds" if i'm assigning values in a for loop?
for(unsigned int i = 0; i < 4; ++i)
a_data[i] = float(i);
glShaderStorageBlockBinding(test_prog, 0, 0);
This is your problem.
You assigned the binding index in the shader. You do not need to assign it again.
Your problem comes from the fact that you assigned it incorrectly.
The second parameter to this function is the index of the block you are assigning a binding index to. The only way to get a correct index is to query it via Program Introspection APIs. The block index is the resource index, queried through this call:
auto block_index = glGetProgramResourceIndex​(test_prog, GL_SHADER_STORAGE_BLOCK, "A_SSB");
It just so happened, in your original code, that the shader compiler assigned A_SSB's resource index to 0 and B_SSB's resource index to 1. This assignment was probably arbitrarily done based on their names. Thus, when you changed the names on them, the resource indices didn't change. So A_SSB was still resource index 0, but your shader assigned it binding index 1. Which was fine...
Until your C++ code overrode that assignment with your glShaderStorageBlockBinding(test_prog, 0, 0). That assigned resource index 0 (A_SSB) to binding index 0.
You should either set the binding index in the shader or in C++ code. Not in both.

Only first Compute Shader array element appears updated

Trying to send an array of integer to a compute shader, sets an arbitrary value to each integer and then reads back on CPU/HOST. The problem is that only the first element of my array gets updated. My array is initialized with all elements = 5 in the CPU, then I try to sets all the values to 2 in the Compute Shader:
C++ Code:
this->numOfElements = std::vector<int> numOfElements; //num of elements for each voxel
//Set the reset grid program as current program
glUseProgram(this->resetGridProgHandle);
//Binds and fill the buffer
glBindBuffer(GL_SHADER_STORAGE_BUFFER, this->counterBufferHandle);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(int) * numOfVoxels, this->numOfElements.data(), GL_DYNAMIC_DRAW);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, this->counterBufferHandle);
//Flag used in the buffer map function
GLint bufMask = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT;
//calc maximum size for workgroups
//glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_SIZE, &result);
//Executes the compute shader
glDispatchCompute(32, 1, 1); //
glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);
//Gets a pointer to the returned data
int* returnArray = (int *)glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_WRITE);
//Free the buffer mapping
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
Shader:
#version 430
layout (local_size_x = 32) in;
layout(binding = 0) buffer SSBO{
int counter[];
};
void main(){
counter[gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x] = 2;
}
If I print returnArray[0] it's 2 (correct), but any index > 0 gives me 5, which is the initial value initialized in host.
glMemoryBarrier(GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT);
//Gets a pointer to the returned data
int* returnArray = (int *)glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_WRITE);
The bit you use for glMemoryBarrier represents the way you want to read the data written by the shader. GL_VERTEX_ATTRIB_ARRAY_BARRIER_BIT says "I'm going to read this written data by using the buffer for vertex attribute arrays". In reality, you are going to read the buffer by mapping it.
So you should use the proper barrier bit:
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);
Ok i had this problem as well
i changed:
layout(binding = 0) buffer SSBO{
to:
layout(binding = 0, std430) buffer SSBO{

Order independent transparency with MSAA

I have implemented OIT based on the demo in "OpenGL Programming Guide" 8th edition.(The red book).Now I need to add MSAA.Just enabling MSAA screws up the transparency as the layered pixels are resolved x times equal to the number of sample levels.I have read this article on how it is done with DirectX where they say the pixel shader should be run per sample and not per pixel.How id it done in OpenGL.
I won't put out here the whole implementation but the fragment shader chunk in which the final resolution of the layered pixels occurs:
vec4 final_color = vec4(0,0,0,0);
for (i = 0; i < fragment_count; i++)
{
/// Retrieving the next fragment from the stack:
vec4 modulator = unpackUnorm4x8(fragment_list[i].y) ;
/// Perform alpha blending:
final_color = mix(final_color, modulator, modulator.a);
}
color = final_color ;
Update:
I have tried the solution proposed here but it still doesn't work.Here are the full fragment shader for the list build and resolve passes:
List build pass :
#version 420 core
layout (early_fragment_tests) in;
layout (binding = 0, r32ui) uniform uimage2D head_pointer_image;
layout (binding = 1, rgba32ui) uniform writeonly uimageBuffer list_buffer;
layout (binding = 0, offset = 0) uniform atomic_uint list_counter;
layout (location = 0) out vec4 color;//dummy output
in vec3 frag_position;
in vec3 frag_normal;
in vec4 surface_color;
in int gl_SampleMaskIn[];
uniform vec3 light_position = vec3(40.0, 20.0, 100.0);
void main(void)
{
uint index;
uint old_head;
uvec4 item;
vec4 frag_color;
index = atomicCounterIncrement(list_counter);
old_head = imageAtomicExchange(head_pointer_image, ivec2(gl_FragCoord.xy), uint(index));
vec4 modulator =surface_color;
item.x = old_head;
item.y = packUnorm4x8(modulator);
item.z = floatBitsToUint(gl_FragCoord.z);
item.w = int(gl_SampleMaskIn[0]);
imageStore(list_buffer, int(index), item);
frag_color = modulator;
color = frag_color;
}
List resolve :
#version 420 core
// The per-pixel image containing the head pointers
layout (binding = 0, r32ui) uniform uimage2D head_pointer_image;
// Buffer containing linked lists of fragments
layout (binding = 1, rgba32ui) uniform uimageBuffer list_buffer;
// This is the output color
layout (location = 0) out vec4 color;
// This is the maximum number of overlapping fragments allowed
#define MAX_FRAGMENTS 40
// Temporary array used for sorting fragments
uvec4 fragment_list[MAX_FRAGMENTS];
void main(void)
{
uint current_index;
uint fragment_count = 0;
current_index = imageLoad(head_pointer_image, ivec2(gl_FragCoord).xy).x;
while (current_index != 0 && fragment_count < MAX_FRAGMENTS )
{
uvec4 fragment = imageLoad(list_buffer, int(current_index));
int coverage = int(fragment.w);
//if((coverage &(1 << gl_SampleID))!=0) {
fragment_list[fragment_count] = fragment;
current_index = fragment.x;
//}
fragment_count++;
}
uint i, j;
if (fragment_count > 1)
{
for (i = 0; i < fragment_count - 1; i++)
{
for (j = i + 1; j < fragment_count; j++)
{
uvec4 fragment1 = fragment_list[i];
uvec4 fragment2 = fragment_list[j];
float depth1 = uintBitsToFloat(fragment1.z);
float depth2 = uintBitsToFloat(fragment2.z);
if (depth1 < depth2)
{
fragment_list[i] = fragment2;
fragment_list[j] = fragment1;
}
}
}
}
vec4 final_color = vec4(0,0,0,0);
for (i = 0; i < fragment_count; i++)
{
vec4 modulator = unpackUnorm4x8(fragment_list[i].y);
final_color = mix(final_color, modulator, modulator.a);
}
color = final_color;
}
Without knowing how your code actually works, you can do it very much the same way that your linked DX11 demo does, since OpenGL provides the same features needed.
So in the first shader that just stores all the rendered fragments, you also store the sample coverage mask for each fragment (along with the color and depth, of course). This is given as fragment shader input variable int gl_SampleMaskIn[] and for each sample with id 32*i+j, bit j of glSampleMaskIn[i] is set if the fragment covers that sample (since you probably won't use >32xMSAA, you can usually just use glSampleMaskIn[0] and only need to store a single int as coverage mask).
...
fragment.color = inColor;
fragment.depth = gl_FragCoord.z;
fragment.coverage = gl_SampleMaskIn[0];
...
Then the final sort and render shader is run for each sample instead of just for each fragment. This is achieved implicitly by making use of the input variable int gl_SampleID, which gives us the ID of the current sample. So what we do in this shader (in addition to the non-MSAA version) is that the sorting step just accounts for the sample, by only adding a fragment to the final (to be sorted) fragment list if the current sample is actually covered by this fragment:
What was something like (beware, pseudocode extrapolated from your small snippet and the DX-link):
while(fragment.next != 0xFFFFFFFF)
{
fragment_list[count++] = vec2(fragment.depth, fragment.color);
fragment = fragments[fragment.next];
}
is now
while(fragment.next != 0xFFFFFFFF)
{
if(fragment.coverage & (1 << gl_SampleID))
fragment_list[count++] = vec2(fragment.depth, fragment.color);
fragment = fragments[fragment.next];
}
Or something along those lines.
EDIT: To your updated code, you have to increment fragment_count only inside the if(covered) block, since we don't want to add the fragment to the list if the sample is not covered. Incrementing it always will likely result in the artifacts you see at the edges, which are the regions where the MSAA (and thus the coverage) comes into play.
On the other hand the list pointer has to be forwarded (current_index = fragment.x) in each loop iteration and not only if the sample is covered, as otherwise it can result in an infinite loop, like in your case. So your code should look like:
while (current_index != 0 && fragment_count < MAX_FRAGMENTS )
{
uvec4 fragment = imageLoad(list_buffer, int(current_index));
uint coverage = fragment.w;
if((coverage &(1 << gl_SampleID))!=0)
fragment_list[fragment_count++] = fragment;
current_index = fragment.x;
}
The OpenGL 4.3 Spec states in 7.1 about the gl_SampleID builtin variable:
Any static use of this variable in a fragment shader causes the entire shader to be evaluated per-sample.
(This has already been the case in the ARB_sample_shading and is also the case for gl_SamplePosition or a custom variable declared with the sample qualifier)
Therefore it is quite automatic, because you will probably need the SampleID anyway.