OpenGL buffer problem when adding >= 2^16 numbers - opengl

I'm facing some strange difficulties with OpenGL buffer. I tried to shrunk the problem to the minimum source code, so I created program that increment each number of the FloatBuffer in each iteration. When I am adding less than 2^16 float numbers to the FloatBuffer, everything works just fine, but when I add >= 2^16 numbers, then the numbers are not incrementing and stays the same in each iteration.
Renderer:
public class Renderer extends AbstractRenderer {
int computeShaderProgram;
int[] locBuffer = new int[2];
FloatBuffer data;
int numbersCount = 65_536, round = 0; // 65_535 - OK, 65_536 - wrong
#Override
public void init() {
computeShaderProgram = ShaderUtils.loadProgram(null, null, null, null, null,
"/main/computeBuffer");
glGenBuffers(locBuffer);
// dataSizeInBytes = count of numbers to sort * (float=4B + padding=3*4B)
int dataSizeInBytes = numbersCount * (1 + 3) * 4;
data = ByteBuffer.allocateDirect(dataSizeInBytes)
.order(ByteOrder.nativeOrder())
.asFloatBuffer();
initBuffer();
printBuffer(data);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, locBuffer[0]);
glBufferData(GL_SHADER_STORAGE_BUFFER, data, GL_DYNAMIC_DRAW);
glShaderStorageBlockBinding(computeShaderProgram, 0, 0);
glViewport(0, 0, width, height);
}
private void initBuffer() {
data.rewind();
Random r = new Random();
for (int i = 0; i < numbersCount; i++) {
data.put(i*4, r.nextFloat());
}
}
#Override
public void display() {
if (round < 5) {
glUseProgram(computeShaderProgram);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, locBuffer[0]);
glDispatchCompute(numbersCount, 1, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, data);
printBuffer(data);
round++;
}
}
...
}
Compute buffer
#version 430
#extension GL_ARB_compute_shader : enable
#extension GL_ARB_shader_storage_buffer_object : enable
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0) buffer Input {
float elements[];
}input_data;
void main () {
input_data.elements[gl_WorkGroupID.x ] = input_data.elements[gl_WorkGroupID.x] + 1;
}

glDispatchCompute(numbersCount, 1, 1);
You must not dispatch a compute shader workgroup count exceeding the corresponding GL_MAX_GL_MAX_COMPUTE_WORK_GROUP_COUNT for each dimension. The spec guarantees that limit to be at least 65535, so it is very likely that you just exceed the limit on your implementation. Actually, you should be getting a GL_INVALID_VALUE error for that call, and you should really consider using a debug context and debug message callback to have such obvious errors easily spotted during development.

Related

Weird compute shader latency

I'm trying to make frustrum culling via compute shader. For that I have a pair of buffers for instanced vertex attributes, and a pair of buffers for indirect draw commands. My compute shader checks if instance coordinates from first buffer are within bounding volume, referencing first draw buffer for counts, subgroupBallot and bitCount to see offset within subgroup, then add results from other subgroups and a global offset, and finally stores the result in second buffer. The global offset is stored in second indirect draw buffer.
The problem is that, when under load, frustum may be few(>1) frames late to the moving camera, with wide lines of disappeared objects on edge. It seems weird to me because culling and rendering are done within same command buffer.
When taking capture in renderdoc, taking a screenshot alt+printScreen, or pausing the render-present thread, things snap back to as they should be.
My only guess is that compute shader from past frame continues to execute even when new frame starts to be drawn, though this should not be happening due to pipeline barriers.
Shader code:
#version 460
#extension GL_KHR_shader_subgroup_ballot : require
struct drawData{
uint indexCount;
uint instanceCount;
uint firstIndex;
uint vertexOffset;
uint firstInstance;
};
struct instanceData{
float x, y, z;
float a, b, c, d;
};
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(set = 0, binding = 0) uniform A
{
mat4 cam;
vec4 camPos;
vec4 l;
vec4 t;
vec4 r;
vec4 b;
};
layout(set = 0, binding = 1) buffer B
{
uint count;
drawData data[];
} Draw[2];
layout(set = 0, binding = 2) buffer C
{
instanceData data[];
} Instance[2];
shared uint offsetsM[32];
void main()
{
const uint gID = gl_LocalInvocationID.x;
const uint lID = gl_SubgroupInvocationID;
const uint patchSize = gl_WorkGroupSize.x;
Draw[1].data[0] = Draw[0].data[0];//copy data like index count
Draw[1].count = Draw[0].count;
uint offsetG = 0;//accumulating offset within end buffer
uint loops = Draw[0].data[0].instanceCount/patchSize;//constant loop count
for(uint i = 0; i<loops;++i){
uint posa = i*patchSize+gID;//runs better this way for some reason
vec3 pos = camPos.xyz-vec3(Instance[0].data[posa].x, Instance[0].data[posa].y, Instance[0].data[posa].z);//position relative to camera
mat4x3 lrtb = mat4x3(l.xyz, r.xyz, t.xyz, b.xyz);
vec4 dist = pos*lrtb+Model.data[0].rad;//dot products and radius tolerance
bool Pass = posa<Draw[0].data[0].instanceCount&&//is real
(dot(pos, pos)<l.w*l.w) &&//not too far
all(greaterThan(dist, vec4(0))); //within view frustum
subgroupBarrier();//no idea what is the best, put what works
uvec4 actives = subgroupBallot(Pass);//count passed instances
if(subgroupElect())
offsetsM[gl_SubgroupID] = bitCount(actives).x+bitCount(actives).y;
barrier();
uint offsetL = bitCount(actives&gl_SubgroupLtMask).x+bitCount(actives&gl_SubgroupLtMask).y;//offset withing subgroup
uint ii = 0;
if(Pass){
for(; ii<gl_SubgroupID; ++ii)
offsetG+= offsetsM[ii];//offsets before subgroup
Instance[1].data[offsetG+offsetL] = Instance[0].data[posa];
for(; ii<gl_NumSubgroups; ++ii)
offsetG+= offsetsM[ii];}//offsets after subgroup
else for(; ii<gl_NumSubgroups; ++ii)
offsetG+= offsetsM[ii];//same but no data copying
}
if(gID == 0)
Draw[1].data[0].instanceCount = offsetG;
}
For renderpass after the compute I have dependencies:
{//1
deps[1].srcSubpass = VK_SUBPASS_EXTERNAL;
deps[1].dstSubpass = 0;
deps[1].srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
deps[1].dstStageMask = VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT;
deps[1].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
deps[1].dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
deps[1].dependencyFlags = 0;
}
{//2
deps[2].srcSubpass = VK_SUBPASS_EXTERNAL;
deps[2].dstSubpass = 0;
deps[2].srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
deps[2].dstStageMask = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
deps[2].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
deps[2].dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
deps[2].dependencyFlags = 0;
}
The command buffer is(fully reused as is, one for each image in swapchain):
vkBeginCommandBuffer(cmd, &begInfo);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layoutsPipe[1],
0, 1, &descs[1], 0, 0);
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipes[1]);
vkCmdDispatch(cmd, 1, 1, 1);
VkBufferMemoryBarrier bufMemBar[2];
{//mem bars
{//0 indirect
bufMemBar[0].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
bufMemBar[0].dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
bufMemBar[0].buffer = bufferIndirect;
bufMemBar[0].offset = 0;
bufMemBar[0].size = -1;
}
{//1 vertex instance
bufMemBar[1].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
bufMemBar[1].dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
bufMemBar[1].buffer = bufferInstance;
bufMemBar[1].offset = 0;
bufMemBar[1].size = -1;
}
}
vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, 0, 0, 0, 1, &bufMemBar[0], 0, 0);
vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
VK_PIPELINE_STAGE_VERTEX_INPUT_BIT , 0, 0, 0, 1, &bufMemBar[1], 0, 0);
VkRenderPassBeginInfo passBegInfo;
passBegInfo.renderPass = pass;
passBegInfo.framebuffer = chain.frames[i];
passBegInfo.renderArea = {{0, 0}, chain.dim};
VkClearValue clears[2]{{0},{0}};
passBegInfo.clearValueCount = 2;
passBegInfo.pClearValues = clears;
vkCmdBeginRenderPass(cmd, &passBegInfo, VK_SUBPASS_CONTENTS_INLINE);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layoutsPipe[0], 0, 1, &descs[0], 0, 0);
vkCmdBindPipeline (cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipes[0]);
VkBuffer buffersVertex[2]{bufferVertexProto, bufferInstance};
VkDeviceSize offsetsVertex[2]{0, 0};
vkCmdBindVertexBuffers(cmd, 0, 2, buffersVertex, offsetsVertex);
vkCmdBindIndexBuffer (cmd, bufferIndex, 0, VK_INDEX_TYPE_UINT32);
vkCmdDrawIndexedIndirectCount(cmd, bufferIndirect, 0+4,
bufferIndirect, 0,
count.maxDraws, sizeof(VkDrawIndexedIndirectCommand));
vkCmdEndRenderPass(cmd);
vkEndCommandBuffer(cmd);
Rendering and presentation are synchronised with two semaphores - imageAvailable, and renderFinished. Frustum calculation is in right order on CPU. Validation layers are enabled.
The problem was that I lacked host synchronisation. Indeed, even within same command buffer, there are no host synchronisation guarantees (and that makes sense, since it enables us to use events).

Adding an extra UBO to a vulkan pipeline stops all geometry rendering

I've followed the tutorial at www.vulkan-tutorial.com and I'm trying to split the Uniform buffer into 2 seperate buffers, one for View and Projection and one for Model. I've found however once I add another buffer to the layout, even if my shaders don't use it's content, no geometry is rendered. I don't get anything from the validation layers.
I've found that if the two UBOs are the same buffer, I have no problem. But if I assign them to different buffers, nothing appears on the screen. Have added descriptor set generation code.
Here's my layout generation code. All values are submitted correctly, bindings are 0, 1 and 2 respectively and this is reflected in shader code. I'm currently not even using the data in the buffer in the shader - so it's got nothing to do with the data I'm actually putting in the buffer.
Edit: Have opened up in RenderDoc. Without the extra buffer, I can see the normal VP buffer and it's values. They look fine. If I add in the extra buffer, it does not show up, but also the data from the first buffer is all zeroes.
Descriptor Set Layout generation:
std::vector<VkDescriptorSetLayoutBinding> layoutBindings;
/*
newShader->features includes 3 "features", with bindings 0,1,2.
They are - uniform buffer, uniform buffer, sampler
vertex bit, vertex bit, fragment bit
*/
for (auto a : newShader->features)
{
VkDescriptorSetLayoutBinding newBinding = {};
newBinding.descriptorType = (VkDescriptorType)layoutBindingDescriptorType(a.featureType);
newBinding.binding = a.binding;
newBinding.stageFlags = (VkShaderStageFlags)layoutBindingStageFlag(a.stage);
newBinding.descriptorCount = 1;
newBinding.pImmutableSamplers = nullptr;
layoutBindings.push_back(newBinding);
}
VkDescriptorSetLayoutCreateInfo layoutCreateInfo = {};
layoutCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutCreateInfo.bindingCount = static_cast<uint32_t>(layoutBindings.size());
layoutCreateInfo.pBindings = layoutBindings.data();
Descriptor Set Generation:
//Create a list of layouts
std::vector<VkDescriptorSetLayout> layouts(swapChainImages.size(), voa->shaderPipeline->shaderSetLayout);
//Allocate room for the descriptors
VkDescriptorSetAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
allocInfo.descriptorPool = voa->shaderPipeline->descriptorPool;
allocInfo.descriptorSetCount = static_cast<uint32_t>(swapChainImages.size());
allocInfo.pSetLayouts = layouts.data();
voa->descriptorSets.resize(swapChainImages.size());
if (vkAllocateDescriptorSets(vdi->device, &allocInfo, voa->descriptorSets.data()) != VK_SUCCESS) {
throw std::runtime_error("failed to allocate descriptor sets!");
}
//For each set of commandBuffers (frames in flight +1)
for (size_t i = 0; i < swapChainImages.size(); i++) {
std::vector<VkWriteDescriptorSet> descriptorWrites;
//Buffer Info construction
for (auto a : voa->renderComponent->getMaterial()->shader->features)
{
//Create a new descriptor write
uint32_t index = descriptorWrites.size();
descriptorWrites.push_back({});
descriptorWrites[index].dstBinding = a.binding;
if (a.featureType == HE2_SHADER_FEATURE_TYPE_UNIFORM_BLOCK)
{
VkDescriptorBufferInfo bufferInfo = {};
if (a.bufferSource == HE2_SHADER_BUFFER_SOURCE_VIEW_PROJECTION_BUFFER)
{
bufferInfo.buffer = viewProjectionBuffers[i];
bufferInfo.offset = 0;
bufferInfo.range = sizeof(ViewProjectionBuffer);
}
else if (a.bufferSource == HE2_SHADER_BUFFER_SOURCE_MODEL_BUFFER)
{
bufferInfo.buffer = modelBuffers[i];
bufferInfo.offset = voa->ID * sizeof(ModelBuffer);
bufferInfo.range = sizeof(ModelBuffer);
}
//The following is the same for all Uniform buffers
descriptorWrites[index].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
descriptorWrites[index].dstSet = voa->descriptorSets[i];
descriptorWrites[index].dstArrayElement = 0;
descriptorWrites[index].descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
descriptorWrites[index].descriptorCount = 1;
descriptorWrites[index].pBufferInfo = &bufferInfo;
}
else if (a.featureType == HE2_SHADER_FEATURE_TYPE_SAMPLER2D)
{
VulkanImageReference ref = VulkanTextures::images[a.imageHandle];
VkDescriptorImageInfo imageInfo = {};
imageInfo.imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
imageInfo.imageView = ref.imageView;
imageInfo.sampler = defaultSampler;
descriptorWrites[index].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
descriptorWrites[index].dstSet = voa->descriptorSets[i];
descriptorWrites[index].dstArrayElement = 0;
descriptorWrites[index].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
descriptorWrites[index].descriptorCount = 1;
descriptorWrites[index].pImageInfo = &imageInfo;
}
else
{
throw std::runtime_error("Unsupported feature type present in shader");
}
}
vkUpdateDescriptorSets(vdi->device, static_cast<uint32_t>(descriptorWrites.size()), descriptorWrites.data(), 0, nullptr);
}
Edit: Here is descriptor set binding code
vkCmdBeginRenderPass(commandBuffers[i], &renderPassInfo, VK_SUBPASS_CONTENTS_INLINE);
//Very temporary Render loop. Binds every frame, very clumsy
for (int j = 0; j < max; j++)
{
VulkanObjectAttachment* voa = objectAttachments[j];
VulkanModelAttachment* vma = voa->renderComponent->getModel()->getComponent<VulkanModelAttachment>();
if (vma->indices == 0) continue;
vkCmdBindPipeline(commandBuffers[i], VK_PIPELINE_BIND_POINT_GRAPHICS, voa->shaderPipeline->pipeline);
VkBuffer vertexBuffers[] = { vma->vertexBuffer };
VkDeviceSize offsets[] = { 0 };
vkCmdBindVertexBuffers(commandBuffers[i], 0, 1, vertexBuffers, offsets);
vkCmdBindIndexBuffer(commandBuffers[i], vma->indexBuffer, 0, VK_INDEX_TYPE_UINT32);
vkCmdBindDescriptorSets(commandBuffers[i], VK_PIPELINE_BIND_POINT_GRAPHICS, voa->shaderPipeline->pipelineLayout, 0, 1, &voa->descriptorSets[i], 0, nullptr);
vkCmdDrawIndexed(commandBuffers[i], static_cast<uint32_t>(vma->indices), 1, 0, 0, 0);
}
vkCmdEndRenderPass(commandBuffers[i]);
Buffer updating code:
ViewProjectionBuffer ubo = {};
ubo.view = HE2_Camera::main->getCameraMatrix();
ubo.proj = HE2_Camera::main->getProjectionMatrix();
ubo.proj[1][1] *= -1;
ubo.model = a->object->getModelMatrix();
void* data;
vmaMapMemory(allocator, a->mvpAllocations[i], &data);
memcpy(data, &ubo, sizeof(ubo));
vmaUnmapMemory(allocator, a->mvpAllocations[i]);
}
std::vector<ModelBuffer> modelBuffersData;
for (VulkanObjectAttachment* voa : objectAttachments)
{
ModelBuffer mb = {};
mb.model = voa->object->getModelMatrix();
modelBuffersData.push_back(mb);
void* data;
vmaMapMemory(allocator, modelBuffersAllocation[i], &data);
memcpy(data, &modelBuffersData, sizeof(ModelBuffer) * modelBuffersData.size());
vmaUnmapMemory(allocator, modelBuffersAllocation[i]);
I found the problem - not a Vulkan issue but a C++ syntax one sadly. I'll explain it anyway but likely to not be your issue if you're visiting this page in the future.
I generate my descriptor writes in a loop. They're stored in a vector and then updated at the end of the loop
std::vector<VkDescriptorWrite> descriptorWrites;
for(int i = 0; i < shader.features.size); i++)
{
//Various stuff to the descriptor write
}
vkUpdateDescriptorSets(vdi->device, static_cast<uint32_t>(descriptorWrites.size()), descriptorWrites.data(), 0, nullptr);
One parameter of the descriptor write is pImageInfo or pBufferInfo. These point to a struct that contains specific data for that buffer or image. I filled these in within the loop
{//Within the loop above
//...
VkDescriptorBufferInfo bufferInfo = {};
bufferInfo.buffer = myBuffer;
descriptorWrites[i].pBufferInfo = &bufferInfo;
//...
}
Because these are passed by reference, not value, the descriptorWrite when being updated refers to the data in the original struct. But because the original struct was made in a loop, and the vkUpdateDescriptors line is outside of the loop, by the time that struct is read it's out of scope and deleted.
While this should result in undefined behaviour, I can only imagine because there's no new variables between the end of the loop and the update call, the memory still read the contents of the last descriptorWrite in the loop. So all descriptors read that memory, and had the resources from the last descriptorWrite pushed to them. Fixed it all just by putting the VkDescriptorBufferInfos in a vector of their own at the start of the loop.
It looks to me like the offset you're setting here is causing the VkWriteDescriptorSet to read overflow memory:
else if (a.bufferSource == HE2_SHADER_BUFFER_SOURCE_MODEL_BUFFER)
{
bufferInfo.buffer = modelBuffers[i];
bufferInfo.offset = voa->ID * sizeof(ModelBuffer);
bufferInfo.range = sizeof(ModelBuffer);
}
If you were only updating part of a buffer every frame, you'd do something like this:
bufferInfo.buffer = mvpBuffer[i];
bufferInfo.offset = sizeof(mat4[]{viewMat, projMat});
bufferInfo.range = sizeof(modelMat);
If you place the model in another buffer, you probably want to create a different binding for your descriptor set and your bufferInfo for your model data would look like this:
bufferInfo.buffer = modelBuffer[i];
bufferInfo.offset = 0;
bufferInfo.range = sizeof(modelMat);

DirectX - Writing to 3D Texture Causing Display Driver Failure

I'm testing writing to 2D and 3D textures in compute shaders, outputting a gradient noise texture consisting of 32 bit floats. Writing to a 2D texture works fine, but writing to a 3D texture isn't. Are there additional considerations that need to be made when creating a 3D texture when compared to a 2D texture?
Code of how I'm defining the 3D texture below:
HRESULT BaseComputeShader::CreateTexture3D(UINT width, UINT height, UINT depth, DXGI_FORMAT format, ID3D11Texture3D** texture)
{
D3D11_TEXTURE3D_DESC textureDesc;
ZeroMemory(&textureDesc, sizeof(textureDesc));
textureDesc.Width = width;
textureDesc.Height = height;
textureDesc.Depth = depth;
textureDesc.MipLevels = 1;
textureDesc.Format = format;
textureDesc.Usage = D3D11_USAGE_DEFAULT;
textureDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
textureDesc.CPUAccessFlags = 0;
textureDesc.MiscFlags = 0;
return renderer->CreateTexture3D(&textureDesc, 0, texture);
}
HRESULT BaseComputeShader::CreateTexture3DUAV(UINT depth, DXGI_FORMAT format, ID3D11Texture3D** texture, ID3D11UnorderedAccessView** unorderedAccessView)
{
D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
ZeroMemory(&uavDesc, sizeof(uavDesc));
uavDesc.Format = format;
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_TEXTURE3D;
uavDesc.Texture3D.MipSlice = 0;
uavDesc.Texture3D.FirstWSlice = 0;
uavDesc.Texture3D.WSize = depth;
return renderer->CreateUnorderedAccessView(*texture, &uavDesc, unorderedAccessView);
}
HRESULT BaseComputeShader::CreateTexture3DSRV(DXGI_FORMAT format, ID3D11Texture3D** texture, ID3D11ShaderResourceView** shaderResourceView)
{
D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc;
ZeroMemory(&srvDesc, sizeof(srvDesc));
srvDesc.Format = format;
srvDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE3D;
srvDesc.Texture3D.MostDetailedMip = 0;
srvDesc.Texture3D.MipLevels = 1;
return renderer->CreateShaderResourceView(*texture, &srvDesc, shaderResourceView);
}
And how I'm writing to it in the compute shader:
// The texture we're writing to
RWTexture3D<float> outputTexture : register(u0);
[numthreads(8, 8, 8)]
void main(uint3 DTid : SV_DispatchThreadID)
{
float noiseValue = 0.0f;
float value = 0.0f;
float localAmplitude = amplitude;
float localFrequency = frequency;
// Loop for the number of octaves, running the noise function as many times as desired (8 is usually sufficient)
for (int k = 0; k < octaves; k++)
{
noiseValue = noise(float3(DTid.x * localFrequency, DTid.y * localFrequency, DTid.z * localFrequency)) * localAmplitude;
value += noiseValue;
// Calculate a new amplitude based on the input persistence/gain value
// amplitudeLoop will get smaller as the number of layers (i.e. k) increases
localAmplitude *= persistence;
// Calculate a new frequency based on a lacunarity value of 2.0
// This gives us 2^k as the frequency
// i.e. Frequency at k = 4 will be f * 2^4 as we have looped 4 times
localFrequency *= 2.0f;
}
// Output value to 2D index in the texture provided by thread indexing
outputTexture[DTid.xyz] = value;
}
And finally, how I'm running the shader:
// Set the shader
deviceContext->CSSetShader(computeShader, nullptr, 0);
// Set the shader's buffers and views
deviceContext->CSSetConstantBuffers(0, 1, &cBuffer);
deviceContext->CSSetUnorderedAccessViews(0, 1, &textureUAV, nullptr);
// Launch the shader
deviceContext->Dispatch(512, 512, 512);
// Reset the shader now we're done
deviceContext->CSSetShader(nullptr, nullptr, 0);
// Reset the shader views
ID3D11UnorderedAccessView* ppUAViewnullptr[1] = { nullptr };
deviceContext->CSSetUnorderedAccessViews(0, 1, ppUAViewnullptr, nullptr);
// Create the shader resource view for access in other shaders
HRESULT result = CreateTexture3DSRV(DXGI_FORMAT_R32_FLOAT, &texture, &textureSRV);
if (result != S_OK)
{
MessageBox(NULL, L"Failed to create texture SRV after compute shader execution", L"Failed", MB_OK);
exit(0);
}
My bad, simple mistake. Compute shader threads are limited in number. In the compute shader you're limited to a total of 1024 threads, and the dispatch call cannot dispatch more than 65535 thread groups. The HLSL compiler will catch the former issue, but the Visual C++ compiler will not catch the latter issue.
If you create a texture of 512 * 512 * 512 (which seems what you are trying to achieve), your dispatch needs to be divided by groups:
deviceContext->Dispatch(512 / 8, 512 / 8, 512 / 8);
In your previous case, the dispatch was :
512*8 * 512*8 * 512*8 = 68719476736 units
Which very likely triggered the time out detection and crashes the driver
Also the limit of 65535 is per dimension, so in your case you are completely safe to run this.
And last one, you can create both shader resource view and unordered view right after creating your 3d texture (before the dispatch call).
This is generally recommended to avoid mixing context code and resource creation code.
On resource creation, your check is not valid either :
if (result != S_OK)
HRESULT success condition is >= 0
you can use the built in macro instead eg :
if (SUCCEEDED(result))

Find the maximum float in the array

I have a compute shader program which looks for the maximum value in the float array. it uses reduction (compare two values and save the bigger one to the output buffer).
Now I am not quite sure how to run this program from the Java code (using jogamp). In the display() method I run the program once (every time with the halved array in the input SSBO = result from previous iteration) and finish this when the array with results has only one item - the maximum.
Is this the correct method? Every time in the display() method creating and binding input and output SSBO, running the shader program and then check how many items was returned?
Java code:
FloatBuffer inBuffer = Buffers.newDirectFloatBuffer(array);
gl.glBindBuffer(GL3ES3.GL_SHADER_STORAGE_BUFFER, buffersNames.get(1));
gl.glBufferData(GL3ES3.GL_SHADER_STORAGE_BUFFER, itemsCount * Buffers.SIZEOF_FLOAT, inBuffer,
GL3ES3.GL_STREAM_DRAW);
gl.glBindBufferBase(GL3ES3.GL_SHADER_STORAGE_BUFFER, 1, buffersNames.get(1));
gl.glDispatchComputeGroupSizeARB(groupsCount, 1, 1, groupSize, 1, 1);
gl.glMemoryBarrier(GL3ES3.GL_SHADER_STORAGE_BARRIER_BIT);
ByteBuffer output = gl.glMapNamedBuffer(buffersNames.get(1), GL3ES3.GL_READ_ONLY);
Shader code:
#version 430
#extension GL_ARB_compute_variable_group_size : enable
layout (local_size_variable) in;
layout(std430, binding = 1) buffer MyData {
vec4 elements[];
} data;
void main() {
uint index = gl_GlobalInvocationID.x;
float n1 = data.elements[index].x;
float n2 = data.elements[index].y;
float n3 = data.elements[index].z;
float n4 = data.elements[index].w;
data.elements[index].x = max(max(n1, n2), max(n3, n4));
}

Compute Shader - gl_GlobalInvocationID and local_size

While trying to implement a naive Compute Shader that assigns affecting lights to a cluster, i have encountered an unexpected(well for a noob like me) behavior:
I invoke this shader with glDispatchCompute(32, 32, 32); and it supposed to write a [light counter + 8 indices] for each invocation into "indices" buffer. But while debugging, i found that my writes into that buffer overlap between invocations even though I use unique clusterId. I detect it by values of indices[outIndexStart] going over 8 and visual flickering.
According to documentation, gl_GlobalInvocationID is gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID. But if set all local sizes to 1, write issues go away. Why local_size affects this code in such a way? And how can i reason about choosing it's value here?
#version 430
layout (local_size_x = 4, local_size_y = 4, local_size_z = 4) in;
uniform int lightCount;
const unsigned int clusterSize = 32;
const unsigned int clusterSquared = clusterSize * clusterSize;
struct LightInfo {
vec4 color;
vec3 position;
float radius;
};
layout(std430, binding = 0) buffer occupancyGrid {
int exists[];
};
layout(std430, binding = 2) buffer lightInfos
{
LightInfo lights [];
};
layout(std430, binding = 1) buffer outputList {
int indices[];
};
void main(){
unsigned int clusterId = gl_GlobalInvocationID.x + gl_GlobalInvocationID.y * clusterSize + gl_GlobalInvocationID.z * clusterSquared;
if(exists[clusterId] == 0)
return;
//... not so relevant calculations
unsigned int outIndexStart = clusterId * 9;
unsigned int outOffset = 1;
for(int i = 0; i < lightCount && outOffset < 9; i++){
if(distance(lights[i].position, wordSpace.xyz) < lights[i].radius) {
indices[outIndexStart + outOffset] = i;
indices[outIndexStart]++;
outOffset++;
}
}
}
Let's look at two declarations:
layout (local_size_x = 4, local_size_y = 4, local_size_z = 4) in;
and
const unsigned int clusterSize = 32;
These say different things. The local_size declaration says that each work group will have 4x4x4 invocations, which is 64. By contrast, your clusterSize says that each work group will only have 32 invocations.
If you want to fix this problem, use the actual local size constant provided by the system:
const unsigned int clusterSize = gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z;
And you can even do this:
const uvec3 linearizeInvocation = uvec3{1, clusterSize, clusterSize * clusterSize};
...
unsigned int clusterId = dot(gl_GlobalInvocationID, linearizeInvocation);