Constant buffer members access the same memory - c++

I'm using a constant buffer to pass data to my shaders at every frame, and I'm running into an issue where the values of some of the members of the buffer point to the same memory.
When I use the Visual Studio 2012 debugging tools, it looks like the data is being set in the buffer more or less correctly:
0 [0x00000000-0x00000003] | +0
1 [0x00000004-0x00000007] | +1
2 [0x00000008-0x0000000b] | +1
3 [0x0000000c-0x0000000f] | +1
4 [0x00000010-0x00000013] | +0.78539819
5 [0x00000014-0x00000017] | +1.1760513
6 [0x00000018-0x0000001b] | +0
7 [0x0000001c-0x0000001f] | +1
The problem is that when I debug the shader, the sunAngle and phaseFunction both have the same value - specifically 0.78539819, which should be the value of sunAngle only. It does change to 1.1760513 if I swap the order of the two floats, but both will still be the same. I thought I'd packed everything together correctly, but am I missing how to define exactly what constants are in each part of the buffer?
Here's the C++ structure I'm using:
struct SunData {
DirectX::XMFLOAT4 sunPosition;
float sunAngle;
float phaseFunctionResult;
};
And the shader buffer looks like this:
// updated as the sun moves through the sky
cbuffer sunDependent : register( b1 )
{
    float4 sunPosition;
    float sunAngle; // theta
    float phaseFunctionResult; // F( theta, g )
};
Here's the code I'm using to initialize the buffer:
XMVECTOR pos = XMVectorSet( 0, 1, 1, 1 );
XMStoreFloat3( &_sunPosition, pos );
XMStoreFloat4( &_sun.sunPosition, pos );
_sun.sunAngle = XMVectorGetX(
    XMVector3AngleBetweenVectors( pos, XMVectorSet( 0, 1, 0, 0 ) )
);
_sun.phaseFunctionResult = _planet.phaseFunction( _sun.sunAngle );
// Fill in a buffer description.
cbDesc.ByteWidth = sizeof( SunData ) + 8;
cbDesc.Usage = D3D11_USAGE_DYNAMIC;
cbDesc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
cbDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
cbDesc.MiscFlags = 0;
cbDesc.StructureByteStride = 0;
// Fill in the subresource data.
data.pSysMem = &_sun;
data.SysMemPitch = 0;
data.SysMemSlicePitch = 0;
// Create the buffer.
ID3D11Buffer *constantBuffer = nullptr;
HRESULT hr = _d3dDevice->CreateBuffer( &cbDesc, &data, &constantBuffer );
assert( SUCCEEDED( hr ) );
// Set the buffer.
_d3dDeviceContext->VSSetConstantBuffers( 1, 1, &constantBuffer );
_d3dDeviceContext->PSSetConstantBuffers( 1, 1, &constantBuffer );
Release( constantBuffer );
And here's the pixel shader that's using the values:
float4 main( in ATMOS_PS_INPUT input ) : SV_TARGET
{
    float R = sunAngle * sunPosition.x * sunIntensity.x
        * attenuationCoefficient.x
        * phaseFunctionResult;
    return float4( R, 1, 1, 1 );
}

This looks like a padding issue, similar to the one in this question.
The size of every constant buffer should be divisible by sizeof(four-component vector), i.e. 16 bytes (doc).
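As a minimal sketch of that sizing rule, the C++ struct can carry explicit padding so that its size is already a multiple of 16 bytes, and the layout can be checked at compile time. `Float4`, `SunDataPadded` and `constantBufferByteWidth` are hypothetical names used only for illustration; `Float4` stands in for `DirectX::XMFLOAT4` so the snippet compiles without DirectXMath. This only illustrates the padding rule and is not confirmed to be the fix for the asker's exact symptom.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for DirectX::XMFLOAT4 (assumption: four tightly packed floats).
struct Float4 { float x, y, z, w; };

// CPU-side mirror of the sunDependent cbuffer, padded to whole 16-byte registers.
struct SunDataPadded {
    Float4 sunPosition;          // bytes 0..15
    float  sunAngle;             // bytes 16..19
    float  phaseFunctionResult;  // bytes 20..23
    float  padding[2];           // bytes 24..31, unused filler
};

// Compile-time checks on the layout the shader side is expected to see.
static_assert(sizeof(SunDataPadded) % 16 == 0, "size must be a multiple of 16 bytes");
static_assert(offsetof(SunDataPadded, sunAngle) == 16, "sunAngle should follow the float4");
static_assert(offsetof(SunDataPadded, phaseFunctionResult) == 20, "second float follows the first");

// Rounds an arbitrary struct size up to the next multiple of 16 bytes.
inline std::size_t constantBufferByteWidth(std::size_t structSize) {
    const std::size_t width = (structSize + 15) & ~static_cast<std::size_t>(15);
    assert(width % 16 == 0 && width >= structSize);
    return width;
}
```

With a struct like this, `sizeof( SunDataPadded )` could be used for the buffer's `ByteWidth` directly, instead of adding a manual `+ 8` to the size of the unpadded struct.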


Weird compute shader latency

I'm trying to implement frustum culling in a compute shader. For that I have a pair of buffers for instanced vertex attributes and a pair of buffers for indirect draw commands. My compute shader checks whether the instance coordinates from the first buffer are within the bounding volume, reading the counts from the first draw buffer. It uses subgroupBallot and bitCount to find each instance's offset within its subgroup, adds the results from the other subgroups and a global offset, and finally stores the result in the second buffer. The global offset is stored in the second indirect draw buffer.
The problem is that, under load, the frustum may lag a few (>1) frames behind the moving camera, leaving wide bands of missing objects at the edges. This seems weird to me because culling and rendering are done within the same command buffer.
When I take a capture in RenderDoc, take a screenshot with Alt+PrintScreen, or pause the render-present thread, things snap back to how they should be.
My only guess is that the compute shader from the previous frame continues to execute even after the new frame starts being drawn, though this should not happen because of the pipeline barriers.
Shader code:
#version 460
#extension GL_KHR_shader_subgroup_ballot : require
struct drawData{
    uint indexCount;
    uint instanceCount;
    uint firstIndex;
    uint vertexOffset;
    uint firstInstance;
};
struct instanceData{
    float x, y, z;
    float a, b, c, d;
};
layout(local_size_x = 128, local_size_y = 1, local_size_z = 1) in;
layout(set = 0, binding = 0) uniform A
{
    mat4 cam;
    vec4 camPos;
    vec4 l;
    vec4 t;
    vec4 r;
    vec4 b;
};
layout(set = 0, binding = 1) buffer B
{
    uint count;
    drawData data[];
} Draw[2];
layout(set = 0, binding = 2) buffer C
{
    instanceData data[];
} Instance[2];
shared uint offsetsM[32];
void main()
{
    const uint gID = gl_LocalInvocationID.x;
    const uint lID = gl_SubgroupInvocationID;
    const uint patchSize = gl_WorkGroupSize.x;
    Draw[1].data[0] = Draw[0].data[0];//copy data like index count
    Draw[1].count = Draw[0].count;
    uint offsetG = 0;//accumulating offset within end buffer
    uint loops = Draw[0].data[0].instanceCount/patchSize;//constant loop count
    for(uint i = 0; i<loops;++i){
        uint posa = i*patchSize+gID;//runs better this way for some reason
        vec3 pos =[0].data[posa].x, Instance[0].data[posa].y, Instance[0].data[posa].z);//position relative to camera
        mat4x3 lrtb = mat4x3(,,,;
        vec4 dist = pos*[0].rad;//dot products and radius tolerance
        bool Pass = posa<Draw[0].data[0].instanceCount&&//is real
            (dot(pos, pos)<l.w*l.w) &&//not too far
            all(greaterThan(dist, vec4(0))); //within view frustum
        subgroupBarrier();//no idea what is the best, put what works
        uvec4 actives = subgroupBallot(Pass);//count passed instances
        offsetsM[gl_SubgroupID] = bitCount(actives).x+bitCount(actives).y;
        uint offsetL = bitCount(actives&gl_SubgroupLtMask).x+bitCount(actives&gl_SubgroupLtMask).y;//offset within subgroup
        uint ii = 0;
        if(Pass){
            for(; ii<gl_SubgroupID; ++ii)
                offsetG+= offsetsM[ii];//offsets before subgroup
            Instance[1].data[offsetG+offsetL] = Instance[0].data[posa];
            for(; ii<gl_NumSubgroups; ++ii)
                offsetG+= offsetsM[ii];}//offsets after subgroup
        else for(; ii<gl_NumSubgroups; ++ii)
            offsetG+= offsetsM[ii];//same but no data copying
    }
    if(gID == 0)
        Draw[1].data[0].instanceCount = offsetG;
}
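To make the ballot-and-bitCount step easier to follow, here is a small CPU-side sketch of the same idea for a single subgroup of at most 32 lanes. It is only an illustration of the compaction technique, not the shader itself: `compactSubgroup` is a hypothetical helper, the ballot is a plain 32-bit mask, and the "lanes below me" mask plays the role of gl_SubgroupLtMask.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates ballot-based stream compaction for one subgroup (up to 32 lanes).
// pass[lane] says whether that lane's value survived the culling test.
std::vector<int> compactSubgroup(const std::vector<int>& values,
                                 const std::vector<bool>& pass) {
    assert(values.size() == pass.size() && values.size() <= 32);

    // Equivalent of subgroupBallot(Pass): one bit per passing lane.
    std::uint32_t ballot = 0;
    for (std::size_t lane = 0; lane < values.size(); ++lane)
        if (pass[lane]) ballot |= (std::uint32_t{1} << lane);

    // Total number of survivors is the population count of the ballot.
    std::vector<int> out(std::bitset<32>(ballot).count());

    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        if (!pass[lane]) continue;
        // Bits of all lanes strictly below this one (the "less than" mask).
        const std::uint32_t ltMask = (std::uint32_t{1} << lane) - 1;
        // Number of passing lanes below this one is this lane's output slot.
        const std::size_t offset = std::bitset<32>(ballot & ltMask).count();
        out[offset] = values[lane];
    }
    return out;
}
```

For example, values {10, 11, 12, 13} with pass flags {true, false, true, true} compact to {10, 12, 13}. In the shader, the per-subgroup totals in offsetsM and the running offsetG extend this to several subgroups.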
For the render pass that follows the compute dispatch, I have these dependencies:
deps[1].srcSubpass = VK_SUBPASS_EXTERNAL;
deps[1].dstSubpass = 0;
deps[1].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
deps[1].dependencyFlags = 0;
deps[2].srcSubpass = VK_SUBPASS_EXTERNAL;
deps[2].dstSubpass = 0;
deps[2].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
deps[2].dependencyFlags = 0;
The command buffer is as follows (fully reused as is, one for each image in the swapchain):
vkBeginCommandBuffer(cmd, &begInfo);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layoutsPipe[1],
0, 1, &descs[1], 0, 0);
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipes[1]);
vkCmdDispatch(cmd, 1, 1, 1);
VkBufferMemoryBarrier bufMemBar[2];
{//mem bars
    {//0 indirect
        bufMemBar[0].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
        bufMemBar[0].buffer = bufferIndirect;
        bufMemBar[0].offset = 0;
        bufMemBar[0].size = -1;
    }
    {//1 vertex instance
        bufMemBar[1].srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
        bufMemBar[1].buffer = bufferInstance;
        bufMemBar[1].offset = 0;
        bufMemBar[1].size = -1;
    }
}
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, 0, 0, 0, 1, &bufMemBar[0], 0, 0);
VK_PIPELINE_STAGE_VERTEX_INPUT_BIT , 0, 0, 0, 1, &bufMemBar[1], 0, 0);
VkRenderPassBeginInfo passBegInfo;
passBegInfo.renderPass = pass;
passBegInfo.framebuffer = chain.frames[i];
passBegInfo.renderArea = {{0, 0}, chain.dim};
VkClearValue clears[2]{{0},{0}};
passBegInfo.clearValueCount = 2;
passBegInfo.pClearValues = clears;
vkCmdBeginRenderPass(cmd, &passBegInfo, VK_SUBPASS_CONTENTS_INLINE);
vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layoutsPipe[0], 0, 1, &descs[0], 0, 0);
vkCmdBindPipeline (cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipes[0]);
VkBuffer buffersVertex[2]{bufferVertexProto, bufferInstance};
VkDeviceSize offsetsVertex[2]{0, 0};
vkCmdBindVertexBuffers(cmd, 0, 2, buffersVertex, offsetsVertex);
vkCmdBindIndexBuffer (cmd, bufferIndex, 0, VK_INDEX_TYPE_UINT32);
vkCmdDrawIndexedIndirectCount(cmd, bufferIndirect, 0+4,
bufferIndirect, 0,
count.maxDraws, sizeof(VkDrawIndexedIndirectCommand));
Rendering and presentation are synchronised with two semaphores - imageAvailable, and renderFinished. Frustum calculation is in right order on CPU. Validation layers are enabled.
The problem was that I lacked host synchronisation. Indeed, even within the same command buffer, there are no host synchronisation guarantees (and that makes sense, since it is what enables us to use events).

Loading non-power-of-two textures in Vulkan

My 2D texture loader works fine if my texture dimensions are power-of-two, but when they are not, the texture data displays as skewed. How do I fix this? I assume the issue has something to do with memory alignment and row pitch. Here are the relevant parts of my loader code:
VkMemoryRequirements memReqs;
vkGetImageMemoryRequirements( GfxDeviceGlobal::device, mappableImage, &memReqs );
VkMemoryAllocateInfo memAllocInfo = {};
memAllocInfo.pNext = nullptr;
memAllocInfo.memoryTypeIndex = 0;
memAllocInfo.allocationSize = memReqs.size;
GetMemoryType( memReqs.memoryTypeBits, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, &memAllocInfo.memoryTypeIndex );
VkDeviceMemory mappableMemory;
err = vkAllocateMemory( GfxDeviceGlobal::device, &memAllocInfo, nullptr, &mappableMemory );
CheckVulkanResult( err, "vkAllocateMemory in Texture2D" );
err = vkBindImageMemory( GfxDeviceGlobal::device, mappableImage, mappableMemory, 0 );
CheckVulkanResult( err, "vkBindImageMemory in Texture2D" );
VkImageSubresource subRes = {};
subRes.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
subRes.mipLevel = 0;
subRes.arrayLayer = 0;
VkSubresourceLayout subResLayout;
vkGetImageSubresourceLayout( GfxDeviceGlobal::device, mappableImage, &subRes, &subResLayout );
void* mapped;
err = vkMapMemory( GfxDeviceGlobal::device, mappableMemory, 0, memReqs.size, 0, &mapped );
CheckVulkanResult( err, "vkMapMemory in Texture2D" );
const int bytesPerPixel = 4;
std::size_t dataSize = bytesPerPixel * width * height;
std::memcpy( mapped, data, dataSize );
vkUnmapMemory( GfxDeviceGlobal::device, mappableMemory );
The VkSubresourceLayout that you obtained from vkGetImageSubresourceLayout contains the pitch of the texture in its rowPitch member. It is more than likely not equal to the width in bytes (bytesPerPixel * width), so when you memcpy the entire data block in one go, you copy image data into the padding at the end of each row.
Instead you will need to memcpy row-by-row, skipping the padding memory in the mapped texture:
const int bytesPerPixel = 4;
std::size_t dataRowSize = bytesPerPixel * width;
char* mappedBytes = (char*)mapped;
for(int i = 0; i < height; ++i)
{
    // Copy one tightly packed source row, then step over the destination padding.
    std::memcpy(mappedBytes, data, dataRowSize);
    mappedBytes += subResLayout.rowPitch;
    data += dataRowSize;
}
(this code assumes data is a char * as well - its declaration wasn't given)
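The same row-by-row copy can be tried without a Vulkan device. The sketch below uses a hypothetical helper, `copyRows`, that copies tightly packed source rows into a destination whose rows are `dstRowPitch` bytes apart; in the loader above, `dstRowPitch` would come from `subResLayout.rowPitch`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies `height` rows of `srcRowSize` bytes each from a tightly packed source
// into a destination whose consecutive rows start `dstRowPitch` bytes apart.
// Bytes between srcRowSize and dstRowPitch in each destination row are padding
// and are left untouched.
void copyRows(unsigned char* dst, std::size_t dstRowPitch,
              const unsigned char* src, std::size_t srcRowSize,
              std::size_t height) {
    assert(dstRowPitch >= srcRowSize);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * dstRowPitch, src + row * srcRowSize, srcRowSize);
    }
}
```

For a 3x2 RGBA image, `srcRowSize` is 12 bytes; with a row pitch of 16 bytes, the second row has to start at byte 16 of the mapped memory, not at byte 12, which is exactly the skew that a single memcpy produces.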

Different vertex/index formats in a single buffer

In my scene, I have several static models that never change. Some models only have float3 vertices and 16 bit indices, some are more complex, have colors and normals in vertices, and 32-bit indices.
Q1. Can I combine them all into a single vertex buffer and a single index buffer, and draw like this:
// Model #1 has only position and 16 bit indices
const int model1verts = 20, model1indices = 100;
UINT stride1 = sizeof( Vector3 ), offset1 = 0;
context->IASetVertexBuffers( 0, 1, &vb, &stride1, &offset1 );
context->IASetIndexBuffer( ib, DXGI_FORMAT_R16_UINT, 0 );
context->DrawIndexed( model1indices, 0, 0 );
// Set another shader + input layout here
// Model #2 has positions + normals, and 32 bit indices
const int model2verts = 200, model2indices = 500;
UINT stride2 = sizeof( Vector3 ) * 2, offset2 = stride1 * model1verts;
context->IASetVertexBuffers( 0, 1, &vb, &stride2, &offset2 );
context->IASetIndexBuffer( ib, DXGI_FORMAT_R32_UINT, model1indices * sizeof( uint16_t ) );
context->DrawIndexed( model2indices, 0, 0 );
// 20 more models to follow, all from the same buffers
Q2. AFAIK, GPUs love aligned data. When calling IASetVertexBuffers/IASetIndexBuffer, should those offsets be a multiple of 4 or 16 bytes? The documentation doesn't say.
Q3. Should I do that at all? Will this save resources compared to 20-100 smaller buffers, each with its own format?
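For reference, here is a sketch of how the byte offsets for such a shared buffer could be computed when each model's data has to start at an aligned position. `alignUp`, `ModelRange` and `packOffsets` are hypothetical helpers, and the alignment value is a parameter precisely because the required value is the open question in Q2; nothing here is taken from the Direct3D documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rounds `offset` up to a multiple of `alignment` (which must be a power of two).
inline std::size_t alignUp(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// One model's slice of a shared buffer: element size in bytes and element count.
struct ModelRange {
    std::size_t elementSize;
    std::size_t elementCount;
};

// Returns the starting byte offset of each range when the ranges are packed
// back to back and every start is rounded up to `alignment`.
std::vector<std::size_t> packOffsets(const std::vector<ModelRange>& ranges,
                                     std::size_t alignment) {
    std::vector<std::size_t> offsets;
    std::size_t cursor = 0;
    for (const ModelRange& range : ranges) {
        cursor = alignUp(cursor, alignment);
        offsets.push_back(cursor);
        cursor += range.elementSize * range.elementCount;
    }
    return offsets;
}
```

With the index data from the snippet above (100 16-bit indices followed by 500 32-bit indices), a 4-byte alignment keeps the second model at byte 200, while a 16-byte alignment would move it to byte 208.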

Rendering Multiline Text with NVPath Extension and Pango

I'm using Pango to layout my text and NV Path to render glyphs.
I'm having difficulty finding the right way to get per-glyph positions. As you can see, at the moment I'm calculating these values from the line and glyph indices.
Pango has better methods for this, such as per-glyph and per-line extent queries. My problem is that these methods have no documentation and I wasn't able to find any samples.
How can I get correct glyph positions from Pango for this type of application?
std::vector<uint32_t> glyphs;
std::vector<GLfloat> positions;
int lineCount = pango_layout_get_line_count( pangoLayout );
for ( int l = 0; l < lineCount; ++l )
{
    PangoLayoutLine* line = pango_layout_get_line_readonly( pangoLayout, l );
    GSList* runs = line->runs;
    float xOffset = 0.0f;
    while( runs )
    {
        PangoLayoutRun* run = static_cast<PangoLayoutRun*>( runs->data );
        glyphs.resize( run->glyphs->num_glyphs, 0 );
        positions.resize( run->glyphs->num_glyphs * 2, 0 );
        for( int g = 0; g < run->glyphs->num_glyphs; ++g )
        {
            glyphs[g] = run->glyphs->glyphs[g].glyph;
            // Need Correct Values Here
            positions[ g * 2 + 0 ] = xOffset * NVPATH_DEFUALT_EMSCALE;
            positions[ g * 2 + 1 ] = (float)l * NVPATH_DEFUALT_EMSCALE;
            xOffset += PANGO_PIXELS( run->glyphs->glyphs[g].geometry.width ) / getFontSize();
        }
        const Font::RefT font = getFont( pango_font_description_get_family( pango_font_describe( run->item->analysis.font ) ) );
        glEnable( GL_STENCIL_TEST );
        glStencilFillPathInstancedNV( run->glyphs->num_glyphs,
        glStencilFunc( GL_NOTEQUAL, 0, 0xFF );
        glStencilOp( GL_KEEP, GL_KEEP, GL_ZERO );
        glColor3f( 0.0, 0.0, 0.0 );
        glCoverFillPathInstancedNV( run->glyphs->num_glyphs,
        glDisable( GL_STENCIL_TEST );
        runs = runs->next;
    }
}

DX11 Compute Shader writes only to one index

I really can't figure out what's going on here.
I have a compute shader that takes in an FFT result (from real input) and computes the powers of each bin, storing them in a different buffer (UAV). The FFT implementation is that of the D3DCSX library.
The shader in question:
struct Complex {
    float real;
    float imag;
};
RWStructuredBuffer<Complex> g_result : register(u0);
RWStructuredBuffer<float> g_powers : register(u1);
[numthreads(1, 1, 1)] void main(uint3 id : SV_DispatchThreadID) {
    const uint bin = id.x;
    const float real = g_result[bin + 1].real;
    const float imag = g_result[bin + 1].imag;
    const float power = real * real + imag * imag;
    const float mag = sqrt(power);
    const float db = 10.0f * log10(1.0f + power);
    g_powers[bin] = power;
}
The buffer creation code:
//The buffer in which the resulting powers are stored (m_result_buffer1)
buffer_desc.ByteWidth = sizeof(float) * NumBins();
buffer_desc.CPUAccessFlags = 0;
buffer_desc.StructureByteStride = sizeof(float);
buffer_desc.Usage = D3D11_USAGE_DEFAULT;
hr = m_device->CreateBuffer (
); HR_THROW();
//UAV for m_result_buffer1
view_desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
view_desc.Buffer.FirstElement = 0;
view_desc.Format = DXGI_FORMAT_R32_TYPELESS;
view_desc.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_RAW;
view_desc.Buffer.NumElements = NumBins();
hr = m_device->CreateUnorderedAccessView (
); HR_THROW();
//Buffer for reading powers to the CPU
buffer_desc.BindFlags = 0;
buffer_desc.ByteWidth = sizeof(float) * NumBins();
buffer_desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
buffer_desc.MiscFlags = 0;
buffer_desc.StructureByteStride = sizeof(float);
buffer_desc.Usage = D3D11_USAGE_STAGING;
hr = m_device->CreateBuffer (
); HR_THROW();
The dispatch code:
CComPtr<ID3D11UnorderedAccessView> result_view;
hr = m_fft->ForwardTransform (
); HR_THROW();
ID3D11UnorderedAccessView* views[] = {
result_view, //FFT UAV (u0)
m_result_view //Power UAV (u1)
};
m_context->CSSetShader(m_power_cs, nullptr, 0);
m_context->CSSetUnorderedAccessViews(0, 2, views, nullptr);
m_context->Dispatch(NumBins(), 1, 1);
And finally the CPU mapping code:
m_context->CopyResource(m_result_buffer2, m_result_buffer1);
m_context->Map(m_result_buffer2, 0, D3D11_MAP_READ, 0, &sub);
memcpy(result, sub.pData, sizeof(float) * NumBins());
m_context->Unmap(m_result_buffer2, 0);
What happens is this shader appears to have every thread write to the same index in the output buffer. The mapped buffer always reads a correct value for the first bin, then 0.0f for every other bin. The equivalent code on the CPU runs just fine. What's weird is I've placed conditionals and know that bin is not just 0 all the time, and that the power of every bin outside bin 0 is also not always 0.0f. I've also tried writing to multiple bins on the same thread using a for loop, and the same thing happens. What am I doing wrong?
I have a hunch that the buffer creation code or the mapping code is at the root of the problem. I know I'm running the correct number of threads on the GPU and that the dispatch IDs are correct; it's the CPU-side result that's wrong.
Problem Solved!
I was using a RWStructuredBuffer to represent a RWByteAddressBuffer. Not entirely sure how that led to this result, but it did. So, the FFT result is now a RWByteAddressBuffer. What was strange about this buffer, though, was the fact that the D3DCSX implementation spaced the float values so far apart - possibly for cache reasons, but I'm honestly not too sure why. This is my compute shader now (computing decibels instead of powers this time - an unrelated change):
RWByteAddressBuffer g_result : register(u0);
RWStructuredBuffer<float> g_decibels : register(u1);
[numthreads(256, 1, 1)] void main(uint3 id : SV_DispatchThreadID) {
const float real = asfloat(g_result.Load(id.x * 8 + 0));
const float imag = asfloat(g_result.Load(id.x * 8 + 4));
const float power = real * real + imag * imag;
const float db = 10.0f * log10(1.0f + power);
g_decibels[id.x] = db;
}
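The byte-offset arithmetic in that shader (`id.x * 8 + 0` for the real part, `id.x * 8 + 4` for the imaginary part) can be checked on the CPU with a sketch like the following. `loadFloat` and `decibels` are hypothetical helpers; the sketch assumes, as the shader above does, that each bin is stored as two consecutive 32-bit floats.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reads a 32-bit float at `byteOffset` from a raw byte buffer, similar in
// spirit to asfloat(buffer.Load(byteOffset)) in the shader.
inline float loadFloat(const std::vector<std::uint8_t>& raw, std::size_t byteOffset) {
    assert(byteOffset + sizeof(float) <= raw.size());
    float value;
    std::memcpy(&value, raw.data() + byteOffset, sizeof(float));
    return value;
}

// Computes 10 * log10(1 + re^2 + im^2) for each bin, assuming every bin is
// stored as an interleaved (real, imag) pair of floats, 8 bytes per bin.
std::vector<float> decibels(const std::vector<std::uint8_t>& raw, std::size_t numBins) {
    std::vector<float> out(numBins);
    for (std::size_t bin = 0; bin < numBins; ++bin) {
        const float real = loadFloat(raw, bin * 8 + 0);
        const float imag = loadFloat(raw, bin * 8 + 4);
        const float power = real * real + imag * imag;
        out[bin] = 10.0f * std::log10(1.0f + power);
    }
    return out;
}
```

Running this over the same bytes that the GPU reads would be one way to compare the CPU and GPU results bin by bin.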
I changed my decibel buffer's description to that of a structured buffer, though, just to make things easier for me:
buffer_desc.ByteWidth = sizeof(float) * NumBins();
buffer_desc.CPUAccessFlags = 0;
buffer_desc.StructureByteStride = sizeof(float);
buffer_desc.Usage = D3D11_USAGE_DEFAULT;
hr = m_device->CreateBuffer (
); HR_THROW();
view_desc.Buffer.FirstElement = 0;
view_desc.Buffer.Flags = 0;
view_desc.Buffer.NumElements = NumBins();
view_desc.Format = DXGI_FORMAT_UNKNOWN;
view_desc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
hr = m_device->CreateUnorderedAccessView (
); HR_THROW();
This is why g_decibels is still a RWStructuredBuffer.
Still unknown to me is whether or not it matters that the result buffer is read/write when only reads are necessary - if I change g_result to a regular ByteAddressBuffer I get no output. But at least it's working now.