How to count dead particles in the compute shader? - c++

I am working on a particle system. For calculating each particle's position, time alive and so on I use a compute shader. My problem is getting the count of dead particles back to the CPU, so I can tell the renderer how many particles to render. To store the particle data I use a shader storage buffer, and to render the particles I use instancing. I tried an atomic counter buffer, which works fine, but copying its data back to the CPU is slow. I wonder if there is some other option.
This is the important part of the compute shader:
if (pData.timeAlive >= u_LifeTime)
{
    pData.velocity = pData.defaultVelocity;
    pData.timeAlive = 0;
    pData.isAlive = u_Loop;
    atomicCounterIncrement(deadParticles);
    pVertex.position.x = pData.defaultPosition.x;
    pVertex.position.y = pData.defaultPosition.y;
}
InVertex[id] = pVertex;
InData[id] = pData;
To copy the data back to the CPU I use the following code:
uint32_t* OpenGLAtomicCounter::GetCounters()
{
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, m_AC);
    glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(uint32_t) * m_NumberOfCounters, m_Counters);
    return m_Counters;
}
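One option worth noting is the same idea the vkCmdDrawIndirect answer further down uses for Vulkan: instead of reading the counter back, let the compute shader write the alive-particle count into the instanceCount field of a draw command stored in a GPU buffer and render with glDrawArraysIndirect, so the count never leaves the GPU. A rough sketch, with illustrative buffer names and assuming 6 vertices per particle quad:
// The indirect buffer holds one DrawArraysIndirectCommand:
//   struct DrawArraysIndirectCommand { GLuint count, instanceCount, first, baseInstance; };
// 'count' is set to 6 once on the CPU; the compute shader writes 'instanceCount'.
glMemoryBarrier(GL_COMMAND_BARRIER_BIT);                  // make the compute-shader write visible to the draw
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, m_IndirectBuffer);  // the same buffer the compute shader writes to
glDrawArraysIndirect(GL_TRIANGLES, nullptr);              // instance count is read directly on the GPU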

Related

Compute Shader execution time between DirectX11 and OpenGL

I am studying compute shaders in DirectX and OpenGL.
I wrote some code to test a compute shader and checked the execution time,
but there was a difference between the DirectX execution time and OpenGL's.
The image above shows how different it is (left is DirectX, right is OpenGL, times are in nanoseconds);
even the DirectX compute shader is slower than the CPU.
Here is my code that calculates the sum of two vectors,
once with the compute shader and once on the CPU:
std::vector<Data> dataA(32);
std::vector<Data> dataB(32);
for (int i = 0; i < 32; ++i)
{
    dataA[i].v1 = glm::vec3(i, i, i);
    dataA[i].v2 = glm::vec2(i, 0);
    dataB[i].v1 = glm::vec3(-i, i, 0.0f);
    dataB[i].v2 = glm::vec2(0, -i);
}
InputBufferA = ShaderBuffer::Create(sizeof(Data), 32, BufferType::Read, dataA.data());
InputBufferB = ShaderBuffer::Create(sizeof(Data), 32, BufferType::Read, dataB.data());
OutputBufferA = ShaderBuffer::Create(sizeof(Data), 32, BufferType::ReadWrite);
computeShader->Bind();
InputBufferA->Bind(0, ShaderType::CS);
InputBufferB->Bind(1, ShaderType::CS);
OutputBufferA->Bind(0, ShaderType::CS);
// Check the compute shader calculation time
std::chrono::system_clock::time_point time1 = std::chrono::system_clock::now();
RenderCommand::DispatchCompute(1, 1, 1);
std::chrono::system_clock::time_point time2 = std::chrono::system_clock::now();
std::chrono::nanoseconds t = time2 - time1;
QCAT_CORE_INFO("Compute Shader time : {0}", t.count());
// Check the CPU calculation time
std::vector<Data> dataC(32);
time1 = std::chrono::system_clock::now();
for (int i = 0; i < 32; ++i)
{
    dataC[i].v1 = (dataA[i].v1 + dataB[i].v1);
    dataC[i].v2 = (dataA[i].v2 + dataB[i].v2);
}
time2 = std::chrono::system_clock::now();
t = time2 - time1;
QCAT_CORE_INFO("CPU time : {0}", t.count());
and here is the GLSL code:
#version 450 core
struct Data
{
    vec3 a;
    vec2 b;
};
layout(std430, binding = 0) readonly buffer Data1
{
    Data input1[];
};
layout(std430, binding = 1) readonly buffer Data2
{
    Data input2[];
};
layout(std430, binding = 2) writeonly buffer Data3
{
    Data outputData[];
};
layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;
void main()
{
    uint index = gl_GlobalInvocationID.x;
    outputData[index].a = input1[index].a + input2[index].a;
    outputData[index].b = input1[index].b + input2[index].b;
}
and the HLSL code:
struct Data
{
    float3 v1;
    float2 v2;
};
StructuredBuffer<Data> gInputA : register(t0);
StructuredBuffer<Data> gInputB : register(t1);
RWStructuredBuffer<Data> gOutput : register(u0);
[numthreads(32, 1, 1)]
void CSMain(int3 dtid : SV_DispatchThreadID)
{
    gOutput[dtid.x].v1 = gInputA[dtid.x].v1 + gInputB[dtid.x].v1;
    gOutput[dtid.x].v2 = gInputA[dtid.x].v2 + gInputB[dtid.x].v2;
}
Pretty simple code, isn't it?
But OpenGL's time is about 10 times better than DirectX's,
and I don't get why this happens. Is there anything slowing down the performance?
This is the code I use when creating the RWStructuredBuffer; the only difference for the StructuredBuffer is BindFlags = D3D11_BIND_SHADER_RESOURCE:
desc.Usage = D3D11_USAGE_DEFAULT;
desc.ByteWidth = size * count;
desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
desc.CPUAccessFlags = 0;
desc.StructureByteStride = size;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
uavDesc.Format = DXGI_FORMAT_UNKNOWN;
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.Flags = 0;
uavDesc.Buffer.NumElements = count;
and in OpenGL I create the SSBO like this:
glGenBuffers(1, &m_renderID);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, m_renderID);
glBufferData(GL_SHADER_STORAGE_BUFFER, int(size * count), pData, GL_STATIC_DRAW);
This is all the code for executing the compute shader in both APIs,
and every result shows me OpenGL is faster than DirectX.
What makes that difference?
Is it in the buffer setup or the shader code?
So first, as mentioned in the comments, you are not measuring GPU execution time, but the time to record the command itself (the GPU will execute it later, at the point where it decides to flush commands).
In order to measure GPU execution time, you need to use queries.
In your case (Direct3D11, but it is similar for OpenGL), you need to create 3 queries (a creation sketch follows this list):
2 must be of type D3D11_QUERY_TIMESTAMP (to measure start and end times)
1 must be of type D3D11_QUERY_TIMESTAMP_DISJOINT (the disjoint query indicates that the timestamp results are no longer valid, for example if your GPU's clock frequency changed). The disjoint query also gives you the frequency, which is needed to convert ticks to milliseconds.
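Creating them could look roughly like this (a minimal sketch, assuming device is your ID3D11Device, with error handling omitted):
D3D11_QUERY_DESC queryDesc = {};
queryDesc.Query = D3D11_QUERY_TIMESTAMP;
ID3D11Query* yourFirstTimeStampQuery = nullptr;
ID3D11Query* yourSecondTimeStampQuery = nullptr;
device->CreateQuery(&queryDesc, &yourFirstTimeStampQuery);
device->CreateQuery(&queryDesc, &yourSecondTimeStampQuery);
queryDesc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
ID3D11Query* yourDisjointQuery = nullptr;
device->CreateQuery(&queryDesc, &yourDisjointQuery);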
So to measure your GPU time, you issue the following on the device context:
d3d11DeviceContext->Begin(yourDisjointQuery);
d3d11DeviceContext->End(yourFirstTimeStampQuery);
// Dispatch call goes here
d3d11DeviceContext->End(yourSecondTimeStampQuery);
d3d11DeviceContext->End(yourDisjointQuery);
Note that the timestamp queries only call End (there is no Begin for them), which is perfectly normal: you are simply asking for the "GPU clock" at that point.
Then you can read the results back (order does not matter):
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData;
UINT64 start = 0, end = 0;
// In practice, poll each GetData call until it returns S_OK.
d3d11DeviceContext->GetData(yourDisjointQuery, &disjointData, sizeof(disjointData), 0);
d3d11DeviceContext->GetData(yourSecondTimeStampQuery, &end, sizeof(end), 0);
d3d11DeviceContext->GetData(yourFirstTimeStampQuery, &start, sizeof(start), 0);
Check that disjointData.Disjoint is FALSE, and get the frequency (ticks per second) from disjointData.Frequency. Then:
double delta = double(end - start);
double frequency = double(disjointData.Frequency);
double milliseconds = (delta / frequency) * 1000.0;
So now, why does "just" recording that command take so much more time than doing the same calculation on the CPU?
You only perform a few additions on 32 elements, which is an extremely trivial and fast operation for a CPU.
If you start to increase the element count, the GPU will eventually take over.
First, if your D3D device is created with the DEBUG flag, remove that flag when profiling. With some drivers (NVIDIA in particular), command recording performs very poorly with that flag.
Second, the driver performs quite a few checks when you call Dispatch (that resources have the correct format, correct strides, are still alive...). The DirectX driver tends to do a lot of checks, so it might be slightly slower than the GL one (but not by that magnitude, which leads to the last point).
Last, it is likely that the GPU/driver does a warm-up on your shader (some drivers convert the DX bytecode to their native counterpart asynchronously), so when you call
device->CreateComputeShader();
it might be done immediately or placed in a queue (AMD does the queue thing, see this link: GPU Open Shader Compiler controls).
If you call Dispatch before this task has effectively been processed, you might have to wait as well.
Also note that most GPUs have an on-disk shader cache nowadays, so the first compile/use might also impact performance.
So you should try calling Dispatch several times and check whether the CPU timings differ after the first call.
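For example, a quick sketch reusing the RenderCommand::DispatchCompute and QCAT_CORE_INFO helpers from the question (so the exact logging call is an assumption about that engine):
for (int i = 0; i < 5; ++i)
{
    auto t1 = std::chrono::system_clock::now();
    RenderCommand::DispatchCompute(1, 1, 1);
    auto t2 = std::chrono::system_clock::now();
    // The first iteration typically absorbs shader warm-up / deferred compilation;
    // later iterations show the real cost of recording the dispatch.
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    QCAT_CORE_INFO("Dispatch {0} record time: {1} ns", i, ns);
}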

Compute to Graphics Dependencies

I am doing the Marching Cubes algorithm in a compute shader. The vertices generated by the compute stage will be the input to the vertex stage.
Compute -> Vertices -> Render
There is no way of knowing how many vertices the compute stage will output, so I need a storage buffer looking something like this:
layout(set = 1, binding = 0) buffer Count {
    int value;
} count;
layout(set = 2, binding = 0) buffer Mesh {
    vec4 vertices[1 << 15];
} mesh;
The vertices do not need a round trip to the CPU, but the count is a variable used by the vkCmdDraw command. So I would need to put the count buffer in host-visible memory, map that memory and do a memcpy after the compute stage. Is this a good way of solving this problem, or is there some other way where I don't have to read data back to the CPU?
Well, this is exactly what vkCmdDrawIndirect is for. The vertex count is stored in a VkBuffer, which makes the CPU round trip unnecessary.
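To make that concrete, here is a minimal sketch (buffer names and bindings are illustrative): the compute shader writes the vertex count into a buffer laid out like VkDrawIndirectCommand, and the graphics command buffer then consumes it with no CPU involvement.
// GLSL side (conceptually):
//   layout(set = 1, binding = 0) buffer IndirectDraw {
//       uint vertexCount;   // written by the compute shader
//       uint instanceCount; // initialized to 1
//       uint firstVertex;   // 0
//       uint firstInstance; // 0
//   } drawCmd;
// The buffer needs VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT in addition to its storage-buffer usage,
// and a pipeline barrier from VK_ACCESS_SHADER_WRITE_BIT (compute stage) to
// VK_ACCESS_INDIRECT_COMMAND_READ_BIT (VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT) before drawing.
vkCmdDrawIndirect(commandBuffer,
                  indirectBuffer,                  // VkBuffer holding the VkDrawIndirectCommand
                  0,                               // byte offset of the command
                  1,                               // drawCount
                  sizeof(VkDrawIndirectCommand));  // stride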

Can you modify a uniform from within the shader? If so. how?

So I wanted to store all my meshes in one large VBO. The problem is: how do you issue just one draw call, but let every mesh have its own model-to-world matrix?
My idea was to submit an array of matrices to a uniform before drawing. In the VBO I would make the color of every first vertex of a mesh negative (so I'd be using the sign bit to check whether a vertex is the first of a mesh).
Okay, so I can detect when a new mesh has started and I have an array of matrices ready and probably a uniform called 'index'. But how do I increase this index by one every time I encounter a new mesh?
Can you modify a uniform from within the shader? If so, how?
Can you modify a uniform from within the shader?
If you could, it wouldn't be uniform anymore, would it?
Furthermore, what you're wanting to do cannot be done even with Image Load/Store or SSBOs, both of which allow shaders to write data. It won't work because vertex shader invocations are not required to be executed sequentially. Many happen at the same time, and there's no way for any shader invocation to know that it will happen "after" the "first vertex" in a mesh.
The simplest way to deal with this is the obvious solution. Render each mesh individually, but set the uniforms for each mesh before each draw call. Without changing buffers between draws, of course. Uniform changes, while not exactly cheap, aren't the most expensive state changes that exist.
There are more complicated drawing methods that could give you more performance, but that form is adequate for most needs. You've already done the hard part: you removed the need for any state change (textures, buffers, vertex formats, etc.) except uniform state.
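For illustration, a minimal OpenGL sketch of that per-mesh loop (the Mesh fields and the u_Model uniform name are assumptions, not part of the question):
// The shared VBO/IBO/VAO and the program stay bound for the whole loop;
// only the model-to-world matrix uniform changes between draw calls.
GLint modelLoc = glGetUniformLocation(program, "u_Model");
for (const Mesh& mesh : meshes)
{
    glUniformMatrix4fv(modelLoc, 1, GL_FALSE, glm::value_ptr(mesh.modelMatrix));
    glDrawElementsBaseVertex(GL_TRIANGLES, mesh.indexCount, GL_UNSIGNED_INT,
                             (const void*)(mesh.firstIndex * sizeof(GLuint)),
                             mesh.baseVertex);
}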
There are two approaches to minimizing draw calls: instancing and batching. The first (instancing) allows you to draw multiple copies of the same mesh in one draw call, but it depends on the API (it is available since OpenGL 3.1). Batching is similar to instancing but allows you to draw different meshes. Both approaches have restrictions: the meshes must use the same materials and shaders.
If you want to draw different meshes from one VBO, then instancing is not an option. So batching requires keeping all meshes in one 'big' VBO with the world transform already applied. That is not a problem for static meshes, but it is somewhat awkward for animated ones. Here is some pseudocode for a batching implementation:
struct SGeometry
{
    uint64_t offsetVB;
    uint64_t offsetIB;
    uint64_t sizeVB;
    uint64_t sizeIB;
    glm::mat4 oldTransform;
    glm::mat4 transform;
};
std::vector<SGeometry> cachedGeometries;
...
void CommitInstances()
{
    uint64_t vertexOffset = 0;
    uint64_t indexOffset = 0;
    for (auto instance : allInstances)
    {
        Copy(instance->Vertexes(), VBO);
        for (uint64_t i = 0; i < instance->Indices().size(); ++i)
        {
            auto index = instance->Indices()[i];
            index += vertexOffset; // rebase indices onto this instance's vertex range
            IBO[indexOffset + i] = index;
        }
        cachedGeometries.push_back({vertexOffset, indexOffset,
                                    instance->Vertexes().size(), instance->Indices().size()});
        vertexOffset += instance->Vertexes().size();
        indexOffset += instance->Indices().size();
    }
    Commit(VBO);
    Commit(IBO);
}
void ApplyTransform(glm::mat4 modelMatrix, uint64_t instanceId)
{
    const SGeometry& geom = cachedGeometries[instanceId];
    glm::mat4 inverseOldTransform = glm::inverse(geom.oldTransform);
    VertexStream& stream = VBO->GetStream(Position, geom.offsetVB);
    for (uint64_t i = 0; i < geom.sizeVB; ++i)
    {
        glm::vec3 pos = stream.Get(i);
        // We need to revert the old absolute transformation before applying the new one
        pos = glm::vec3(inverseOldTransform * glm::vec4(pos, 1.0f));
        pos = glm::vec3(modelMatrix * glm::vec4(pos, 1.0f));
        stream.Set(i, pos);
    }
    // .. Apply normal transformation
}
GPU Gems 2 has a good article about geometry instancing http://www.amazon.com/GPU-Gems-Programming-High-Performance-General-Purpose/dp/0321335597

Sum of absolute differences of 2 geometries within a shader in Unity

I am trying to do a sum of absolute differences within my shader and write the single result back to a uniform float in Unity.
In the shader I have 2 geometries with the same number of vertices that map one to one.
    // subtract vertices
    float norm = 10;
    float error = infereCrater.vertex.y - v.vertex.y;
    error = error * error * norm;
    o.debugColor = float3(error, 1 - error, 0.0f);
    //////
    o.posWorld = mul(_Object2World, v.vertex);
    o.normalWorld = normalize(mul(float4(v.normal, 0.0), _World2Object).xyz);
    o.tangentWorld = normalize(mul(float4(v.tangent, 0.0), _World2Object).xyz);
    o.binormalWorld = cross(o.normalWorld, o.tangentWorld);
    o.tex = v.texcoord;
    o.pos = mul(UNITY_MATRIX_MVP, v.vertex);
    TRANSFER_VERTEX_TO_FRAGMENT(o);
    return o;
}
I am able to calculate the error for each individual vertex and change the color of the surface based on the difference.
I hit a roadblock where I don't know how to sync all the threads and start adding up the values.
Is there a way to call another vertex shader after the first one is done?
How can a vertex shader read the values of the vertices adjacent to it? (I don't think it's possible, because they live in each thread's local memory.)
Or is it possible to have a global array to store the difference values, copy it to the CPU (which I don't want to do because of latency) and add them up on the CPU?
I don't want to use a compute shader because I am not on Windows.

(DirectX 11) Dynamic Vertex/Index Buffers implementation with constant scene content changes

I've been delving into unmanaged DirectX 11 for the first time (bear with me), and there's an issue that, although asked several times over the forums, still leaves me with questions.
I am developing an app in which objects are added to the scene over time. On each render loop I want to collect all vertices in the scene and render them, reusing a single vertex and index buffer for performance and best practice. My question is about the usage of dynamic vertex and index buffers: I haven't been able to fully understand their correct usage when the scene content changes.
vertexBufferDescription.Usage = D3D11_USAGE_DYNAMIC;
vertexBufferDescription.BindFlags = D3D11_BIND_VERTEX_BUFFER;
vertexBufferDescription.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
vertexBufferDescription.MiscFlags = 0;
vertexBufferDescription.StructureByteStride = 0;
Should I create the buffers when the scene is initialized and somehow update their content every frame? If so, what ByteWidth should I set in the buffer description? And what do I initialize it with?
Or should I create it the first time the scene is rendered (frame 1), using the current vertex count as its size? If so, when I add another object to the scene, don't I need to recreate the buffer, changing the buffer description's ByteWidth to the new vertex count? If my scene keeps updating its vertices each frame, a single dynamic buffer would lose its purpose this way...
I've been testing initializing the buffer the first time the scene is rendered and, from then on, using Map/Unmap each frame. I start by filling a vector with all the scene objects and then update the resource like so:
void Scene::Render()
{
    (...)
    std::vector<VERTEX> totalVertices;
    std::vector<int> totalIndices;
    int totalVertexCount = 0;
    int totalIndexCount = 0;
    for (shapeIterator = models.begin(); shapeIterator != models.end(); ++shapeIterator)
    {
        Model* currentModel = (*shapeIterator);
        // totalVertices gets filled here...
    }
    // At this point totalVertices and totalIndices have all scene data
    if (isVertexBufferSet)
    {
        // This is where it copies the new vertices to the buffer,
        // but it's causing flickering in the entire screen...
        D3D11_MAPPED_SUBRESOURCE resource;
        context->Map(vertexBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &resource);
        memcpy(resource.pData, &totalVertices[0], sizeof(totalVertices));
        context->Unmap(vertexBuffer, 0);
    }
    else
    {
        // This is run in the first frame. But what if new vertices are added to the scene?
        vertexBufferDescription.ByteWidth = sizeof(VERTEX) * totalVertexCount;
        UINT stride = sizeof(VERTEX);
        UINT offset = 0;
        D3D11_SUBRESOURCE_DATA resourceData;
        ZeroMemory(&resourceData, sizeof(resourceData));
        resourceData.pSysMem = &totalVertices[0];
        device->CreateBuffer(&vertexBufferDescription, &resourceData, &vertexBuffer);
        context->IASetVertexBuffers(0, 1, &vertexBuffer, &stride, &offset);
        isVertexBufferSet = true;
    }
At the end of the render loop, while keeping track of each object's offset within the buffer, I finally invoke Draw():
    context->Draw(objectVertexCount, currentVertexOffset);
}
My current implementation is causing my whole scene to flicker, but there are no memory leaks. I wonder if it has anything to do with the way I am using the Map/Unmap API?
Also, in this scenario, when would it be ideal to invoke buffer->Release()?
Tips or code sample would be great! Thanks in advance!
At the memcpy into the vertex buffer you do the following:
memcpy(resource.pData, &totalVertices[0], sizeof(totalVertices));
sizeof(totalVertices) is just asking for the size of the std::vector<VERTEX> object itself, which is not what you want.
Try the following code:
memcpy(resource.pData, &totalVertices[0], sizeof(VERTEX) * totalVertices.size());
Also, you don't appear to be calling IASetVertexBuffers when isVertexBufferSet is true. Make sure you do so.
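Putting both fixes together, the per-frame update branch would look roughly like this (a sketch built from the question's own variables; it assumes the buffer was created large enough for the current frame's vertices):
D3D11_MAPPED_SUBRESOURCE resource;
if (SUCCEEDED(context->Map(vertexBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &resource)))
{
    // Copy the actual payload size, not sizeof(std::vector).
    memcpy(resource.pData, totalVertices.data(), sizeof(VERTEX) * totalVertices.size());
    context->Unmap(vertexBuffer, 0);
}
// Rebind the vertex buffer every frame, not only on the frame that creates it.
UINT stride = sizeof(VERTEX);
UINT offset = 0;
context->IASetVertexBuffers(0, 1, &vertexBuffer, &stride, &offset);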