Compute Shader execution time between DirectX11 and OpenGL - c++

I am studying compute shaders in DirectX and OpenGL. I wrote some code to test a compute shader and checked its execution time, but there was a difference between the DirectX and OpenGL execution times.
The image above shows how different they are (left is DirectX, right is OpenGL; times are in nanoseconds). The DirectX compute shader is even slower than the CPU.
Here is my code that calculates the sum of the two vectors, once with the compute shader and once on the CPU:
std::vector<Data> dataA(32);
std::vector<Data> dataB(32);
for (int i = 0; i < 32; ++i)
{
    dataA[i].v1 = glm::vec3(i, i, i);
    dataA[i].v2 = glm::vec2(i, 0);
    dataB[i].v1 = glm::vec3(-i, i, 0.0f);
    dataB[i].v2 = glm::vec2(0, -i);
}

InputBufferA = ShaderBuffer::Create(sizeof(Data), 32, BufferType::Read, dataA.data());
InputBufferB = ShaderBuffer::Create(sizeof(Data), 32, BufferType::Read, dataB.data());
OutputBufferA = ShaderBuffer::Create(sizeof(Data), 32, BufferType::ReadWrite);

computeShader->Bind();
InputBufferA->Bind(0, ShaderType::CS);
InputBufferB->Bind(1, ShaderType::CS);
OutputBufferA->Bind(0, ShaderType::CS);

// Check The Compute Shader Calculation time
std::chrono::system_clock::time_point time1 = std::chrono::system_clock::now();
RenderCommand::DispatchCompute(1, 1, 1);
std::chrono::system_clock::time_point time2 = std::chrono::system_clock::now();
std::chrono::nanoseconds t = time2 - time1;
QCAT_CORE_INFO("Compute Shader time : {0}", t.count());

// Check The Cpu Calculation time
std::vector<Data> dataC(32);
time1 = std::chrono::system_clock::now();
for (int i = 0; i < 32; ++i)
{
    dataC[i].v1 = (dataA[i].v1 + dataB[i].v1);
    dataC[i].v2 = (dataA[i].v2 + dataB[i].v2);
}
time2 = std::chrono::system_clock::now();
t = time2 - time1;
QCAT_CORE_INFO("CPU time : {0}", t.count());
And here is the GLSL code:
#version 450 core
struct Data
{
    vec3 a;
    vec2 b;
};
layout(std430, binding = 0) readonly buffer Data1
{
    Data input1[];
};
layout(std430, binding = 1) readonly buffer Data2
{
    Data input2[];
};
layout(std430, binding = 2) writeonly buffer Data3
{
    Data outputData[];
};
layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;
void main()
{
    uint index = gl_GlobalInvocationID.x;
    outputData[index].a = input1[index].a + input2[index].a;
    outputData[index].b = input1[index].b + input2[index].b;
}
And the HLSL code:
struct Data
{
    float3 v1;
    float2 v2;
};
StructuredBuffer<Data> gInputA : register(t0);
StructuredBuffer<Data> gInputB : register(t1);
RWStructuredBuffer<Data> gOutput : register(u0);

[numthreads(32, 1, 1)]
void CSMain(int3 dtid : SV_DispatchThreadID)
{
    gOutput[dtid.x].v1 = gInputA[dtid.x].v1 + gInputB[dtid.x].v1;
    gOutput[dtid.x].v2 = gInputA[dtid.x].v2 + gInputB[dtid.x].v2;
}
Pretty simple code, isn't it? But OpenGL's time is about 10 times better than DirectX's, and I don't get why this happens. Is there anything slowing down the performance?
This is the code I use when I create the RWStructuredBuffer; the only difference for a StructuredBuffer is that BindFlags is D3D11_BIND_SHADER_RESOURCE:
desc.Usage = D3D11_USAGE_DEFAULT;
desc.ByteWidth = size * count;
desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
desc.CPUAccessFlags = 0;
desc.StructureByteStride = size;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
uavDesc.Format = DXGI_FORMAT_UNKNOWN;
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.Flags = 0;
uavDesc.Buffer.NumElements = count;
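For reference, the snippet above would typically be completed with the actual creation calls, something along these lines (a sketch assuming a valid ID3D11Device* named device, the same size/count/pData variables, and that uavDesc is zero-initialised first; this is not the asker's exact code):
// Create the structured buffer itself, optionally filled with initial data.
D3D11_SUBRESOURCE_DATA init = {};
init.pSysMem = pData;

ID3D11Buffer* buffer = nullptr;
HRESULT hr = device->CreateBuffer(&desc, pData ? &init : nullptr, &buffer);

// Create the UAV over the whole buffer so the compute shader can write to it.
ID3D11UnorderedAccessView* uav = nullptr;
if (SUCCEEDED(hr))
    hr = device->CreateUnorderedAccessView(buffer, &uavDesc, &uav);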
And in OpenGL I create the SSBO this way:
glGenBuffers(1, &m_renderID);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, m_renderID);
glBufferData(GL_SHADER_STORAGE_BUFFER, int(size * count), pData, GL_STATIC_DRAW);
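For the dispatch to see the buffers, each SSBO also has to be attached to the indexed binding point used in the shader; a minimal sketch of that step (presumably what the engine's Bind(slot, ShaderType::CS) call does under the hood; bindingIndex is a placeholder):
// Attach the SSBO to an indexed binding point; the index must match the
// layout(std430, binding = N) qualifier in the compute shader.
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, bindingIndex, m_renderID);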
This is all the code used to execute the compute shader in both APIs, and every result shows me that OpenGL is faster than DirectX.
What makes that difference? Is it in the buffer setup or in the shader code?

So first, as mentioned in the comments, you are not measuring GPU execution time, but the time to record the command itself (the GPU will execute it later, at the point when it decides to flush the commands).
In order to measure GPU execution time, you need to use queries.
In your case (Direct3D 11, but it is similar for OpenGL), you need to create 3 queries:
2 must be of type D3D11_QUERY_TIMESTAMP (to measure start and end time)
1 must be of type D3D11_QUERY_TIMESTAMP_DISJOINT (the disjoint query will indicate that the timestamp results are not valid anymore, for example if the clock frequency of your GPU changes). The disjoint query will also give you the frequency, which is needed to convert the timestamps to milliseconds.
So to measure your GPU time, you issue the following on the device context:
d3d11DeviceContext->Begin(yourDisjointQuery);
d3d11DeviceContext->End(yourFirstTimeStampQuery);
// Dispatch call goes here
d3d11DeviceContext->End(yourSecondTimeStampQuery);
d3d11DeviceContext->End(yourDisjointQuery);
Note that the timestamp queries only call End (never Begin), which is perfectly normal: you are just asking for the "GPU clock", to simplify.
Then you can call (order does not matter):
d3d11DeviceContext->GetData(yourDisjointQuery);
d3d11DeviceContext->GetData(yourSecondTimeStampQuery);
d3d11DeviceContext->GetData(yourFirstTimeStampQuery);
Check that the disjoint result is NOT disjoint, and get the frequency from it:
double delta = double(end - start);
double frequency = double(disjointData.Frequency);
double milliseconds = (delta / frequency) * 1000.0;
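Putting it together, here is a minimal end-to-end sketch of the D3D11 side (assuming an existing ID3D11Device* device and ID3D11DeviceContext* context; query and variable names are placeholders, not from the question's engine):
// Create the queries once (e.g. at startup).
ID3D11Query* tsDisjoint = nullptr;
ID3D11Query* tsStart = nullptr;
ID3D11Query* tsEnd = nullptr;
D3D11_QUERY_DESC qdesc = {};
qdesc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
device->CreateQuery(&qdesc, &tsDisjoint);
qdesc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&qdesc, &tsStart);
device->CreateQuery(&qdesc, &tsEnd);

// Bracket the dispatch with the queries.
context->Begin(tsDisjoint);
context->End(tsStart);
context->Dispatch(1, 1, 1);
context->End(tsEnd);
context->End(tsDisjoint);

// Read the results back. GetData returns S_FALSE until they are available,
// so real code would poll a frame or two later instead of spinning like this.
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData = {};
UINT64 start = 0, end = 0;
while (context->GetData(tsDisjoint, &disjointData, sizeof(disjointData), 0) != S_OK) {}
while (context->GetData(tsStart, &start, sizeof(start), 0) != S_OK) {}
while (context->GetData(tsEnd, &end, sizeof(end), 0) != S_OK) {}

if (!disjointData.Disjoint)
{
    double milliseconds =
        double(end - start) / double(disjointData.Frequency) * 1000.0;
    // milliseconds now holds the GPU execution time of the Dispatch call.
}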
So now, why does "just" recording that command take a lot of time compared to doing the same calculation on the CPU?
You only perform a few additions on 32 elements, which is an extremely trivial and fast operation for a CPU.
If you start to increase the element count, the GPU will eventually take over.
First, if your D3D device was created with the DEBUG flag, remove that flag when profiling. With some drivers (NVIDIA in particular), command recording performs very poorly with that flag.
Second, the driver will perform quite a few checks when you call Dispatch (that resources are of the correct format, have correct strides, are still alive...). The DirectX driver tends to do a lot of checks, so it might be slightly slower than the GL one (but not by that order of magnitude, which leads to the last point).
Last, it is likely that the GPU/driver does a warm-up on your shader (some drivers convert the DX bytecode to their native counterpart asynchronously), so when you call
device->CreateComputeShader();
it might be done immediately or placed in a queue (AMD does the queue thing, see this link: GPUOpen Shader Compiler controls).
If you call Dispatch before this task has effectively been processed, you might have a wait as well.
Also note that most GPUs have an on-disk shader cache nowadays, so the first compile/use might also impact performance.
So you should try to call Dispatch several times, and check if the CPU timings are different after the first call.
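For example, something along these lines makes the warm-up cost of the first dispatch visible (a sketch reusing the question's own RenderCommand wrapper and logging macro, and assuming the macro forwards extra format arguments):
// Record the CPU-side cost of several consecutive dispatches; the first one
// typically pays the shader warm-up / driver translation cost.
for (int i = 0; i < 5; ++i)
{
    auto t0 = std::chrono::steady_clock::now();
    RenderCommand::DispatchCompute(1, 1, 1);
    auto t1 = std::chrono::steady_clock::now();
    QCAT_CORE_INFO("Dispatch {0} record time: {1} ns", i,
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
}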

Related

GL_DYNAMIC_DRAW performance spikes

I'm working on my own game engine in C++ with OpenGL, and I'm currently working on the UI. Because I wanted the UI to scale nicely with different screens, every time there's a window resize it subs in some vertex data to the buffer so that the width stays consistent.
void UICanvas::windowResize(int width, int height) {
    for (std::vector<int>::iterator itr = wIndices.begin(); itr != wIndices.end(); itr++) {
        UIWindow& window = uiWindows.at(*itr);
        float newX = window.size.x / (100.0f * (*aspectRatio));
        float newY = window.size.y / 100.0f;
        float newPositions[] = {
            newX, 0.0f,
            newX, -newY
        };
        glBindBuffer(GL_ARRAY_BUFFER, window.positions_vbo);
        std::chrono::high_resolution_clock::time_point start, end;
        for (int i = 0; i < 1000; i++) { // For benchmarking purposes
            start = std::chrono::high_resolution_clock::now();
            glBufferSubData(GL_ARRAY_BUFFER, 16, 16, &newPositions[0]);
            end = std::chrono::high_resolution_clock::now();
            // The duration of glBufferSubData is all I care about
            printEvent("windowResize", start, end); // Outputs to json for profiling
        }
    }
}
Knowing it would be updated occasionally, I knew I should probably use GL_DYNAMIC_DRAW. However, instead of just listening to people, I wanted to benchmark the difference myself. For GL_DYNAMIC_DRAW, most of the resize function calls took 3-5 microseconds, with steady occasional spikes of 180-200 microseconds, and even sparser spikes up to 4 milliseconds, 1000x the normal value. GL_STATIC_DRAW, while the average was 7-13 microseconds, only spiked up to 100-150 microseconds, and only rarely. I know there are thousands of factors that can affect benchmarking, like the CPU caching values or optimizing certain things, but it would almost make more sense if STATIC_DRAW followed the same giant spike pattern. I think I read somewhere that dynamic buffers are stored in VRAM and static ones in regular RAM, but would that be the reason for this behavior? If I use DYNAMIC_DRAW, will it have occasional spikes, or is it due to the repeated calls?
EDIT: Apologies if this was confusing, but the for loop in the code is my testing method; I am looping the exact same function over and over again and timing only the buffer sub data, as that's where I'm looking for the difference between STATIC_DRAW and DYNAMIC_DRAW.

Lower framerate while using dedicated graphics card for OpenGL rendering

I'm using glDrawArraysInstanced to draw 10000 instances of a simple shape composed of 8 triangles.
On changing the dedicated graphics card used for rendering to my NVIDIA GTX 1060, I seem to be getting a lower framerate and also some visible stuttering.
This is the code I'm using to see the time taken for each frame:
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
float i = (float)(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()) / 1000000.0;
while (!glfwWindowShouldClose(window)){
    end = std::chrono::steady_clock::now();
    i = (float)(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()) / 1000000.0;
    std::cout << i << "\n";
    begin = end; //Edit
    //Other code for draw calls and to set uniforms.
}
Is this the wrong way to measure time elapsed per frame? If not, why is there a drop in performance?
Here is the comparison of the output :
Comparison Image
Updated Comparison Image
Edit:
The fragment shader simply sets the color for each fragment directly.
Vertex Shader Code:
#version 450 core
in vec3 vertex;
out vec3 outVertex;
uniform mat4 mv_matrix;
uniform mat4 proj_matrix;
uniform float time;
const float vel = 1.0;
float PHI = 1.61803398874989484820459;
float noise(in vec2 xy, in float seed) {
    return fract(tan(distance(xy * PHI, xy) * seed) * xy.x);
}
void main() {
    float y_coord = noise(vec2(-500 + gl_InstanceID / 100, -500 + gl_InstanceID % 100), 20) * 40 + vel * time;
    y_coord = mod(y_coord, 40) - 20;
    mat4 translationMatrix = mat4(vec4(1, 0, 0, 0), vec4(0, 1, 0, 0), vec4(0, 0, 1, 0), vec4(-50 + gl_InstanceID / 100, y_coord, -50 + gl_InstanceID % 100, 1));
    gl_Position = proj_matrix * mv_matrix * translationMatrix * vec4(vertex, 1);
    outVertex = vertex;
}
I'm changing the card used by Visual Studio for rendering here :
extern "C" {
_declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
}
The output is the same for both and is shown here:
Output
The desired output is an increased frame rate while using the dedicated GPU to render, that is, smaller time gaps between the rows in the comparison image attached.
For the Intel integrated card, it takes <0.01 seconds to render 1 frame.
For the dedicated GTX 1060, it takes ~0.2 seconds to render 1 frame.
I solved the issue by disabling NVIDIA PhysX GPU acceleration. For some reason it slows down graphics rendering. Now I'm getting ~280 FPS on my GPU, even when rendering ~100k instances.
Your output clearly shows the times monotonically increasing, rather than jittering around some mean value. The reason for this is that your code is measuring total elapsed time, not per-frame time. To make it measure per-frame time instead, you need a begin = end assignment at the end of your loop, so that the reference point for each frame is the end of the preceding frame, rather than the start time of the whole program.

How to count dead particles in the compute shader?

I am working on a particle system. For the calculation of each particle's position, time alive, and so on, I use a compute shader. My problem is getting the count of dead particles back to the CPU, so I can tell the renderer how many particles to render. To store the particle data I use a shader storage buffer, and to render the particles I use instancing. I tried using an atomic counter buffer; it works fine, but it is slow to copy the data from the buffer back to the CPU. I wonder if there is some other option.
This is the important part of the compute shader:
if (pData.timeAlive >= u_LifeTime)
{
    pData.velocity = pData.defaultVelocity;
    pData.timeAlive = 0;
    pData.isAlive = u_Loop;
    atomicCounterIncrement(deadParticles);
    pVertex.position.x = pData.defaultPosition.x;
    pVertex.position.y = pData.defaultPosition.y;
}
InVertex[id] = pVertex;
InData[id] = pData;
To copy the data to the CPU I use the following code:
uint32_t* OpenGLAtomicCounter::GetCounters()
{
    glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, m_AC);
    glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(uint32_t) * m_NumberOfCounters, m_Counters);
    return m_Counters;
}
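For context, the counter the shader increments is backed by a small buffer bound to GL_ATOMIC_COUNTER_BUFFER; a minimal sketch of that setup follows (binding index 0 and the reset-per-frame pattern are assumptions, not taken from the asker's engine):
GLuint counterBuffer = 0;
GLuint zero = 0;
glGenBuffers(1, &counterBuffer);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, counterBuffer);
// One GLuint counter, reset to zero before each dispatch.
glBufferData(GL_ATOMIC_COUNTER_BUFFER, sizeof(GLuint), &zero, GL_DYNAMIC_DRAW);
// The binding index must match the binding of the atomic_uint in the shader.
glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, counterBuffer);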

Is there a way to update a texture without using a staging buffer?

I'm working with the https://vulkan-tutorial.com/ depth buffering code as a base, with a few changes made so that the command buffer is updated every frame.
I'm using a crude way of checking FPS. I'm not sure how accurate it really is, but this is the check I use:
static auto startTime = std::chrono::high_resolution_clock::now();
auto currentTime = std::chrono::high_resolution_clock::now();
float time = std::chrono::duration<float, std::chrono::seconds::period>(currentTime - startTime).count();
if (time < 1)
{
counter++;
}
else
{
int a = 34; //breakpoint put here to check the counter fps.
}
Anyway, without updating the texture per frame (the command buffer is still being updated per frame) the FPS is around 3500. If I try to update the texture every frame, the FPS goes down to 350ish.
This is just test code with a blank texture, but this is the process I'm using to upload the texture the first time and to update it:
void createTextureImage()
{
int Width = 1024;
int Height = 1024;
VkDeviceSize imageSize = Width * Height * sizeof(Pixel);
PixelImage.resize(Width * Height, Pixel(0xFF, 0x00, 0x00));
VkBuffer stagingBuffer;
VkDeviceMemory stagingBufferMemory;
createBuffer(imageSize, VK_BUFFER_USAGE_TRANSFER_SRC_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, stagingBuffer, stagingBufferMemory);
void* data;
vkMapMemory(device, stagingBufferMemory, 0, imageSize, 0, &data);
memcpy(data, PixelImage.data(), static_cast<size_t>(imageSize));
vkUnmapMemory(device, stagingBufferMemory);
createImage(Width, Height, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_TILING_OPTIMAL, VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, textureImage, textureImageMemory);
transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_UNDEFINED, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL);
copyBufferToImage(stagingBuffer, textureImage, static_cast<uint32_t>(Width), static_cast<uint32_t>(Height));
transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL);
vkDestroyBuffer(device, stagingBuffer, nullptr);
vkFreeMemory(device, stagingBufferMemory, nullptr);
}
void UpdateTexture()
{
VkDeviceSize imageSize = 1024 * 1024 * sizeof(Pixel);
memset(&PixelImage[0], 0xFF, imageSize);
VkBuffer stagingBuffer;
VkDeviceMemory stagingBufferMemory;
createBuffer(imageSize, VK_BUFFER_USAGE_TRANSFER_SRC_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, stagingBuffer, stagingBufferMemory);
void* data;
vkMapMemory(device, stagingBufferMemory, 0, imageSize, 0, &data);
memcpy(data, PixelImage.data(), static_cast<size_t>(imageSize));
vkUnmapMemory(device, stagingBufferMemory);
transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_UNDEFINED, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL);
copyBufferToImage(stagingBuffer, textureImage, static_cast<uint32_t>(1024), static_cast<uint32_t>(1024));
transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL);
vkDestroyBuffer(device, stagingBuffer, nullptr);
vkFreeMemory(device, stagingBufferMemory, nullptr);
vkDestroyImageView(device, textureImageView, nullptr);
CreateImageView();
}
I've been playing around with it a little, and it seems that all the writing to the buffer and transitioning the layout multiple times is what's really slowing things down.
For a bit more context, this is the rest of the update-texture process:
UpdateTexture();
for (size_t i = 0; i < vulkanFrame.size(); i++)
{
VkDescriptorBufferInfo bufferInfo = {};
bufferInfo.buffer = uniformBuffers[i];
bufferInfo.offset = 0;
bufferInfo.range = sizeof(UniformBufferObject);
VkDescriptorImageInfo imageInfo = {};
imageInfo.imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
imageInfo.imageView = textureImageView;
imageInfo.sampler = textureSampler;
std::array<VkWriteDescriptorSet, 2> descriptorWrites = {};
descriptorWrites[0].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
descriptorWrites[0].dstSet = descriptorSets[i];
descriptorWrites[0].dstBinding = 0;
descriptorWrites[0].dstArrayElement = 0;
descriptorWrites[0].descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
descriptorWrites[0].descriptorCount = 1;
descriptorWrites[0].pBufferInfo = &bufferInfo;
descriptorWrites[1].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
descriptorWrites[1].dstSet = descriptorSets[i];
descriptorWrites[1].dstBinding = 1;
descriptorWrites[1].dstArrayElement = 0;
descriptorWrites[1].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
descriptorWrites[1].descriptorCount = 1;
descriptorWrites[1].pImageInfo = &imageInfo;
vkUpdateDescriptorSets(device, static_cast<uint32_t>(descriptorWrites.size()), descriptorWrites.data(), 0, nullptr);
}
Also, what's a good baseline FPS to have for a blank, updating screen for a 2D game? I'm using Vulkan for 3D, but I want to do retro 2D stuff with it too.
You're sending 4MB of data from the CPU to the GPU every frame. At 350 fps, that's ~1.4GB/sec of data transfer speed. That's pretty decent, all things considered.
The staging buffer isn't really the problem. Once you decide that you're going to be sending data from the CPU to the GPU, then you've forfeited some quantity of performance.
If you're really insistent on avoiding staging, you could check to see if your implementation allows linear textures to be sampled from by the shader. In that case, you can write data directly into the texture's memory. However, you'd need to double-buffer your textures, so that you're not writing to a texture that is currently in use by the GPU. But you'd need to do that anyway even with staging.
Something more effective you could do is stop doing pointless things. You need to stop:
Allocating and freeing the space for your staging buffer on every upload. Create enough staging memory and buffer space at the start of your application and just keep it around.
Unmapping the memory; there's pretty much no point to that in Vulkan unless you're about to delete said memory. Which again is not something you ought to be doing.
Submitting a transfer operation the moment you finish building it. I don't see your CB/queue work, so I imagine that transitionImageLayout and copyBufferToImage are not merely building CB information but also submitting it. That's killing performance (especially if transitionImageLayout also submits work). You want to have as few submits per-frame as possible, ideally only one per queue that you actually use.
All of these things hurt the CPU performance of your code. They don't change the actual time of the GPU transfer, but they make the code that causes that transfer run much slower.
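A minimal sketch of the first two points (create the staging buffer once and keep it mapped) is shown below, reusing the tutorial-style helpers from the question; the function and variable names are placeholders, and as noted above the transitions and copy should ideally be recorded into the frame's command buffer rather than submitted on their own:
// Created once at startup and kept alive for the application's lifetime.
VkBuffer       persistentStagingBuffer = VK_NULL_HANDLE;
VkDeviceMemory persistentStagingMemory = VK_NULL_HANDLE;
void*          persistentStagingPtr    = nullptr;
const VkDeviceSize stagingSize = 1024 * 1024 * sizeof(Pixel);

void createPersistentStaging()
{
    createBuffer(stagingSize, VK_BUFFER_USAGE_TRANSFER_SRC_BIT,
                 VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
                 persistentStagingBuffer, persistentStagingMemory);
    // Map once; with HOST_COHERENT memory there is no need to unmap or flush.
    vkMapMemory(device, persistentStagingMemory, 0, stagingSize, 0, &persistentStagingPtr);
}

void updateTexturePersistent()
{
    // Write straight into the mapped pointer: no per-frame allocation, no unmap.
    memcpy(persistentStagingPtr, PixelImage.data(), static_cast<size_t>(stagingSize));

    // Record the transition + copy + transition as part of the frame's work
    // instead of submitting them separately.
    transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB,
                          VK_IMAGE_LAYOUT_UNDEFINED, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL);
    copyBufferToImage(persistentStagingBuffer, textureImage, 1024, 1024);
    transitionImageLayout(textureImage, VK_FORMAT_R8G8B8A8_SRGB,
                          VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL);
}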

Updating just parts of a large OpenGL VBO at run-time without latency

I am trying to update a large VBO in OpenGL which has about 4,000,000 floats in it, but I only need to update the elements which are changing (<1%), and this needs to happen at run time.
I have pre-computed the indices which need changing, but because they are fragmented throughout the VBO, I have to send 1000 individual glBufferSubDataARB calls with the appropriate index offsets (I'm not sure if this is a problem or not).
I have set the VBO to use STREAM_DRAW_ARB because the update to the VBO occurs every 5 seconds.
Even if I update just 1000 of the objects in the VBO (so about 16,000 floats spread over 1000 calls), I notice a small but noticeable latency.
I believe this may be due to the VBO being used for drawing whilst it is being updated, as I've heard this can result in latency. I only know of solutions to this problem when you are updating the entire VBO - for example: OpenGL VBO updating data.
However, because my VBO is so large, I would think sending 4,000,000 data elements every 5 seconds would be a lot slower and use up a lot of the CPU-GPU bandwidth. So I was wondering if anybody knows how to avoid the VBO waiting for the GPU to finish before it can be updated, doing it the way I am: fragmented over the VBO, updated over about a thousand calls.
Anyway, the following is a section of my code which updates the buffer every 5 seconds with usually around 16,000 of the 4,000,000 floats present (but, as I say, using about 1000 calls):
for (unsigned int kkkk = 0; kkkk < surf_props.quadrant_indices[0].size(); kkkk++)
{
    temp_surf_base_colour[0] = surf_props.quadrant_brightness[0][kkkk];
    temp_surf_base_colour[1] = 1.0;
    temp_surf_base_colour[2] = surf_props.quadrant_brightness[0][kkkk];
    temp_surf_base_colour[3] = 1.0;
    temp_surf_base_colour[4] = surf_props.quadrant_brightness[0][kkkk];
    temp_surf_base_colour[5] = 1.0;
    temp_surf_base_colour[6] = surf_props.quadrant_brightness[0][kkkk];
    temp_surf_base_colour[7] = 1.0;
    temp_surf_base_colour[8] = surf_props.quadrant_brightness[0][kkkk];
    temp_surf_base_colour[9] = 1.0;
    temp_surf_base_colour[10] = surf_props.quadrant_brightness[0][kkkk];
    temp_surf_base_colour[11] = 1.0;
    temp_surf_base_colour[12] = surf_props.quadrant_brightness[0][kkkk];
    temp_surf_base_colour[13] = 1.0;
    temp_surf_base_colour[14] = surf_props.quadrant_brightness[0][kkkk];
    temp_surf_base_colour[15] = 1.0;

    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vb_colour_surf);
    glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, sizeof(GLfloat) * ((numb_surf_prims * 4) + surf_props.quadrant_indices[0][kkkk] * 16), sizeof(GLfloat) * 16, temp_surf_base_colour);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
}