I'm using glDrawArraysInstanced to draw 10000 instances of a simple shape composed of 8 triangles.
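(The draw call itself isn't shown below; for 8 triangles per shape and 10000 instances it would be roughly the following, assuming a non-indexed triangle list:)
glDrawArraysInstanced(GL_TRIANGLES, 0, 24, 10000); // 8 triangles * 3 vertices per instance, 10000 instances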
After switching the dedicated graphics card used for rendering to my NVIDIA GTX 1060, I'm getting a lower framerate and some visible stuttering.
This is the code I'm using to measure the time taken for each frame:
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
float i = (float)(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()) / 1000000.0;
while (!glfwWindowShouldClose(window)){
end = std::chrono::steady_clock::now();
i = (float)(std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()) / 1000000.0;
std::cout << i << "\n";
begin = end; //Edit
//Other code for draw calls and to set uniforms.
}
Is this the wrong way to measure time elapsed per frame? If not, why is there a drop in performance?
Here is a comparison of the output:
Comparison Image
Updated Comparison Image
Edit:
The fragment shader simply sets a color for each fragment directly.
Vertex shader code:
#version 450 core
in vec3 vertex;
out vec3 outVertex;
uniform mat4 mv_matrix;
uniform mat4 proj_matrix;
uniform float time;
const float vel = 1.0;
float PHI = 1.61803398874989484820459;
float noise(in vec2 xy, in float seed) {
return fract(tan(distance(xy * PHI, xy) * seed) * xy.x);
}
void main() {
float y_coord = noise(vec2(-500 + gl_InstanceID / 100, -500 + gl_InstanceID % 100), 20) * 40 + vel * time;
y_coord = mod(y_coord, 40) - 20;
mat4 translationMatrix = mat4(vec4(1, 0, 0, 0),
                              vec4(0, 1, 0, 0),
                              vec4(0, 0, 1, 0),
                              vec4(-50 + gl_InstanceID / 100, y_coord, -50 + gl_InstanceID % 100, 1));
gl_Position = proj_matrix * mv_matrix * translationMatrix * vec4(vertex, 1);
outVertex = vertex;
}
This is how I'm changing the card used for rendering:
extern "C" {
__declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;
}
The rendered output is the same for both cards and is shown here:
Output
The desired outcome is a higher frame rate when rendering on the dedicated GPU, i.e. smaller time gaps between the rows in the attached comparison image.
With the Intel integrated GPU, it takes <0.01 seconds to render one frame.
With the dedicated GTX 1060, it takes ~0.2 seconds to render one frame.
I solved the issue by disabling NVIDIA PhysX GPU acceleration. For some reason it slows down graphics rendering. Now I'm getting about ~280 FPS on the GPU, even when rendering ~100k instances.
Your output clearly shows the times monotonically increasing rather than jittering around some mean value. The reason is that your code measures total elapsed time, not per-frame time. To measure per-frame time instead, you need a begin = end assignment at the end of your loop, so that the reference point for each frame is the end of the preceding frame rather than the start time of the whole program.
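With that change, the measurement would look roughly like this (a sketch; the draw-call and uniform code is omitted just as in the question):
auto begin = std::chrono::steady_clock::now();
while (!glfwWindowShouldClose(window))
{
    // ... draw calls, uniform updates, glfwSwapBuffers, glfwPollEvents ...
    auto end = std::chrono::steady_clock::now();
    float frameSeconds = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() / 1000000.0f;
    std::cout << frameSeconds << "\n"; // time spent on this frame only
    begin = end; // the next frame is measured from the end of this one
}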
So first off, let me say that while the code works perfectly well from a visual point of view, it runs into very steep performance issues that get progressively worse as you add more lights. In its current form it's good as a proof of concept, or a tech demo, but is otherwise unusable.
Long story short, I'm writing a RimWorld-style game with real-time top-down 2D lighting. The way I implemented rendering is with a 3 layered technique as follows:
First I render occlusions to a single-channel R8 occlusion texture mapped to a framebuffer. This part is lightning fast and doesn't slow down with more lights, so it's not part of the problem:
Then I invoke my lighting shader by drawing a huge rectangle over my lightmap texture, which is mapped to another framebuffer. The light data is stored in an array in a UBO, and the shader uses the occlusion map in its calculations. This is where the slowdown happens:
And lastly, the lightmap texture is multiplied and added to the regular world renderer, this also isn't affected by the number of lights, so it's not part of the problem:
The problem is thus in the lightmap shader. The first iteration had many branches which froze my graphics driver right away when I first tried it, but after removing most of them I get a solid 144 fps at 1440p with 3 lights, and ~58 fps at 1440p with 20 lights. An improvement, but it scales very poorly. The shader code is as follows, with additional annotations:
#version 460 core
// per-light data
struct Light
{
vec4 location;
vec4 rangeAndstartColor;
};
const int MaxLightsCount = 16; // I've also tried 8 and 32, there was no real difference
layout(std140) uniform ubo_lights
{
Light lights[MaxLightsCount];
};
uniform sampler2D occlusionSampler; // the occlusion texture sampler
in vec2 fs_tex0; // the uv position in the large rectangle
in vec2 fs_window_size; // the window size to transform world coords to view coords and back
out vec4 color;
void main()
{
vec3 resultColor = vec3(0.0);
const vec2 size = fs_window_size;
const vec2 pos = (size - vec2(1.0)) * fs_tex0;
// process every light individually and add the resulting colors together
// this should be branchless, is there any way to check?
for(int idx = 0; idx < MaxLightsCount; ++idx)
{
const float range = lights[idx].rangeAndstartColor.x;
const vec2 lightPosition = lights[idx].location.xy;
const float dist = length(lightPosition - pos); // distance from current fragment to current light
// early abort, the next part is expensive
// this branch HAS to be important, right? otherwise it will check crazy long lines against occlusions
if(dist > range)
continue;
const vec3 startColor = lights[idx].rangeAndstartColor.yzw;
// walk between pos and lightPosition to find occlusions
// standard line DDA algorithm
vec2 tempPos = pos;
int lineSteps = int(ceil(abs(lightPosition.x - pos.x) > abs(lightPosition.y - pos.y) ? abs(lightPosition.x - pos.x) : abs(lightPosition.y - pos.y)));
const vec2 lineInc = (lightPosition - pos) / lineSteps;
// can I get rid of this loop somehow? I need to check each position between
// my fragment and the light position for occlusions, and this is the best I
// came up with
float lightStrength = 1.0;
while(lineSteps --> 0)
{
const vec2 nextPos = tempPos + lineInc;
const vec2 occlusionSamplerUV = tempPos / size;
lightStrength *= 1.0 - texture(occlusionSampler, vec2(occlusionSamplerUV.x, 1 - occlusionSamplerUV.y)).x;
tempPos = nextPos;
}
// the contribution of this light to the fragment color is based on
// its square distance from the light, and the occlusions between them
// implemented as multiplications
const float strength = max(0, range - dist) / range * lightStrength;
resultColor += startColor * strength * strength;
}
color = vec4(resultColor, 1.0);
}
I call this shader as many times as I need, since the results are additive. It works with large batches of lights or one by one. Performance-wise, I didn't notice any real change trying different batch numbers, which is perhaps a bit odd.
So my question is: is there a better way to look for any (boolean) occlusion between my fragment position and the light position in the occlusion texture, without iterating through every pixel by hand? Could renderbuffers perhaps help here? (From what I've read they're for reading data back to system memory, but I need the data in another shader.)
And perhaps, is there a better algorithm for what I'm doing here?
I can think of a couple of routes for optimization:
Exact: apply a distance transform to the occlusion map. This gives you the distance to the nearest occluder at each pixel, so within the loop you can safely step by that distance instead of taking baby steps. This drastically reduces the number of steps in open regions.
There is a very simple CPU-side algorithm to compute a DT, and it may suit you if your occluders are static; a sketch follows below. If your scene changes every frame, however, you'll need to search the literature for GPU-side algorithms, which seem to be more complicated.
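For reference, a minimal sketch of such a CPU-side distance transform (a two-pass chamfer variant); the function name and the assumption that the occlusion map is readable as a flat byte array are mine:
#include <vector>
#include <algorithm>
#include <cstdint>

// Two-pass chamfer distance transform: approximate distance (in pixels)
// from each cell to the nearest occluded cell. 'occluded' is row-major, w*h.
std::vector<float> distanceTransform(const std::vector<uint8_t>& occluded, int w, int h)
{
    const float INF = 1e9f, A = 1.0f, D = 1.41421356f; // axial / diagonal step costs
    std::vector<float> dist(w * h);
    for (int i = 0; i < w * h; ++i)
        dist[i] = occluded[i] ? 0.0f : INF;
    auto at = [&](int x, int y) -> float& { return dist[y * w + x]; };

    // Forward pass (top-left to bottom-right).
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float d = at(x, y);
            if (x > 0)              d = std::min(d, at(x - 1, y) + A);
            if (y > 0)              d = std::min(d, at(x, y - 1) + A);
            if (x > 0 && y > 0)     d = std::min(d, at(x - 1, y - 1) + D);
            if (x < w - 1 && y > 0) d = std::min(d, at(x + 1, y - 1) + D);
            at(x, y) = d;
        }
    // Backward pass (bottom-right to top-left).
    for (int y = h - 1; y >= 0; --y)
        for (int x = w - 1; x >= 0; --x) {
            float d = at(x, y);
            if (x < w - 1)              d = std::min(d, at(x + 1, y) + A);
            if (y < h - 1)              d = std::min(d, at(x, y + 1) + A);
            if (x < w - 1 && y < h - 1) d = std::min(d, at(x + 1, y + 1) + D);
            if (x > 0 && y < h - 1)     d = std::min(d, at(x - 1, y + 1) + D);
            at(x, y) = d;
        }
    return dist;
}
You would then upload the result as a single-channel float texture next to the occlusion map, and in the shader's inner loop advance tempPos by at least the sampled distance instead of by one DDA step.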
Inexact: resort to soft shadows -- it might be a compromise you are willing to make, and even seen as an artistic choice. If you are OK with that, you can create a mipmap from your occlusion map, and then progressively increase the step and sample lower levels as you go farther from the point you are shading.
You can go further and build an emitters map (into the same 4-channel map as the occlusion). Then your entire shading pass will be independent of the number of lights. This is an equivalent of voxel cone tracing GI applied to 2D.
I am studying compute shaders in DirectX and OpenGL. I wrote some code to test a compute shader and checked its execution time, but there was a big difference between the DirectX and OpenGL execution times.
The image above shows how different they are (left is DirectX, right is OpenGL; times are in nanoseconds). The DirectX compute shader is even slower than the CPU.
Here is my code that calculates the sum of two vectors, once with a compute shader and once on the CPU:
std::vector<Data> dataA(32);
std::vector<Data> dataB(32);
for (int i = 0; i < 32; ++i)
{
dataA[i].v1 = glm::vec3(i, i, i);
dataA[i].v2 = glm::vec2(i, 0);
dataB[i].v1 = glm::vec3(-i, i, 0.0f);
dataB[i].v2 = glm::vec2(0, -i);
}
InputBufferA = ShaderBuffer::Create(sizeof(Data), 32, BufferType::Read, dataA.data());
InputBufferB = ShaderBuffer::Create(sizeof(Data), 32, BufferType::Read, dataB.data());
OutputBufferA = ShaderBuffer::Create(sizeof(Data), 32, BufferType::ReadWrite);
computeShader->Bind();
InputBufferA->Bind(0, ShaderType::CS);
InputBufferB->Bind(1, ShaderType::CS);
OutputBufferA->Bind(0, ShaderType::CS);
// Check The Compute Shader Calculation time
std::chrono::system_clock::time_point time1 = std::chrono::system_clock::now();
RenderCommand::DispatchCompute(1, 1, 1);
std::chrono::system_clock::time_point time2 = std::chrono::system_clock::now();
std::chrono::nanoseconds t = time2 - time1;
QCAT_CORE_INFO("Compute Shader time : {0}", t.count());
// Check The Cpu Calculation time
std::vector<Data> dataC(32);
time1 = std::chrono::system_clock::now();
for (int i = 0; i < 32; ++i)
{
dataC[i].v1 = (dataA[i].v1 + dataB[i].v1);
dataC[i].v2 = (dataA[i].v2 + dataB[i].v2);
}
time2 = std::chrono::system_clock::now();
t = time2 - time1;
QCAT_CORE_INFO("CPU time : {0}", t.count() );
And here is the GLSL code:
#version 450 core
struct Data
{
vec3 a;
vec2 b;
};
layout(std430, binding = 0) readonly buffer Data1
{
Data input1[];
};
layout(std430, binding = 1) readonly buffer Data2
{
Data input2[];
};
layout(std430, binding = 2) writeonly buffer Data3
{
Data outputData[];
};
layout (local_size_x = 32, local_size_y = 1, local_size_z = 1) in;
void main()
{
uint index = gl_GlobalInvocationID.x;
outputData[index].a = input1[index].a + input2[index].a;
outputData[index].b = input1[index].b + input2[index].b;
}
And the HLSL code:
struct Data
{
float3 v1;
float2 v2;
};
StructuredBuffer<Data> gInputA : register(t0);
StructuredBuffer<Data> gInputB : register(t1);
RWStructuredBuffer<Data> gOutput : register(u0);
[numthreads(32,1,1)]
void CSMain(int3 dtid : SV_DispatchThreadID)
{
gOutput[dtid.x].v1 = gInputA[dtid.x].v1 + gInputB[dtid.x].v1;
gOutput[dtid.x].v2 = gInputA[dtid.x].v2 + gInputB[dtid.x].v2;
}
Pretty simple code, isn't it? But OpenGL's time is about 10 times better than DirectX's, and I don't get why. Is there anything slowing down the performance?
This is the code I use to create the RWStructuredBuffer; the only difference for a StructuredBuffer is BindFlags = D3D11_BIND_SHADER_RESOURCE:
desc.Usage = D3D11_USAGE_DEFAULT;
desc.ByteWidth = size * count;
desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
desc.CPUAccessFlags = 0;
desc.StructureByteStride = size;
desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc;
uavDesc.Format = DXGI_FORMAT_UNKNOWN;
uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.Flags = 0;
uavDesc.Buffer.NumElements = count;
And in OpenGL I create the SSBO like this:
glGenBuffers(1, &m_renderID);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, m_renderID);
glBufferData(GL_SHADER_STORAGE_BUFFER, int(size * count), pData, GL_STATIC_DRAW);
That is all the code used to execute the compute shader in both APIs, and every result shows OpenGL being faster than DirectX. What causes that difference? Is it in the buffer setup or the shader code?
So first, as mentioned in the comments, you are not measuring GPU execution time, but the time it takes to record the command itself (the GPU will execute it later, at whatever point the driver decides to flush its command buffer).
In order to measure GPU execution time, you need to use queries.
In your case (Direct3D 11, but it is similar for OpenGL), you need to create 3 queries:
2 must be of type D3D11_QUERY_TIMESTAMP (to measure start and end times).
1 must be of type D3D11_QUERY_TIMESTAMP_DISJOINT (the disjoint query indicates that the timestamp results are no longer valid, for example if the clock frequency of your GPU changed). The disjoint query also gives you the frequency, which is needed to convert ticks to milliseconds.
So to measure your GPU time, you issue the following on the device context:
d3d11DeviceContext->Begin(yourDisjointQuery);
d3d11DeviceContext->End(yourFirstTimeStampQuery);
// Dispatch call goes here
d3d11DeviceContext->End(yourSecondTimeStampQuery);
d3d11DeviceContext->End(yourDisjointQuery);
Note that the timestamp queries only call End (Begin is ignored for them), which is perfectly normal: you are simply asking for the current "GPU clock", to simplify. The disjoint query, on the other hand, brackets the whole measured region with Begin and End.
Then you can call (order does not matter):
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData;
UINT64 start = 0, end = 0;
// GetData returns S_FALSE until the result is available, so poll until it returns S_OK.
d3d11DeviceContext->GetData(yourDisjointQuery, &disjointData, sizeof(disjointData), 0);
d3d11DeviceContext->GetData(yourSecondTimeStampQuery, &end, sizeof(end), 0);
d3d11DeviceContext->GetData(yourFirstTimeStampQuery, &start, sizeof(start), 0);
Check that the disjoint result is NOT disjoint, and get the frequency from it:
// disjointData.Disjoint must be FALSE, otherwise discard this measurement.
double delta = double(end - start);                // in GPU ticks
double frequency = double(disjointData.Frequency); // ticks per second
double milliseconds = (delta / frequency) * 1000.0;
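On the OpenGL side the principle is the same; here is a minimal sketch using a timer query (GL_TIME_ELAPSED, core since OpenGL 3.3), with the same 1x1x1 dispatch as in the question:
GLuint query = 0;
glGenQueries(1, &query);
glBeginQuery(GL_TIME_ELAPSED, query);
glDispatchCompute(1, 1, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT); // make the SSBO writes visible
glEndQuery(GL_TIME_ELAPSED);
GLuint64 elapsedNs = 0;
// This blocks until the GPU has produced the result (in nanoseconds).
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);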
So now, why does "just" recording that command take so much time compared to doing the same calculation directly on the CPU?
You only perform a few additions on 32 elements, which is an extremely trivial and fast operation for a CPU.
If you start to increase the element count, the GPU will eventually take over.
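For example (a sketch reusing the wrapper from the question), with a million elements and local_size_x = 32 / numthreads(32,1,1), the dispatch becomes:
const int elementCount = 1 << 20;                        // instead of 32
RenderCommand::DispatchCompute(elementCount / 32, 1, 1); // 32768 groups of 32 threads each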
First, if your D3D device is created with the DEBUG flag, remove that flag when profiling. On some drivers (NVIDIA in particular), command recording performs very poorly with that flag.
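The flag in question is the one passed at device creation; when profiling, create the device without it (a sketch, variable names are mine):
ID3D11Device* device = nullptr;
ID3D11DeviceContext* context = nullptr;
UINT flags = 0; // no D3D11_CREATE_DEVICE_DEBUG while profiling
D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, flags,
                  nullptr, 0, D3D11_SDK_VERSION, &device, nullptr, &context);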
Second, the driver performs quite a few checks when you call Dispatch (that resources have the correct format, the correct strides, are still alive...). The DirectX driver tends to do a lot of checks, so it might be slightly slower than the GL one (but not by that magnitude, which leads to the last point).
Last, it is likely that the GPU/driver does a warm-up on your shader (some drivers convert the DX bytecode to their native counterpart asynchronously), so when you call
device->CreateComputeShader();
it might be done immediately or placed in a queue (AMD does the queue thing; see the GPUOpen shader compiler controls link).
If you call Dispatch before this task has effectively been processed, you might incur a wait as well.
Also note that most GPUs/drivers have a shader cache on disk nowadays, so the first compile/use might also impact performance.
So you should try to call Dispatch several times, and check whether the CPU timings change after the first call.
I have a hexagonal grid that I want to texture. I want to use a single texture with 16 distinct subtextures arranged in a 4x4 grid. Each "node" in the grid has an image type, and I want to smoothly blend between them. My approach for implementing this is to render triangles in pairs, and encode the 4 image types on all vertices in the two faces, as well as a set of 4 weighting factors (which are the barycentric coordinates for the two tris). I can then use those two things to blend smoothly between any combination of image types.
Here is the fragment shader I'm using. The problems arise from the use of int types, but I don't understand why. If I only use the first four sub-textures, I can change idx to a float and hardcode the Y coordinate to 0, and then it works as I expect.
vec2 offset(int idx) {
vec2 v = vec2(idx % 4, idx / 4);
return v / 4.0;
}
void main(void) {
//
// divide the incoming UVs into one of 16 regions. The
// offset() function should take an integer from 0..15
// and return the offset to that region in the 4x4 map
//
vec2 uv = v_uv / 4.0;
//
// The four texture regions involved at
// this vertex are encoded in vec4 t_txt. The same
// values are stored at all vertices, so this doesn't
// vary across the triangle
//
int ia = int(v_txt.x);
int ib = int(v_txt.y);
int ic = int(v_txt.z);
int id = int(v_txt.w);
//
// Use those indices in the offset function to get the
// texture sample at that point
//
vec4 ca = texture2D(txt, uv + offset(ia));
vec4 cb = texture2D(txt, uv + offset(ib));
vec4 cc = texture2D(txt, uv + offset(ic));
vec4 cd = texture2D(txt, uv + offset(id));
//
// Merge them with the four factors stored in vec4 v_tfact.
// These vary for each vertex
//
fragcolour = ca * v_tfact.x
+ cb * v_tfact.y
+ cc * v_tfact.z
+ cd * v_tfact.w;
}
Here is what's happening:
(My "pair of triangles" are actually about 20 and you can see their structure in the artifacts, but the effect is the same)
This artifacting behaves a bit like z-fighting: moving the scene around makes it all shimmer and shift wildly.
Why doesn't this work as I expect?
One solution I can fall back on is to simply use a 1-dimensional texture map, with all 16 sub-images in a horizontal line; then I can switch everything to floating point, since I won't need the modulo/integer-divide process to map idx -> x,y. But this feels clumsy, and I'd at least like to understand what's going on here.
Here is what it should look like, albeit with only 4 of the sub-images in use:
See OpenGL Shading Language 4.60 Specification - 5.4.1. Conversion and Scalar Constructors
When constructors are used to convert a floating-point type to an integer type, the fractional part of the floating-point value is dropped.
Hence int(v_txt.x) does not round v_txt.x; it truncates it. Since the values arrive through interpolation, they can land just below the intended integer (e.g. 2.999...), which then truncates to the wrong index.
You have to round the values to the nearest integer before constructing an integral value:
int ia = int(round(v_txt.x));
int ib = int(round(v_txt.y));
int ic = int(round(v_txt.z));
int id = int(round(v_txt.w));
Alternatively add 0.5 before constructing the integral value:
int ia = int(v_txt.x + 0.5);
int ib = int(v_txt.y + 0.5);
int ic = int(v_txt.z + 0.5);
int id = int(v_txt.w + 0.5);
I am currently working on my first OpenGL based game engine. I need normal mapping as a feature, but it isn't working correctly.
Here is an animation of what is happening:
The artifacts are affected by the angle between the light and the normals on the surface. Camera movement does not affect it in any way. I am also (at least for now) going the route of the less efficient method where the normal extracted from the normal map is converted into view space rather than converting everything to tangent space.
Here are the relevant pieces of my code:
Generating Tangents and Bitangents
for(int k=0;k<(int)mb->getIndexCount();k+=3)
{
unsigned int i1 = mb->getIndex(k);
unsigned int i2 = mb->getIndex(k+1);
unsigned int i3 = mb->getIndex(k+2);
JGE_v3f v0 = mb->getVertexPosition(i1);
JGE_v3f v1 = mb->getVertexPosition(i2);
JGE_v3f v2 = mb->getVertexPosition(i3);
JGE_v2f uv0 = mb->getVertexUV(i1);
JGE_v2f uv1 = mb->getVertexUV(i2);
JGE_v2f uv2 = mb->getVertexUV(i3);
JGE_v3f deltaPos1 = v1-v0;
JGE_v3f deltaPos2 = v2-v0;
JGE_v2f deltaUV1 = uv1-uv0;
JGE_v2f deltaUV2 = uv2-uv0;
float ur = deltaUV1.x * deltaUV2.y - deltaUV1.y * deltaUV2.x;
if(ur != 0)
{
float r = 1.0 / ur;
JGE_v3f tangent;
JGE_v3f bitangent;
tangent = ((deltaPos1 * deltaUV2.y) - (deltaPos2 * deltaUV1.y)) * r;
tangent.normalize();
bitangent = ((deltaPos1 * -deltaUV2.x) + (deltaPos2 * deltaUV1.x)) * r;
bitangent.normalize();
tans[i1] += tangent;
tans[i2] += tangent;
tans[i3] += tangent;
btans[i1] += bitangent;
btans[i2] += bitangent;
btans[i3] += bitangent;
}
}
Calculating the TBN matrix in the Vertex Shader
(mNormal corrects the normal for non-uniform scales)
vec3 T = normalize((mVW * vec4(tangent, 0.0)).xyz);
tnormal = normalize((mNormal * n).xyz);
vec3 B = normalize((mVW * vec4(bitangent, 0.0)).xyz);
tmTBN = transpose(mat3(
T.x, B.x, tnormal.x,
T.y, B.y, tnormal.y,
T.z, B.z, tnormal.z));
Finally here is where I use the sampled normal from the normal map and attempt to convert it to view space in the Fragment Shader
fnormal = normalize(nmapcolor.xyz * 2.0 - 1.0);
fnormal = normalize(tmTBN * fnormal);
"nmapcolor" is the sampled color from the normal map.
"fnormal" is then used like normal in the lighting calculations.
I have been trying to solve this for so long and have absolutely no idea how to get this working. Any help would be greatly appreciated.
EDIT - I slightly modified the code to work in world space and outputted the results. The big platform does not have normal mapping (and it works correctly) while the smaller platform does.
I added in what direction the normals are facing. They should both be generally the same color, but they're clearly different. Seems the mTBN matrix isn't transforming the tangent space normal into world (and normally view) space properly.
Well... I solved the problem. Turns out my normal mapping implementation was perfect; the problem was actually in my texture class. This is, of course, my first time writing an OpenGL rendering engine, and I did not realize that the unlock() function in my texture class saved ALL my textures as GL_SRGB_ALPHA, including normal maps. Only diffuse map textures should be GL_SRGB_ALPHA. Temporarily forcing all textures to load as GL_RGBA fixed the problem.
Can't believe I had this problem for 11 months, only to find it was something so small.
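For anyone hitting the same issue, the difference comes down to the internal format passed when uploading the texture data; a rough sketch (variable names are mine):
// Diffuse/albedo textures are authored in sRGB, so let the GPU linearize them on sampling:
glTexImage2D(GL_TEXTURE_2D, 0, GL_SRGB_ALPHA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
// Normal maps contain linear vector data, so no sRGB decode should be applied:
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);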
I am having problems calculating normals after tessellation.
Currently I have code which samples the height map and calculates the normal from it:
float HEIGHT = 2048.0f;
float WIDTH = 2048.0f;
float SCALE = displace_ratio;
vec2 uv = tex_coord_FS_in.xy;
vec2 du = vec2(1 / WIDTH, 0);
vec2 dv = vec2(0, 1 / HEIGHT);
float dhdu = SCALE/(2/WIDTH) * (texture(height_tex, uv+du).r - texture(height_tex, uv-du).r);
float dhdv = SCALE/(2/HEIGHT) * (texture(height_tex, uv+dv).r - texture(height_tex, uv-dv).r);
N = normalize(N+T*dhdu+B*dhdv);
But it doesn't look right with low tessellation levels.
How can I get rid of this?
The only way to get rid of this is to use a normal map in combination with the computed normals. The normals you see on the right are correct; they're just low-resolution, because that is how finely you tessellate. Use a normal map and per-pixel lighting to bring out the intricate details.
Also, one thing to consider is the topology of your initial mesh: more evenly spaced polygons result in more evenly spaced tessellation.
Additionally, instead of:
float dhdu = SCALE/(2/WIDTH) * (texture(height_tex, uv+du).r - texture(height_tex, uv-du).r);
float dhdv = SCALE/(2/HEIGHT) * (texture(height_tex, uv+dv).r - texture(height_tex, uv-dv).r);
you might want to sample a few more points from the heightmap and average them, to get a smoother version of the normal at each point.