OpenGL ImageStore slower when storing certain values

I am writing to some 3D textures that are defined as follows:
layout (binding = 0, rgba8) coherent uniform image3D volumeGeom[MIPLEVELS];
layout (binding = 4, rgba8) coherent uniform image3D volumeNormal[MIPLEVELS];
and I am writing the following values
float df = dot(normalize(In.WorldNormal), -gSpotLight.Direction);
if(df < 0) df = 0;
else df = 1;
fragmentColor = texture2D(gSampler, In.TexCoord0.xy);
imageStore(volumeGeom[In.GeomInstance], ivec3(coords1), fragmentColor);
imageStore(volumeNormal[In.GeomInstance], ivec3(coords1), vec4(normalize(In.WorldNormal), 1.0));
fragmentColor = vec4(fragmentColor.xyz*CalcShadowFactor(In.LightSpacePos)*df,1.0);
As you can see, I store the first fragmentColor, and I purposely compute the second fragmentColor after the stores even though I don't do anything with it. This configuration runs at 21 FPS. If I then do this:
float df = dot(normalize(In.WorldNormal), -gSpotLight.Direction);
if(df < 0) df = 0;
else df = 1;
fragmentColor = texture2D(gSampler, In.TexCoord0.xy);
fragmentColor = vec4(fragmentColor.xyz*CalcShadowFactor(In.LightSpacePos)*df,1.0);
imageStore(volumeGeom[In.GeomInstance], ivec3(coords1), fragmentColor);
imageStore(volumeNormal[In.GeomInstance], ivec3(coords1), vec4(normalize(In.WorldNormal), 1.0));
in which the second fragmentColor is computed first and stored in volumeGeom, the whole thing runs at 13 FPS. This means that imageStore runs slower depending on the values I write. Is that because of some compiler optimization?
Basically, the second fragmentColor is 0 for surfaces that are in shadow or face away from the light.

Not sure how complicated your CalcShadowFactor method is, but it is likely that in the first version the image stores can overlap with your calculation, while in the second your calculation has to finish before the stores can be issued. This, plus the increased register pressure, can already be enough to make version #2 much slower than version #1.
If this is the complete source code, it is also likely that in the #1 case the call to CalcShadowFactor is completely optimized out, as the result is never used (if fragmentColor is never read again and is not an output.)
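One quick way to test the dead-code-elimination theory (a sketch only; outColor is an assumed fragment output, not something from the original shader) is to force the shadowed fragmentColor to stay live in the fast variant and see whether the frame rate drops:
fragmentColor = texture2D(gSampler, In.TexCoord0.xy);
imageStore(volumeGeom[In.GeomInstance], ivec3(coords1), fragmentColor);
imageStore(volumeNormal[In.GeomInstance], ivec3(coords1), vec4(normalize(In.WorldNormal), 1.0));
fragmentColor = vec4(fragmentColor.xyz*CalcShadowFactor(In.LightSpacePos)*df, 1.0);
outColor = fragmentColor; // assumed 'out vec4 outColor;' so the compiler cannot discard the CalcShadowFactor call
If this still runs near 21 FPS, dead-code elimination is not the explanation and the overlap/register-pressure effect is the more likely cause.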

Related

Stuck trying to optimize complex GLSL fragment shader

So first off, let me say that while the code works perfectly well from a visual point of view, it runs into very steep performance issues that get progressively worse as you add more lights. In its current form it's good as a proof of concept, or a tech demo, but is otherwise unusable.
Long story short, I'm writing a RimWorld-style game with real-time top-down 2D lighting. The way I implemented rendering is with a 3 layered technique as follows:
First I render occlusions to a single-channel R8 occlusion texture mapped to a framebuffer. This part is lightning fast and doesn't slow down with more lights, so it's not part of the problem.
Then I invoke my lighting shader by drawing a huge rectangle over my lightmap texture mapped to another framebuffer. The light data is stored in an array in a UBO and it uses the occlusion mapping in its calculations. This is where the slowdown happens.
And lastly, the lightmap texture is multiplied and added to the regular world renderer; this also isn't affected by the number of lights, so it's not part of the problem.
The problem is thus in the lightmap shader. The first iteration had many branches which froze my graphics driver right away when I first tried it, but after removing most of them I get a solid 144 fps at 1440p with 3 lights, and ~58 fps at 1440p with 20 lights. An improvement, but it scales very poorly. The shader code is as follows, with additional annotations:
#version 460 core
// per-light data
struct Light
{
    vec4 location;
    vec4 rangeAndstartColor;
};
const int MaxLightsCount = 16; // I've also tried 8 and 32, there was no real difference
layout(std140) uniform ubo_lights
{
    Light lights[MaxLightsCount];
};
uniform sampler2D occlusionSampler; // the occlusion texture sampler
in vec2 fs_tex0;        // the uv position in the large rectangle
in vec2 fs_window_size; // the window size to transform world coords to view coords and back
out vec4 color;
void main()
{
    vec3 resultColor = vec3(0.0);
    const vec2 size = fs_window_size;
    const vec2 pos = (size - vec2(1.0)) * fs_tex0;
    // process every light individually and add the resulting colors together
    // this should be branchless, is there any way to check?
    for(int idx = 0; idx < MaxLightsCount; ++idx)
    {
        const float range = lights[idx].rangeAndstartColor.x;
        const vec2 lightPosition = lights[idx].location.xy;
        const float dist = length(lightPosition - pos); // distance from current fragment to current light
        // early abort, the next part is expensive
        // this branch HAS to be important, right? otherwise it will check crazy long lines against occlusions
        if(dist > range)
            continue;
        const vec3 startColor = lights[idx].rangeAndstartColor.yzw;
        // walk between pos and lightPosition to find occlusions
        // standard line DDA algorithm
        vec2 tempPos = pos;
        int lineSteps = int(ceil(abs(lightPosition.x - pos.x) > abs(lightPosition.y - pos.y) ? abs(lightPosition.x - pos.x) : abs(lightPosition.y - pos.y)));
        const vec2 lineInc = (lightPosition - pos) / lineSteps;
        // can I get rid of this loop somehow? I need to check each position between
        // my fragment and the light position for occlusions, and this is the best I
        // came up with
        float lightStrength = 1.0;
        while(lineSteps --> 0)
        {
            const vec2 nextPos = tempPos + lineInc;
            const vec2 occlusionSamplerUV = tempPos / size;
            lightStrength *= 1.0 - texture(occlusionSampler, vec2(occlusionSamplerUV.x, 1 - occlusionSamplerUV.y)).x;
            tempPos = nextPos;
        }
        // the contribution of this light to the fragment color is based on
        // its square distance from the light, and the occlusions between them
        // implemented as multiplications
        const float strength = max(0, range - dist) / range * lightStrength;
        resultColor += startColor * strength * strength;
    }
    color = vec4(resultColor, 1.0);
}
I call this shader as many times as I need, since the results are additive. It works with large batches of lights or one by one. Performance-wise, I didn't notice any real change trying different batch numbers, which is perhaps a bit odd.
So my question is: is there a better way to look up any (boolean) occlusions between my fragment position and the light position in the occlusion texture, without iterating through every pixel by hand? Could render buffers perhaps help here (from what I've read they're for reading data back to system memory, but I need it in another shader)?
And perhaps, is there a better algorithm for what I'm doing here?
I can think of a couple routes for optimization:
Exact: apply a distance transform on the occlusion map: this will give you the distance to the nearest occluder at each pixel. After that you can safely step by that distance within the loop, instead of doing baby steps. This will drastically reduce the number of steps in open regions.
There is a very simple CPU-side algorithm to compute a DT, and it may suit you if your occluders are static. If your scene changes every frame, however, you'll need to search the literature for GPU side algorithms, which seem to be more complicated.
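A sketch of this exact route against the question's shader (distanceSampler is a hypothetical R32F texture holding, per pixel, the distance in pixels to the nearest occluder, filled in an earlier pass; the step cap of 64 is also illustrative):
uniform sampler2D distanceSampler; // assumed: distance in pixels to the nearest occluder

float traceOcclusion(vec2 pos, vec2 lightPosition, vec2 size)
{
    float maxDist = length(lightPosition - pos);
    vec2 dir = (lightPosition - pos) / maxDist;
    float travelled = 0.0;
    for (int i = 0; i < 64; ++i)      // hard cap instead of one step per pixel
    {
        float d = texture(distanceSampler, (pos + dir * travelled) / size).x;
        if (d < 0.5)                  // standing on (or right next to) an occluder
            return 0.0;
        travelled += d;               // safe to jump by the stored distance
        if (travelled >= maxDist)
            return 1.0;               // reached the light unobstructed
    }
    return 1.0;
}
In open regions each jump covers many pixels, so the loop usually finishes in a handful of iterations instead of one iteration per pixel along the line.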
Inexact: resort to soft shadows -- it might be a compromise you are willing to make, and even seen as an artistic choice. If you are OK with that, you can create a mipmap from your occlusion map, and then progressively increase the step and sample lower levels as you go farther from the point you are shading.
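A sketch of that inexact route, reusing pos, lightPosition, dist, size and occlusionSampler from the question's shader (it assumes mips have been generated for the occlusion texture, e.g. with glGenerateMipmap, and the growth factors are illustrative):
// Replaces the inner DDA loop: take larger steps and sample coarser mip
// levels the farther we get from the shaded fragment (soft, approximate shadows).
float lightStrength = 1.0;
float travelled = 0.0;
float stepSize = 1.0;   // in pixels; grows as we move away
float lod = 0.0;        // mip level; grows with the step size
vec2 dir = (lightPosition - pos) / dist;
while (travelled < dist)
{
    vec2 uv = (pos + dir * travelled) / size;
    lightStrength *= 1.0 - textureLod(occlusionSampler, vec2(uv.x, 1.0 - uv.y), lod).x;
    travelled += stepSize;
    stepSize *= 1.5;    // illustrative growth factor
    lod += 0.5;
}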
You can go further and build an emitters map (into the same 4-channel map as the occlusion). Then your entire shading pass will be independent of the number of lights. This is an equivalent of voxel cone tracing GI applied to 2D.

OpenGL Shader Storage Buffer / memoryBarrierBuffer

I've created two SSBOs to handle some lights, because the VS-FS in/out interface can't handle a lot of lights (I'm using forward shading).
For the first one I only pass values to the shader (it's basically read-only) [cpp]:
struct GLightProperties
{
    unsigned int numLights;
    LightProperties properties[];
};
...
glp = (GLightProperties*)malloc(sizeof(GLightProperties) + sizeof(LightProperties) * lastSize);
...
glGenBuffers(1, &ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(GLightProperties) + sizeof(LightProperties) * lastSize, glp, GL_DYNAMIC_COPY);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
Shader file [GLSL]:
layout(std430, binding = 1) buffer Lights
{
    uint numLights;
    LightProperties properties[];
} lights;
So this first SSBO turns out to work fine. However, the other one, whose purpose is to serve as a VS-FS interface, has some issues:
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo2);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(float) * 4 * 3 * lastSize, nullptr, GL_DYNAMIC_COPY);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
GLSL:
struct TangentProperties
{
    vec4 TangentLightPos;
    vec4 TangentViewPos;
    vec4 TangentFragPos;
};
layout(std430, binding = 0) buffer TangentSpace
{
    TangentProperties tangentProperties[];
} tspace;
Here you'll notice I pass nullptr to glBufferData, because the VS will write into the buffer and the FS will read its contents.
Like so in the VS Stage:
for(int i = 0; i < lights.numLights; i++)
{
    tspace.tangentProperties[index].TangentLightPos.xyz = TBN * lights.properties[index].lightPosition.xyz;
    tspace.tangentProperties[index].TangentViewPos.xyz = TBN * camPos;
    tspace.tangentProperties[index].TangentFragPos.xyz = TBN * vec3(worldPosition);
    memoryBarrierBuffer();
}
After this the FS reads the values, which turn out to be just garbage. Am I doing something wrong with memory barriers?
OK, let's get the obvious bug out of the way:
for(int i = 0; i < lights.numLights; i++)
{
    tspace.tangentProperties[index].TangentLightPos.xyz = TBN * lights.properties[index].lightPosition.xyz;
    tspace.tangentProperties[index].TangentViewPos.xyz = TBN * camPos;
    tspace.tangentProperties[index].TangentFragPos.xyz = TBN * vec3(worldPosition);
    memoryBarrierBuffer();
}
index never changes in this loop, so you're only writing a single light, and you're only writing the last light's values. All other lights will have garbage/undefined values.
So you probably meant i rather than index.
But that's only the beginning of the problem. See, if you make that change, you get this:
for(int i = 0; i < lights.numLights; i++)
{
    tspace.tangentProperties[i].TangentLightPos.xyz = TBN * lights.properties[i].lightPosition.xyz;
    tspace.tangentProperties[i].TangentViewPos.xyz = TBN * camPos;
    tspace.tangentProperties[i].TangentFragPos.xyz = TBN * vec3(worldPosition);
}
memoryBarrierBuffer();
Note that the barrier is outside the loop.
That creates a new problem. This code will have every vertex shader invocation writing to the same memory buffer. SSBOs, after all, are not VS output variables. Output variables are stored as part of a vertex. The rasterizer then interpolates this vertex data across the primitive as it rasterizes it, which provides the input values for the FS. So one VS cannot stomp on the output variables of another VS.
That doesn't happen with SSBOs. Every VS is acting on the same SSBO memory. So if they write to the same indices of the same array, they're writing to the same memory address. Which is a race condition (since there can be no synchronization between sibling invocations) and therefore undefined behavior.
So, the only way what you're trying to do could possibly work is if your buffer has numLights entries for each vertex in the entire scene.
This is a fundamentally unreasonable amount of space. Even if you could get it down to just the number of vertices in a particular draw call (which is doable, but I'm not going to say how), you would still be behind in performance. Every FS invocation will have to read 144 bytes of data for each light (3 table entries, one per vertex of the triangle, each holding three vec4s: 3 × 3 × 16 bytes), linearly interpolate those values, and then do the lighting computations.
It would be faster for you to just pass the TBN matrix as a VS output and do the matrix multiplies in the FS. Yes, that's a lot of matrix multiplies, but GPUs are really fast at matrix multiplies, and are really slow at reading memory.
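A minimal sketch of that alternative, reusing TBN, camPos, worldPosition and the lights block from the question (the vs_ names are illustrative, and it assumes camPos is available as a uniform in the FS as well):
// Vertex shader: output the TBN matrix instead of writing tangent-space data to an SSBO
out mat3 vs_TBN;
out vec3 vs_worldPos;
// ...
vs_TBN = TBN;
vs_worldPos = vec3(worldPosition);

// Fragment shader: do the tangent-space transforms here, per light
in mat3 vs_TBN;
in vec3 vs_worldPos;
// ...
for (int i = 0; i < lights.numLights; i++)
{
    vec3 tangentLightPos = vs_TBN * lights.properties[i].lightPosition.xyz;
    vec3 tangentViewPos  = vs_TBN * camPos;
    vec3 tangentFragPos  = vs_TBN * vs_worldPos;
    // ... lighting for light i ...
}
This trades a few mat3 multiplies per fragment and per light for the per-light SSBO reads, which is exactly the trade described above.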
Also, reconsider whether you need the tangent-space fragment position. Generally speaking, you never do.

Pass array of floats to fragment shader via texture

I am trying to pass an array of floats (in my case an audio wave) to a fragment shader via texture. It works but I get some imperfections as if the value read from the 1px height texture wasn't reliable.
This happens with many combinations of bar widths and amounts.
I get the value from the texture with:
precision mediump float;
...
uniform sampler2D uDisp;
...
void main(){
...
float columnWidth = availableWidth / barsCount;
float barIndex = floor((coord.x-paddingH)/columnWidth);
float textureX = min( 1.0, (barIndex+1.0)/barsCount );
float barValue = texture2D(uDisp, vec2(textureX, 0.0)).r;
...
If, instead of the value from the texture, I use something else, the issue doesn't seem to be there.
barValue = barIndex*0.1;
Any idea what could be the issue? Is using a texture for this purpose a bad idea?
I am using Pixi.JS as WebGL framework, so I don't have access to low level APIs.
With a gradient texture for the data and many bars, the problem becomes pretty evident.
Update: Looks like the issue relates to the consistency of the value of textureX.
Trying different formulas like barIndex/(barsCount-1.0) results in less noise. Wrapping it in a min definitely adds more noise.
It turned out the issue wasn't in reading the values from the texture, but in the drawing. Instead of using ifs I switched to step() and the problem went away.
vec2 topLeft = vec2(
    paddingH + (barIndex*columnWidth) + ((columnWidth-barWidthInPixels)*0.5),
    top
);
vec2 bottomRight = vec2(
    topLeft.x + barWidthInPixels,
    bottom
);
vec2 tl = step(topLeft, coord);
vec2 br = 1.0 - step(bottomRight, coord);
float blend = tl.x * tl.y * br.x * br.y;
I guess comparisons of floats through IFs are not very reliable in shaders.
Generally mediump is insufficient for texture coordinates for any non-trivial texture, so where possible use highp. This isn't always available on some older GPUs, so depending on the platform this may not solve your problem.
If you know you are doing 1:1 mapping then also use GL_NEAREST rather than GL_LINEAR, as the quantization effect will more likely hide some of the precision side-effects.
Given you probably know the number of columns and bars you can probably pre-compute some of the values on the CPU (e.g. precompute 1/columns and pass that as a uniform) at fp32 precision. Passing in small values between 0 and 1 is always much better at preserving floating point accuracy, rather than passing in big values and then dividing out.
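A sketch of that suggestion applied to the question's shader (uInvBarsCount is a hypothetical uniform set to 1.0 / barsCount on the CPU; sampling at the texel centre is an extra precaution, not something stated above):
precision highp float;       // prefer highp for texture coordinates where supported
uniform sampler2D uDisp;
uniform float uInvBarsCount; // hypothetical uniform: 1.0 / barsCount, computed on the CPU
// ...
float barIndex = floor((coord.x - paddingH) / columnWidth);
// sample at the texel centre of bar barIndex instead of at a texel edge
float textureX = (barIndex + 0.5) * uInvBarsCount;
float barValue = texture2D(uDisp, vec2(textureX, 0.5)).r;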

Access GLSL Uniforms after shaders have used them

I've noticed that my shaders are performing a calculation that I need in the CPU code. Is it possible for me to load the results of that calculation into a uniform array, and then access that uniform from the CPU once the GPU has finished working?
You can write arbitrary amounts of data through either Image Load/Store or SSBOs. While the number of image variables is restricted in image load/store, those variables can refer to buffer textures or array textures, either of which gives you access to a more-or-less arbitrarily large amount of data to write to:
layout(rgba32f) writeonly uniform imageBuffer outputBuffer; // 'writeonly' is a memory qualifier, not a layout qualifier
imageStore(outputBuffer, valueOffset1, value1);
imageStore(outputBuffer, valueOffset2, value2);
imageStore(outputBuffer, valueOffset3, value3);
imageStore(outputBuffer, valueOffset4, value4);
SSBOs make this even easier:
layout(std430) buffer Data
{
    float giantArray[];
};
giantArray[valueOffset1] = data1;
giantArray[valueOffset2] = data2;
giantArray[valueOffset3] = data3;
giantArray[valueOffset4] = data4;
However, note that any such writes will be unordered with regard to writes from other shader invocations. So overwriting such data will be... problematic. And you'll need an appropriate glMemoryBarrier call before you try to read from it.
But if all you're doing is a compute operation, you ought to be using dedicated compute shaders.
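A minimal compute-shader sketch of that suggestion (the binding and the placeholder calculation are illustrative, not from the question): write results straight into an SSBO that the CPU can map after an appropriate barrier:
#version 430
layout(local_size_x = 64) in;

layout(std430, binding = 0) buffer Results
{
    float results[];
};

void main()
{
    uint i = gl_GlobalInvocationID.x;
    // placeholder calculation; the real work goes here
    results[i] = float(i) * 2.0;
}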
As far as I know, there is no way of retrieving uniform data from your GPU. But you could execute the calculation and set the output color to something you can identify on screen, depending on the expected result of your calculation. For example:
#version 330 core
layout(location = 0) out vec4 color;
void main() {
    if( Something you're trying to debug )
        color = vec4(1, 1, 1, 1);
    else
        color = vec4(0, 0, 0, 1);
}
That's the only way I know of, and I use it all the time.

How to extend vertex shader capabilities for GPGPU

I'm trying to implement a Scrypt hasher (for an LTC miner) in GLSL (don't ask me why).
And, actually, I'm stuck with the HMAC SHA-256 algorithm. Although I've implemented SHA-256 correctly (it returns the correct hash for an input), the fragment shader stops compiling when I add the last step (hashing the previous hash concatenated with oKey).
The shader can't do more than three rounds of SHA-256. It just stops compiling. What are the limits? It doesn't use much memory, 174 vec2 objects in total. It doesn't seem related to memory, because an extra SHA-256 round doesn't require new memory. And it doesn't seem related to viewport size: it stops working on both 1x1 and 1x128 viewports.
I started the miner on WebGL, but after the limit appeared, I tried to run the same shader in Qt on full-featured OpenGL. As a result, desktop OpenGL allows one SHA-256 round fewer than OpenGL ES in WebGL (why?).
Forgot to mention: the shader fails at the linking stage. The shader itself compiles fine, but program linking fails.
I don't use any textures, extensions, or other slow things. Just a simple square (4 vec2 vertices) and several uniforms for the fragment shader.
The input data is just 80 bytes, and the result of the fragment shader is binary (black or white), so the task ideally fits GLSL principles.
My video card is a Radeon HD 7970 with plenty of VRAM, which could fit hundreds of scrypt threads (scrypt uses 128 kB per hash, but I can't even get plain HMAC-SHA-256 working). My card supports OpenGL 4.4.
I'm a newbie in OpenGL and may be misunderstanding something. I understand that the fragment shader runs for each pixel separately, but if I have a 1x128 viewport, only 128x348 bytes are used. Where is the limit of the fragment shader?
Here is the common code I use, to give you an idea of how I'm trying to solve the problem.
uniform vec2 base_nonce[2];
uniform vec2 header[20]; /* Header of the block */
uniform vec2 H[8];
uniform vec2 K[64];

void sha256_round(inout vec2 w[64], inout vec2 t[8], inout vec2 hash[8]) {
    for (int i = 0; i < 64; i++) {
        if( i > 15 ) {
            w[i] = blend(w[i-16], w[i-15], w[i-7], w[i-2]);
        }
        _s0 = e0(t[0]);
        _maj = maj(t[0],t[1],t[2]);
        _t2 = safe_add(_s0, _maj);
        _s1 = e1(t[4]);
        _ch = ch(t[4], t[5], t[6]);
        _t1 = safe_add(safe_add(safe_add(safe_add(t[7], _s1), _ch), K[i]), w[i]);
        t[7] = t[6]; t[6] = t[5]; t[5] = t[4];
        t[4] = safe_add(t[3], _t1);
        t[3] = t[2]; t[2] = t[1]; t[1] = t[0];
        t[0] = safe_add(_t1, _t2);
    }
    for (int i = 0; i < 8; i++) {
        hash[i] = safe_add(t[i], hash[i]);
        t[i] = hash[i];
    }
}

void main () {
    vec2 key_hash[8]; /* Our SHA-256 hash */
    vec2 i_key[16];
    vec2 i_key_hash[8];
    vec2 o_key[16];
    vec2 nonced_header[20]; /* Header with nonce */
    set_nonce_to_header(nonced_header);
    vec2 P[32]; /* Padded SHA-256 message */
    pad_the_header(P, nonced_header);
    /* Hash HMAC secret key */
    sha256(P, key_hash);
    /* Make iKey and oKey */
    for(int i = 0; i < 16; i++) {
        if (i < 8) {
            i_key[i] = xor(key_hash[i], vec2(0x3636, 0x3636));
            o_key[i] = xor(key_hash[i], vec2(0x5c5c, 0x5c5c));
        } else {
            i_key[i] = vec2(0x3636, 0x3636);
            o_key[i] = vec2(0x5c5c, 0x5c5c);
        }
    }
    /* SHA256 hash of iKey */
    for (int i = 0; i < 8; i++) {
        i_key_hash[i] = H[i];
        t[i] = i_key_hash[i];
    }
    for (int i = 0; i < 16; i++) { w[i] = i_key[i]; }
    sha256_round(w, t, i_key_hash);
    gl_FragColor = toRGBA(i_key_hash[0]);
}
What can I do to improve the situation? Is there something useful in OpenGL 4.4 or OpenGL ES 3.1? Is it even possible to do such calculations and keep so much data (128 kB) in a fragment shader? What are the limits for the vertex shader? Can I do the same in the vertex shader instead of the fragment shader?
I'll try to answer my own question.
A shader is a small processor with limited registers and cache memory. There is also a limit on how many instructions it can execute. So the whole approach of fitting everything into one fragment shader is wrong.
On the other hand, you can switch shader programs tens or hundreds of times during a render; that is normal practice.
So it is necessary to divide the big computation into smaller parts and render them separately, using render-to-texture to save your work between passes.
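As a sketch of that multi-pass idea (prevState, stateSize and the one-round-per-pass split are illustrative, not from the original shader): each pass reads the intermediate state that the previous pass rendered to a texture, does one small piece of the work, and writes the next state:
uniform sampler2D prevState; // output of the previous pass, rendered to a texture
uniform vec2 stateSize;      // dimensions of the state texture

void main() {
    vec2 uv = gl_FragCoord.xy / stateSize;
    vec4 state = texture2D(prevState, uv); // load the intermediate result
    // ... do one SHA-256 round (or any other small step) on 'state' here ...
    gl_FragColor = state;                  // render-to-texture saves it for the next pass
}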
According to WebGL statistics, 96.5% of clients have MAX_TEXTURE_SIZE equal to 4096. That gives you 32 megabytes of memory, which can hold the scratch data for 256 threads of scrypt computation.