Comparing two states in the shader - OpenGL

How can I compare, inside the shader, two states from different frames by means of a global variable? I need to compare the mouse position between two frames and, if it hasn't changed, do {bla bla bla}.
For example:
vec2 focusNew = vec2(0.0);
float x;
float y;
void main()
{
vec2 focus = vec2(x, y - 1.0);
if (length(focusNew - focus) <= 0.00001) // i.e. focusNew == focus
{ /* bla bla bla */ }
focusNew = focus;
}
But focusNew doesn't save the current state between frames.

You can't. Or at least not that way. Remember: shaders are run thousands of times per frame.
I could explain how you might actually do that, but it isn't what you really need. The mouse state changes from frame to frame, but that all happens on the CPU, and it happens once per frame, not once per shader invocation. Every invocation would therefore compute the same value.
So there's no point in making the shader do it. Just do the condition on the CPU, then provide a uniform that tells the shader(s) whether or not to do the {bla bla bla}.
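A minimal sketch of what that looks like on the shader side (the names u_mouseUnchanged and fragColor are illustrative; the application compares the stored and current mouse positions once per frame and uploads the flag, e.g. with glUniform1i, before drawing):
#version 330 core
uniform bool u_mouseUnchanged; // set by the application once per frame
out vec4 fragColor;
void main()
{
    vec4 result = vec4(0.0);
    if (u_mouseUnchanged)
    {
        result = vec4(1.0); // bla bla bla
    }
    fragColor = result;
}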

Related

Stuck trying to optimize complex GLSL fragment shader

So first off, let me say that while the code works perfectly well from a visual point of view, it runs into very steep performance issues that get progressively worse as you add more lights. In its current form it's good as a proof of concept, or a tech demo, but is otherwise unusable.
Long story short, I'm writing a RimWorld-style game with real-time top-down 2D lighting. The way I implemented rendering is with a three-layered technique, as follows:
First I render occlusions to a single-channel R8 occlusion texture mapped to a framebuffer. This part is lightning fast and doesn't slow down with more lights, so it's not part of the problem.
Then I invoke my lighting shader by drawing a huge rectangle over my lightmap texture mapped to another framebuffer. The light data is stored in an array in a UBO, and the shader uses the occlusion map in its calculations. This is where the slowdown happens.
And lastly, the lightmap texture is multiplied and added to the regular world renderer; this also isn't affected by the number of lights, so it's not part of the problem.
The problem is thus in the lightmap shader. The first iteration had many branches which froze my graphics driver right away when I first tried it, but after removing most of them I get a solid 144 fps at 1440p with 3 lights, and ~58 fps at 1440p with 20 lights. An improvement, but it scales very poorly. The shader code is as follows, with additional annotations:
#version 460 core
// per-light data
struct Light
{
vec4 location;
vec4 rangeAndstartColor;
};
const int MaxLightsCount = 16; // I've also tried 8 and 32, there was no real difference
layout(std140) uniform ubo_lights
{
Light lights[MaxLightsCount];
};
uniform sampler2D occlusionSampler; // the occlusion texture sampler
in vec2 fs_tex0; // the uv position in the large rectangle
in vec2 fs_window_size; // the window size to transform world coords to view coords and back
out vec4 color;
void main()
{
vec3 resultColor = vec3(0.0);
const vec2 size = fs_window_size;
const vec2 pos = (size - vec2(1.0)) * fs_tex0;
// process every light individually and add the resulting colors together
// this should be branchless, is there any way to check?
for(int idx = 0; idx < MaxLightsCount; ++idx)
{
const float range = lights[idx].rangeAndstartColor.x;
const vec2 lightPosition = lights[idx].location.xy;
const float dist = length(lightPosition - pos); // distance from current fragment to current light
// early abort, the next part is expensive
// this branch HAS to be important, right? otherwise it will check crazy long lines against occlusions
if(dist > range)
continue;
const vec3 startColor = lights[idx].rangeAndstartColor.yzw;
// walk between pos and lightPosition to find occlusions
// standard line DDA algorithm
vec2 tempPos = pos;
int lineSteps = int(ceil(max(abs(lightPosition.x - pos.x), abs(lightPosition.y - pos.y))));
const vec2 lineInc = (lightPosition - pos) / lineSteps;
// can I get rid of this loop somehow? I need to check each position between
// my fragment and the light position for occlusions, and this is the best I
// came up with
float lightStrength = 1.0;
while(lineSteps --> 0)
{
const vec2 nextPos = tempPos + lineInc;
const vec2 occlusionSamplerUV = tempPos / size;
lightStrength *= 1.0 - texture(occlusionSampler, vec2(occlusionSamplerUV.x, 1 - occlusionSamplerUV.y)).x;
tempPos = nextPos;
}
// the contribution of this light to the fragment color is based on
// its square distance from the light, and the occlusions between them
// implemented as multiplications
const float strength = max(0, range - dist) / range * lightStrength;
resultColor += startColor * strength * strength;
}
color = vec4(resultColor, 1.0);
}
I call this shader as many times as I need, since the results are additive. It works with large batches of lights or one by one. Performance-wise, I didn't notice any real change trying different batch numbers, which is perhaps a bit odd.
So my question is: is there a better way to look up any (boolean) occlusions between my fragment position and the light position in the occlusion texture, without iterating through every pixel by hand? Could renderbuffers perhaps help here (from what I've read they're for reading data back to system memory, but I need the data in another shader)?
And perhaps, is there a better algorithm for what I'm doing here?
I can think of a couple routes for optimization:
Exact: apply a distance transform to the occlusion map: this gives you, at each pixel, the distance to the nearest occluder. After that you can safely step by that distance within the loop instead of taking one-pixel steps, which drastically reduces the number of steps in open regions (see the sketch at the end of this answer).
There is a very simple CPU-side algorithm to compute a DT, and it may suit you if your occluders are static. If your scene changes every frame, however, you'll need to search the literature for GPU-side algorithms, which seem to be more complicated.
Inexact: resort to soft shadows -- it might be a compromise you are willing to make, and it could even be seen as an artistic choice. If you are OK with that, you can create a mipmap from your occlusion map, and then progressively increase the step and sample lower mip levels as you go farther from the point you are shading.
You can go further and build an emitters map (into the same 4-channel map as the occlusion). Then your entire shading pass will be independent of the number of lights. This is the 2D equivalent of voxel cone tracing GI.
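As a rough sketch of the exact route, here is what the marching might look like with a hypothetical distanceSampler texture that stores, per pixel, the distance (in pixels) to the nearest occluder. Note this gives hard boolean visibility rather than the multiplicative softening in the original loop:
uniform sampler2D distanceSampler; // hypothetical: per-pixel distance to the nearest occluder, in pixels
float occlusionBetween(vec2 pos, vec2 lightPosition, vec2 size)
{
    const int maxSteps = 64;                      // safety cap, rarely reached in open regions
    vec2  dir       = normalize(lightPosition - pos);
    float total     = length(lightPosition - pos);
    float travelled = 0.0;
    for (int i = 0; i < maxSteps && travelled < total; ++i)
    {
        vec2  uv       = (pos + dir * travelled) / size;
        float freeDist = texture(distanceSampler, vec2(uv.x, 1.0 - uv.y)).x;
        if (freeDist < 1.0)
            return 0.0;                           // an occluder blocks the ray
        travelled += freeDist;                    // safe to jump ahead this far
    }
    return 1.0;                                   // reached the light unobstructed
}
With something like this, the per-pixel DDA loop in the original shader collapses to a single call per light, e.g. lightStrength = occlusionBetween(pos, lightPosition, size); (hard shadows; combine it with the mipmap idea above if you want soft ones).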

GLSL artefacts when ray marching

In the following Shadertoy I illustrate an artefact that occurs when ray marching:
https://www.shadertoy.com/view/stdGDl
This is my "scene" (see code fragment below). It renders a primitive "tunnel_fragment" which is an SDF (Signed Distance Function), and uses modulo on the coordinates to calculate "infinite" repetitions of these fragments. It then also calculates which disk we are in (odd/even) to displace them.
I really don't understand why the disks (or rings; see tunnel_fragment -- if you remove a comment they become rings instead of disks) present these artefacts when the alternating movement in the x direction becomes large.
These artefacts don't appear when the disk structure moves to the right as a whole; they only appear when the disks alternate and the entire structure becomes more complex.
What am I doing wrong? It's really boggling me.
vec2 scene(in vec3 p)
{
float thick = 0.1;
vec3 cp = p;
// Use modulo to simulate inf disks
vec3 c = vec3(0,0,6.0*thick);
vec3 q = mod(cp+0.5*c,c)-0.5*c;
// Find index of the disk
vec3 disk = (cp+0.5*c) / (c);
float idx = floor(disk.z);
// Do something simple with odd/even disks
// Note: changing this shows the artefacts are always there
if(mod(idx,2.0) == 0.0) {
q.x += sin(disk.z*t)*t*t;
} else {
q.x -= sin(disk.z*t)*t*t;
}
float d = tunnel_fragment(q, vec3(0.0), vec3(0.0, 0.0, 1.0), 2.0, thick, 0.2);
return vec2(d, idx);
}
The problem is this:
When the current disk (based on the modulo) is offset by more than the spacing between the disks, the distance that you calculate is larger than the distance to the next disk. Consequently you risk overstepping the next disk.
To solve this you need to either limit the offset (as said, to no more than the spacing between the disks), or sample odd and even disks separately and min() between them.
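A minimal sketch of the first option, assuming that clamping to the spacing c.z is an acceptable bound for your scene; it would replace the if/else block in scene():
// Clamp the alternating x offset so the distance estimate inside one
// modulo cell can never exceed the spacing to the neighbouring disk.
float offset = clamp(sin(disk.z*t)*t*t, -c.z, c.z);
if (mod(idx, 2.0) == 0.0) {
    q.x += offset;
} else {
    q.x -= offset;
}
The second option (evaluating odd and even disks as two separate repeating SDFs and taking the min of the two distances) avoids restricting the motion, at the cost of evaluating tunnel_fragment twice per step.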

GLSL while loop performance is independent from work done inside of it

I'm currently trying to implement a path tracer inside a fragment shader, which leverages a very simple BVH.
The code for the BVH intersection is based on the following idea:
bool BVHintersects( Ray ray ) {
Object closestObject;
vec2 toVisit[100]; // using a stack to keep track of which node should be tested against the current ray
int stackPointer = 1;
toVisit[0] = vec2(0.0, 0.0); // coordinates of the root node in the BVH hierarchy
while(stackPointer > 0) {
stackPointer--; // pop the BVH node to examine
if(!leaf) {
// examine the BVH node and, if needed, update the stackPointer and toVisit
}
if(leaf) {
// examine the leaf and, if needed, update the closestObject entry
}
}
}
The problem with the above code is that on the second light bounce something very strange starts to happen. I'm calculating light bounces this way:
vec3 color = vec3(0.0);
vec3 normal = vec3(0.0);
// first light bounce
bool intersects = BVHintersect(ro, rd, color, normal);
vec3 lightPos = vec3(5, 15, 0);
// updating ray origin & direction
ro = ro + rd * (t - 0.01);
rd = normalize(lightPos - ro);
// second light bounce used only to calculate shadows
bool shadowIntersects = BVHintersect(ro, rd, color, normal);
The second call to BVHintersect appears to run indefinitely, as if the while loop never exits. But from many tests I've done on that second call, I'm sure the stackPointer eventually goes back to 0 successfully. In fact, if I add a counter around the while loop like this:
int iterationsMade = 0;
while(stackPointer > 0) {
iterationsMade++;
if(iterationsMade > 100) {
break;
}
// the rest of the loop
// after the function ends it also returns "iterationsMade"
the variable "iterationsMade" is always under 100, so the while loop doesn't run infinitely. But performance-wise it's as if I did 100 iterations, even when "iterationsMade" is never bigger than, say, 10 or 20. Increasing the hardcoded 100 to a bigger value degrades performance linearly.
What could be causing this behaviour? Why would that second call to BVHintersect seem to get stuck inside the while loop if it never does more than 10-20 iterations?
Source for the BVHintersect function:
https://pastebin.com/60SYRQAZ
So, there's a funny thing about loops in shaders (or most SIMD circumstances):
The entire wave will take at least as long to execute as the slowest thread. So, if one thread needs to take ~100 iterations, then they ALL take 100 iterations. Depending on your platform and compiler, the loop may be unrolled to 100 iterations (or whatever upper bound you choose). Anything after the break won't affect the final output, but the rest of the unrolled loop will still have to be processed. Early-out isn't always possible.
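As a small illustration (not from the original post), even a loop where most invocations break out early can still cost every lane in the wave the full unrolled length:
const int MAX_ITERS = 100;           // the hardcoded upper bound
float work(int neededIters)          // how many iterations this pixel actually needs
{
    float acc = 0.0;
    for (int i = 0; i < MAX_ITERS; ++i)
    {
        if (i >= neededIters)
            break;                   // this lane is finished, but it still waits for the others
        acc += 1.0 / float(i + 1);   // stand-in for the real per-iteration work
    }
    return acc;
}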
There are a number of ways around this, but perhaps the most straightforward is to do this in multiple passes with a lower max iterations value.
I would also run your shader through a compiler and look at the generated code. Compare different versions with different max iterations and look at things like the length of compiled shader.
See this answer for a little more information.

cocos2d, Splitting an image into separate R, G, B channels?

I want to create an effect where, after my character gets killed, the red, green and blue color channels of the character's sprite separate off in different directions.
Something similar to this: http://active.tutsplus.com/tutorials/effects/create-a-retro-crt-distortion-effect-using-rgb-shifting/
How would I go about doing this?
You could just add different offsets when looking up the individual colors in the fragment shader. To make this efficient you should probably render to an intermediate buffer first.
Here is an example of how to do it:
vec4 mainOld( vec2 offset ) {
... (gl_FragCoord.xy + offset) ...
}
void main( void ) {
vec4 foo;
foo.r = mainOld(vec2(-3.0, 0.0)).r;
foo.g = mainOld(vec2(0.0, 5.0)).g;
foo.b = mainOld(vec2(0.0, 0.0)).b;
foo.a = mainOld(vec2(0.0, 0.0)).a;
gl_FragColor = foo;
}
Basically the original shader is now called three times, so it's a bit inefficient; that's why I suggested a buffer, but that may be premature optimization.
You can look at the result of the above code in an actual shader here:
http://glsl.heroku.com/e#7971.0 (not sure how persistent these links are, sorry)
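For cocos2d specifically, a fragment shader along these lines should work once the sprite has been rendered to an intermediate texture. This is only a sketch: the uniform and varying names are illustrative, and the offsets are assumed to be in texture coordinates:
precision mediump float;
uniform sampler2D u_texture;      // the sprite, rendered to an intermediate texture
uniform vec2 u_redOffset;         // e.g. vec2(-3.0, 0.0) / spriteSizeInPixels
uniform vec2 u_greenOffset;       // e.g. vec2( 0.0, 5.0) / spriteSizeInPixels
varying vec2 v_texCoord;
void main(void)
{
    float r = texture2D(u_texture, v_texCoord + u_redOffset).r;
    float g = texture2D(u_texture, v_texCoord + u_greenOffset).g;
    vec4  ba = texture2D(u_texture, v_texCoord);   // blue and alpha stay in place
    gl_FragColor = vec4(r, g, ba.b, ba.a);
}
Animate the offset uniforms over time on the CPU to make the channels drift apart after the character dies.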

GLSL SpinLock only Mostly Works

I have implemented a depth peeling algorithm using a GLSL spinlock (inspired by this). In the following visualization, notice how overall the depth peeling algorithm functions correctly (first layer top left, second layer top right, third layer bottom left, fourth layer bottom right). The four depth layers are stored into a single RGBA texture.
Unfortunately, the spinlock sometimes fails to prevent errors--you can see little white speckles, particularly in the fourth layer. There's also one on the wing of the spaceship in the second layer. These speckles vary each frame.
In my GLSL spinlock, when a fragment is to be drawn, the fragment program atomically reads and writes a locking value in a separate locking texture, waiting until a 0 shows up, indicating that the lock is open. In practice, I found that the whole locking logic has to live inside the loop, because if two threads in the same warp land on the same pixel the warp cannot continue otherwise (one must wait while the other continues, yet all threads in a GPU warp must execute in lockstep).
My fragment program looks like this (comments and spacing added):
#version 420 core
//locking texture
layout(r32ui) coherent uniform uimage2D img2D_0;
//data texture, also render target
layout(rgba32f) coherent uniform image2D img2D_1;
//Inserts "new_data" into "data", a sorted list
vec4 insert(vec4 data, float new_data) {
if (new_data<data.x) return vec4( new_data,data.xyz);
else if (new_data<data.y) return vec4(data.x,new_data,data.yz);
else if (new_data<data.z) return vec4(data.xy,new_data,data.z);
else if (new_data<data.w) return vec4(data.xyz,new_data );
else return data;
}
void main() {
ivec2 coord = ivec2(gl_FragCoord.xy);
//The idea here is to keep looping over a pixel until a value is written.
//By looping over the entire logic, threads in the same warp aren't stalled
//by other waiting threads. The first imageAtomicExchange call sets the
//locking value to 1. If the locking value was already 1, then someone
//else has the lock, and can_write is false. If the locking value was 0,
//then the lock is free, and can_write is true. The depth is then read,
//the new value inserted, but only written if can_write is true (the
//locking texture was free). The second imageAtomicExchange call resets
//the lock back to 0.
bool have_written = false;
while (!have_written) {
bool can_write = (imageAtomicExchange(img2D_0,coord,1u) != 1u);
memoryBarrier();
vec4 depths = imageLoad(img2D_1,coord);
depths = insert(depths,gl_FragCoord.z);
if (can_write) {
imageStore(img2D_1,coord,depths);
have_written = true;
}
memoryBarrier();
imageAtomicExchange(img2D_0,coord,0);
memoryBarrier();
}
discard; //Already wrote to render target with imageStore
}
My question is: why does this speckling behavior occur? I want the spinlock to work 100% of the time! Could it relate to my placement of memoryBarrier()?
For reference, here is locking code that has been tested to work on Nvidia drivers 314.22 and 320.18 on a GTX 670. Note that existing compiler optimization bugs are triggered if the code is reordered or rewritten into logically equivalent code (see the comments below). Note that in the code below I use bindless image references.
// sem is initialized to zero
coherent uniform layout(size1x32) uimage2D sem;
void main(void)
{
ivec2 coord = ivec2(gl_FragCoord.xy);
bool done = false;
uint locked = 0;
while(!done)
{
// locked = imageAtomicCompSwap(sem, coord, 0u, 1u); will NOT work
locked = imageAtomicExchange(sem, coord, 1u);
if (locked == 0)
{
performYourCriticalSection();
memoryBarrier();
imageAtomicExchange(sem, coord, 0u);
// replacing this with a break will NOT work
done = true;
}
}
discard;
}
The "imageAtomicExchange(img2D_0,coord,0);" needs to be inside the if statement, since it resets the lock variable even for threads that didn't hold it! Changing this fixes it.
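In other words, the loop from the question would become something like the following sketch, using the same uniforms and the insert() helper; only the thread that actually acquired the lock touches the data texture and releases the lock:
bool have_written = false;
while (!have_written) {
    // the previous lock value: 0u means this thread just acquired the lock
    bool can_write = (imageAtomicExchange(img2D_0, coord, 1u) != 1u);
    if (can_write) {
        vec4 depths = imageLoad(img2D_1, coord);
        depths = insert(depths, gl_FragCoord.z);
        imageStore(img2D_1, coord, depths);
        memoryBarrier();
        imageAtomicExchange(img2D_0, coord, 0u);  // release: only the lock owner resets it
        have_written = true;
    }
}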