Blur on Windows Phone 8 too slow - c++

I'm implementing a blur effect on Windows Phone using native C++ with DirectX, but it looks like even the simplest blur with a small kernel causes a visible FPS drop.
float4 main(PixelShaderInput input) : SV_TARGET
{
    float4 source = screen.Sample(LinearSampler, input.texcoord);
    float4 sum = float4(0, 0, 0, 0);
    float2 sizeFactor = float2(0.00117, 0.00208);
    for (int x = -2; x <= 2; x++)
    {
        float2 offset = float2(x, 0) * sizeFactor;
        sum += screen.Sample(LinearSampler, input.texcoord + offset);
    }
    return ((sum / 5) + source);
}
I'm currently using this pixel shader for a 1D blur, and it's visibly slower than rendering without the blur. Is WP8 phone hardware really that slow, or am I making some mistake? If so, could you point me to where the error might be?
Thank you.

Phones often don't have the best fill-rate, and blur is one of the worst things you can do if you're fill-rate bound. Using some numbers from gfxbench.com's Fill test, a typical phone fill rate is around 600MTex/s. With some rough math:
(600M texels/s) / (1280*720 texels/op) / (60 frames/s) ~= 11 ops/frame
So if your surface is the entire screen, your shader is doing six reads (the explicit centre sample plus the five loop taps) and one write, which is roughly 7 of your 11 ops used just for the blur. So I would say a framerate drop is expected. One way around this is to dynamically lower your resolution and do a single linear upscale - you'll get a different kind of natural blur from the linear interpolation, which might be passable depending on the visual effect you're going for.
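For example, if the blur were rendered into a (hypothetical) 640x360 target and then linearly upscaled, the same rough math gives:
(600M texels/s) / (640*360 texels/op) / (60 frames/s) ~= 43 ops/frame
so the blur's six reads and one write would cost well under two full-screen ops' worth of fill.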

Related

Stuck trying to optimize complex GLSL fragment shader

So first off, let me say that while the code works perfectly well from a visual point of view, it runs into very steep performance issues that get progressively worse as you add more lights. In its current form it's good as a proof of concept, or a tech demo, but is otherwise unusable.
Long story short, I'm writing a RimWorld-style game with real-time top-down 2D lighting. The way I implemented rendering is with a three-layered technique, as follows:
First I render occlusions to a single-channel R8 occlusion texture mapped to a framebuffer. This part is lightning fast and doesn't slow down with more lights, so it's not part of the problem:
Then I invoke my lighting shader by drawing a huge rectangle over my lightmap texture, mapped to another framebuffer. The light data is stored in an array in a UBO, and the shader uses the occlusion map in its calculations. This is where the slowdown happens:
And lastly, the lightmap texture is multiplied and added to the regular world renderer, this also isn't affected by the number of lights, so it's not part of the problem:
The problem is thus in the lightmap shader. The first iteration had many branches which froze my graphics driver right away when I first tried it, but after removing most of them I get a solid 144 fps at 1440p with 3 lights, and ~58 fps at 1440p with 20 lights. An improvement, but it scales very poorly. The shader code is as follows, with additional annotations:
#version 460 core

// per-light data
struct Light
{
    vec4 location;
    vec4 rangeAndstartColor;
};

const int MaxLightsCount = 16; // I've also tried 8 and 32, there was no real difference

layout(std140) uniform ubo_lights
{
    Light lights[MaxLightsCount];
};

uniform sampler2D occlusionSampler; // the occlusion texture sampler

in vec2 fs_tex0;        // the uv position in the large rectangle
in vec2 fs_window_size; // the window size to transform world coords to view coords and back

out vec4 color;

void main()
{
    vec3 resultColor = vec3(0.0);
    const vec2 size = fs_window_size;
    const vec2 pos = (size - vec2(1.0)) * fs_tex0;

    // process every light individually and add the resulting colors together
    // this should be branchless, is there any way to check?
    for(int idx = 0; idx < MaxLightsCount; ++idx)
    {
        const float range = lights[idx].rangeAndstartColor.x;
        const vec2 lightPosition = lights[idx].location.xy;
        const float dist = length(lightPosition - pos); // distance from current fragment to current light

        // early abort, the next part is expensive
        // this branch HAS to be important, right? otherwise it will check crazy long lines against occlusions
        if(dist > range)
            continue;

        const vec3 startColor = lights[idx].rangeAndstartColor.yzw;

        // walk between pos and lightPosition to find occlusions
        // standard line DDA algorithm
        vec2 tempPos = pos;
        int lineSteps = int(ceil(abs(lightPosition.x - pos.x) > abs(lightPosition.y - pos.y) ? abs(lightPosition.x - pos.x) : abs(lightPosition.y - pos.y)));
        const vec2 lineInc = (lightPosition - pos) / lineSteps;

        // can I get rid of this loop somehow? I need to check each position between
        // my fragment and the light position for occlusions, and this is the best I
        // came up with
        float lightStrength = 1.0;
        while(lineSteps --> 0)
        {
            const vec2 nextPos = tempPos + lineInc;
            const vec2 occlusionSamplerUV = tempPos / size;
            lightStrength *= 1.0 - texture(occlusionSampler, vec2(occlusionSamplerUV.x, 1 - occlusionSamplerUV.y)).x;
            tempPos = nextPos;
        }

        // the contribution of this light to the fragment color is based on
        // its square distance from the light, and the occlusions between them
        // implemented as multiplications
        const float strength = max(0, range - dist) / range * lightStrength;
        resultColor += startColor * strength * strength;
    }

    color = vec4(resultColor, 1.0);
}
I call this shader as many times as I need, since the results are additive. It works with large batches of lights or one light at a time. Performance-wise, I didn't notice any real change when trying different batch sizes, which is perhaps a bit odd.
So my question is: is there a better way to look up any (boolean) occlusions between my fragment position and the light position in the occlusion texture, without iterating through every pixel by hand? Could renderbuffers perhaps help here? (From what I've read they're for reading data back to system memory, but I need the data in another shader.)
And perhaps, is there a better algorithm for what I'm doing here?
I can think of a couple routes for optimization:
Exact: apply a distance transform to the occlusion map; this gives you the distance to the nearest occluder at each pixel. After that you can safely step by that distance within the loop instead of taking baby steps, which drastically reduces the number of steps in open regions.
There is a very simple CPU-side algorithm to compute a DT, and it may suit you if your occluders are static. If your scene changes every frame, however, you'll need to search the literature for GPU-side algorithms, which seem to be more complicated.
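To illustrate the distance-transform route, the inner occlusion march from the question could be replaced by something like the sketch below, where distanceSampler is a hypothetical texture holding the distance transform in pixels (an outline only, not drop-in code):
// Jump towards the light by the distance to the nearest occluder at each step,
// instead of advancing one texel at a time.
vec2 toLight = lightPosition - pos;
float lightDist = length(toLight);
vec2 dir = toLight / lightDist;
float travelled = 0.0;
float lightStrength = 1.0;
for (int i = 0; i < 64 && travelled < lightDist; ++i) // safety cap on iterations
{
    vec2 uv = (pos + dir * travelled) / size;
    float d = texture(distanceSampler, vec2(uv.x, 1.0 - uv.y)).x; // distance to nearest occluder, in pixels
    if (d < 0.5) // treat occlusion as boolean: anything this close counts as a hit
    {
        lightStrength = 0.0;
        break;
    }
    travelled += max(d, 1.0); // safe jump of at least one texel
}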
Inexact: resort to soft shadows -- it might be a compromise you are willing to make, and it could even be seen as an artistic choice. If you are OK with that, you can create a mipmap from your occlusion map, and then progressively increase the step size and sample coarser mip levels as you get farther from the point you are shading.
You can go further and build an emitters map (into the same 4-channel map as the occlusion). Then your entire shading pass will be independent of the number of lights. This is the 2D equivalent of voxel cone tracing GI.
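For the mipmap (soft shadow) route above, the march might look roughly like this, assuming mip levels are generated for the occlusion texture and it is sampled with a trilinear filter (again just a sketch):
// Step size and mip level grow together with distance from the shaded point,
// so far-away occluders are sampled through larger, blurrier footprints.
vec2 dir = (lightPosition - pos) / dist;
float lightStrength = 1.0;
float travelled = 1.0;
while (travelled < dist)
{
    float lod = log2(travelled); // footprint grows with distance
    vec2 uv = (pos + dir * travelled) / size;
    lightStrength *= 1.0 - textureLod(occlusionSampler, vec2(uv.x, 1.0 - uv.y), lod).x;
    travelled += exp2(lod); // step matches the mip footprint
}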

CoreML custom layer: Pixelwise Normalization with Metal Shaders

I'm converting Nvidia's Progressive Growing of GANs' Generator to coreML. I've managed to get everything transferred to coreML with the exception of the Pixelwise Normalization (Lambda) layer, which I plan on implementing as a custom coreML layer in Swift/Metal.
In TensorFlow.Keras, I have implemented pixel norm as
def pixelwise_norm(a):
    return a / tf.sqrt(tf.reduce_mean(a * a, axis=3, keep_dims=True) + 1e-8)
Now, I've barely ever worked with shaders/Metal, but following the instructions here: http://machinethink.net/blog/coreml-custom-layers/, I have a custom layer set up to use Metal for feedforward operations. I am using an MTLComputePipelineState that encodes the following shader for the layer's operations:
#include <metal_stdlib>
using namespace metal;

kernel void pixelwise_norm(
    texture2d_array<half, access::read> inTexture [[texture(0)]],
    texture2d_array<half, access::write> outTexture [[texture(1)]],
    ushort3 gid [[thread_position_in_grid]])
{
    if (gid.x >= outTexture.get_width() ||
        gid.y >= outTexture.get_height()) {
        return;
    }

    const float4 x = float4(inTexture.read(gid.xy, gid.z));
    const float4 y = 0.0000001f + (x / sqrt(pow(x,2)));
    outTexture.write(half4(y), gid.xy, gid.z);
}
I'm having trouble figuring out the Metal equivalent of reduce_mean; right now this shader implements something like the TensorFlow operation
return a / tf.sqrt((a * a) + 1e-8)
Does anyone have any pointers?
Thanks
If I'm reading this correctly, for every pixel in the feature map this divides that pixel by the L2 norm over that pixel's channels?
In that case you'll need a for loop that reads the channel values for that pixel, sums their squares, divides by the number of channels, and takes the square root. (You only need this loop if the number of channels is more than 4, since each slice of the texture array holds four channels.)
Also note that your 1e-8 needs to be inside the sqrt() or at least in the denominator.
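For what it's worth, the reduction described above can be sketched like this; it's written as a GLSL-style sketch over a hypothetical array texture (in the Metal kernel you would loop over slices with inTexture.read(gid.xy, slice) instead), and it assumes the channel count is a multiple of 4:
// Divide each pixel by the RMS over its channels (the pixelwise norm).
uniform sampler2DArray featureMap; // hypothetical: C channels packed 4 per slice
uniform int numSlices;             // hypothetical: C / 4
uniform int numChannels;           // hypothetical: C
vec4 pixelwiseNorm(vec2 uv, int slice)
{
    float sumSq = 0.0;
    for (int s = 0; s < numSlices; ++s)
    {
        vec4 v = texture(featureMap, vec3(uv, float(s)));
        sumSq += dot(v, v); // sum of squares of this slice's four channels
    }
    float rms = sqrt(sumSq / float(numChannels) + 1e-8); // epsilon inside the sqrt
    return texture(featureMap, vec3(uv, float(slice))) / rms;
}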

Physically based camera values too small

I am currently working on a physically based camera model and came across this blog: https://placeholderart.wordpress.com/2014/11/21/implementing-a-physically-based-camera-manual-exposure/
So I tried to implement it myself in OpenGL. I thought of calculating the exposure using the function getSaturationBasedExposure and passing that value to a shader, where I multiply the final color by it:
float getSaturationBasedExposure(float aperture,
                                 float shutterSpeed,
                                 float iso)
{
    float l_max = (7800.0f / 65.0f) * Sqr(aperture) / (iso * shutterSpeed);
    return 1.0f / l_max;
}
colorOut = color * exposure;
But the values I get from that function are way too small (around 0.00025), so I guess I am misunderstanding the returned value of that function.
In the blog a test scene is mentioned in which the scene luminance is around 4000, but I haven't seen a shader implementation working with a color range from 0 to 4000+ (not even HDR goes that high, right?).
So could anyone explain to me how to apply these calculations correctly to an OpenGL scene, or help me understand the meaning behind them?
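As a quick check of the arithmetic, assuming the blog's saturation-based model (l_max is the scene luminance that just saturates the sensor): an exposure of 0.00025 corresponds to l_max = 1 / 0.00025 = 4000, so
4000 (scene luminance) * 0.00025 (exposure) = 1.0
i.e. the returned value is meant to scale physical, scene-referred luminances in the thousands down into [0, 1], not colors that are already in [0, 1].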

Uniform affecting shader flow and performances

I was experimenting with OpenGL fragment shaders by doing a huge blur (300x300) in two passes, one horizontal, one vertical.
I noticed that passing the direction as a uniform (vec2) is about 10 times slower than writing it directly in the code (from 140 fps down to 12 fps).
i.e.:
vec2 dir = vec2(0, 1) / textureSize(tex, 0);
int size = 150;
for(int i = -size; i != size; ++i) {
    float w = // compute weight here...
    acc += w * texture(tex, coord + vec2(i) * dir);
}
appears to be faster than:
uniform vec2 dir;
/*
...
*/
int size = 150;
for(int i = -size; i != size; ++i) {
    float w = // compute weight here...
    acc += w * texture(tex, coord + vec2(i) * dir);
}
Creating two programs with different uniforms doesn't change anything.
Does anyone know why there is such a huge difference, and why the driver doesn't see that "inlining" dir might be much faster?
EDIT: Taking size as a uniform also has an impact, but not as much as dir.
If you are interested in seeing what it looks like (FRAPS provides the fps counter):
uniform blur.
"inline" blur.
no blur.
Quick notes: I am running on an NVIDIA GTX 760M using OpenGL 4.2 and GLSL 4.20. Also, puush's JPEG compression is responsible for the colors in the images.
A good guess would be that the uniform data is stored in shared memory, but might require an occasional round-trip to global memory (VRAM), while the non-uniform version keeps that little piece of data in registers or constant memory.
However, since the OpenGL standard does not dictate where your data is stored, you would have to look at a profiler and try to gain a better understanding of how NVIDIA's GL implementation works.
I'd recommend you start by profiling, using NVIDIA PerfKit or NVIDIA Nsight for Visual Studio, even if you think it's too much trouble for now. If you want to write high-performance code, you should start getting used to the process; you will see how easy it gets eventually.
EDIT:
So why is it so much slower? Because in this case, one failed optimization (the data not ending up in registers) can cause other (if not most other) optimizations to also fail: most likely, with a literal dir the compiler can constant-fold the sample offsets and fully unroll the 300-tap loop, while with a uniform it has to keep both generic. And optimizations are absolutely necessary for GPU code to run fast.
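If you need both passes but want the constant-folded speed, one common workaround (sketched here, assuming you assemble the shader source yourself) is to bake the direction into each program as a compile-time constant instead of a uniform:
// Injected into the source after the #version line when building each program:
//   horizontal pass:  #define BLUR_DIR vec2(1.0, 0.0)
//   vertical pass:    #define BLUR_DIR vec2(0.0, 1.0)
// Because BLUR_DIR is a compile-time constant, the compiler is free to fold it
// into the sample offsets and unroll the loop, like the hand-written version.
uniform sampler2D tex;
in vec2 coord;
out vec4 acc;
void main() {
    vec2 dir = BLUR_DIR / vec2(textureSize(tex, 0));
    int size = 150;
    acc = vec4(0.0);
    for(int i = -size; i != size; ++i) {
        float w = 1.0 / float(2 * size); // placeholder weight, substitute your own
        acc += w * texture(tex, coord + vec2(i) * dir);
    }
}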

Linear sampled Gaussian blur quality issue

I recently implemented a linear sampled gaussian blur based on this article: Linear Sampled Gaussian Blur
It generally came out well; however, it appears there is slight aliasing on text and thinner collections of pixels. I'm pretty stumped as to what is causing this: is it an issue with my shader or weight calculations, or is it an inherent drawback of using this method?
I'd like to add that I don't run into this issue when I sample each pixel regularly instead of using bilinear filtering.
Any insights are much appreciated. Here's a code sample of how I work out my weights:
int support = int(sigma * 3.0f);
float total = 0.0f;

weights.push_back(exp(-(0*0)/(2*sigma*sigma))/(sqrt(2*constants::pi)*sigma));
total += weights.back();
offsets.push_back(0);

for (int i = 1; i <= support; i++)
{
    float w1 = exp(-(i*i)/(2*sigma*sigma))/(sqrt(2*constants::pi)*sigma);
    float w2 = exp(-((i+1)*(i+1))/(2*sigma*sigma))/(sqrt(2*constants::pi)*sigma);
    weights.push_back(w1 + w2);
    total += 2.0f * weights[i];
    offsets.push_back((i * w1 + (i + 1) * w2) / weights[i]);
}

for (int i = 0; i < support; i++)
{
    weights[i] /= total;
}
And here is the fragment shader (there is another vertical version of this shader too):
void main()
{
    vec3 acc = texture2D(tex_object, v_tex_coord.st).rgb * weights[0];

    for (int i = 1; i < NUM_SAMPLES; i++)
    {
        acc += texture2D(tex_object, (v_tex_coord.st + (vec2(offsets[i], 0.0) / tex_size))).rgb * weights[i];
        acc += texture2D(tex_object, (v_tex_coord.st - (vec2(offsets[i], 0.0) / tex_size))).rgb * weights[i];
    }

    gl_FragColor = vec4(acc, 1.0);
}
Here is a screenshot depicting the issue:
This looks like a correct Gaussian blur to me. The extent to which text is disrupted depends on your sigma. What value are you using?
Also I would check the scaling matrix for the projection you are using.
If you want to blur without affecting text and thin pixel lines, you might think of:
compositing the result with the output of a mild high-pass filter;
using a smaller sigma;
changing the shape of the kernel so it's not Gaussian: rather than exp(-i*i/(s*s)), you might try a function with higher excess kurtosis, such as a linear up/down function or one of the functions listed on this page: http://en.wikipedia.org/wiki/Kurtosis . They will all lead to blurs that disrupt fine detail to varying degrees.
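To make the last suggestion concrete, a linear up/down (tent) kernel could look like the small GLSL helper below; it only illustrates the kernel shape, and the normalisation assumes 2*support + 1 regular (non-linear-sampled) taps with i in [-support, support]:
// Tent kernel: the weight rises linearly to the centre tap and falls off linearly.
// Summed over i = -support .. support, the weights add up to exactly 1.
float tentWeight(int i, int support)
{
    float w = float(support + 1 - abs(i));
    return w / float((support + 1) * (support + 1));
}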
This is an inherent issue with the bilinear filtering. It's unavoidable.