glsl performance: tan(acos(x)) vs sqrt(1-x*x)/x - opengl

I am writing a GLSL fragment shader in which I use shadow mapping. Following this tutorial http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-16-shadow-mapping/ , I wrote this line to evaluate the shadow bias that avoids shadow acne:
float bias = 0.005 * tan( acos ( N_L_dot ) );
But I know from math that
tan( acos( x ) ) = sqrt( 1 - x^2 ) / x
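The identity can be checked with a right-triangle argument (θ below is just notation for acos(x)):

\[
\theta = \arccos(x) \;\Rightarrow\; \cos\theta = x,\quad \sin\theta = \sqrt{1 - x^2}\ \ (0 \le \theta \le \pi),\quad \tan(\arccos(x)) = \frac{\sin\theta}{\cos\theta} = \frac{\sqrt{1-x^2}}{x}.
\]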
Would it be faster to use this identity instead of tan and acos? In practice, to use this line of code:
float bias = 0.005 * sqrt ( 1.f - N_L_dot * N_L_dot ) / N_L_dot ;
I think my question is something like: "Is the GPU faster at doing sqrt and division, or tan and acos?"
...or am I missing something?

Running both versions through the AMD GPU ShaderAnalyzer showed that

float bias = 0.005 * sqrt ( 1.f - N_L_dot * N_L_dot ) / N_L_dot ;

generates fewer instructions in the shader assembly (4 instructions, estimated at 4 clock cycles), whereas

float bias = 0.005 * tan( acos ( N_L_dot ) );

generates 15 instructions, estimated at 8 clock cycles to complete.

I ran the two methods against the Radeon HD 6450 assembly output, but the results seemed to track well across the different Radeon HD cards. It looks like the sqrt method will generally perform better.

Related

Artifacts when computing contours of a zoomed in texture

I have a shader that calculates contour lines based on a parameter in the program, in my case the height value of a mesh. The calculation is performed using standard derivatives as follows:
float contourWidth = 0.5;
float f = abs( fract( height ) - 0.5 );
float df = fwidth( height );
float mi = max( 0.0, contourWidth - 1.0 );
float ma = max( 1.0 , contourWidth );
float contour = clamp( (f - df * mi ) / ( df * ( ma - mi ) ), 0.0, 1.0 );
This works as expected; however, when I feed the height parameter from a texture and zoom in, such that the rendered pixels are much smaller than the sampled texels, artifacts begin to appear.
The sampled texture has linear filtering, and to investigate I implemented linear filtering manually in the shader to try to isolate the problem. This resolved the issue, but I'd like to understand why it happens, and whether the only solution is to implement linear filtering manually in the shader as I have, or if there is a better way.
[Image: side-by-side comparison of the two rendering techniques]
I have created a working example on Shadertoy to demonstrate the issue: https://www.shadertoy.com/view/MljcDy
I'm seeing this issue on Mac OS X as well as in mobile Safari (where the artifacts are even worse).

Accuracy warnings in scipy.special

I am running an MCMC sampler which requires the calculation of the hypergeometric function at each step using scipy.special.hyp2f1().
At certain points on my grid (which I do not care about) the solutions to the hypergeometric function are quite unstable and SciPy prints the warning:
Warning! You should check the accuracy
This is rather annoying, and over thousands of samples it may well slow down my routine.
I have tried using special.errprint(0) with no luck, as well as disabling all warnings in Python using both the warnings module and the -W ignore flag.
The offending function (called from another file) is below:
from numpy import pi, hypot, real, imag
import scipy.special as special

def deflection_angle(p, (x1, x2)):
    # Find the normalisation constant (t is defined at module level elsewhere)
    norm = (p.f * p.m * (p.r0 ** (t - 2.0)) / pi) ** (1.0 / t)
    # Define the complex plane
    z = x1 + 1j * x2
    # Define the radial coordinates
    r = hypot(x1, x2)
    # Truncate the radial coordinates
    r_ = r * (r < p.r0).astype('float') + p.r0 * (r >= p.r0).astype('float')
    # Calculate the radial part
    radial = (norm ** 2 / (p.f * z)) * ((norm / r_) ** (t - 2))
    # Calculate the angular part
    h1, h2, h3 = 0.5, 1.0 - t / 2.0, 2.0 - t / 2.0
    h4 = ((1 - p.f ** 2) / p.f ** 2) * (r_ / z) ** 2
    special.errprint(0)
    angular = special.hyp2f1(h1, h2, h3, h4)
    # Assemble the deflection angle
    alpha = (- radial * angular).conjugate()
    # Separate real and imaginary parts
    return real(alpha), imag(alpha)
Unfortunately, hyp2f1 is notoriously hard to compute over some non-trivial areas of its parameter space. Many implementations would silently produce inaccurate or wildly wrong results. scipy.special tries hard to at least monitor convergence. An alternative could be to use an arbitrary-precision implementation, e.g. mpmath, but that would certainly be quite a bit slower, so MCMC users beware.
EDIT: OK, this seems to be scipy version dependent. I tried @wrwrwr's example on scipy 0.13.3, and it reproduces what you see: "Warning! You should check the accuracy" is printed regardless of the errprint status. However, doing the same with the dev version, I get
In [12]: errprint(True)
Out[12]: 0
In [13]: hyp2f1(0.5, 2/3., 1.5, 0.09j+0.75j)
/home/br/virtualenvs/scipy_py27/bin/ipython:1: SpecialFunctionWarning: scipy.special/chyp2f1: loss of precision
#!/home/br/virtualenvs/scipy_py27/bin/python
Out[13]: (0.93934867949609357+0.15593972567482395j)
In [14]: errprint(False)
Out[14]: 1
In [15]: hyp2f1(0.5, 2/3., 1.5, 0.09j+0.75j)
Out[15]: (0.93934867949609357+0.15593972567482395j)
So, apparently this was fixed at some point between 2013 and now. You might want to upgrade your scipy version.
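With a scipy where the message is a regular SpecialFunctionWarning (as in the transcript above), the standard warnings machinery can then silence it. A minimal sketch, assuming scipy.special.SpecialFunctionWarning is available in your version:

import warnings
import scipy.special as sc

with warnings.catch_warnings():
    # Silence only the special-function precision warnings,
    # leaving all other warnings intact.
    warnings.simplefilter("ignore", sc.SpecialFunctionWarning)
    val = sc.hyp2f1(0.5, 2/3., 1.5, 0.09j + 0.75j)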

Why my Gradient is wrong (Coursera, Logistic Regression, Julia)?

I'm trying to do Logistic Regression from Coursera in Julia, but it doesn't work.
The Julia code to calculate the Gradient:
sigmoid(z) = 1 / (1 + e ^ -z)
hypotesis(theta, x) = sigmoid(scalar(theta' * x))

function gradient(theta, x, y)
    (m, n) = size(x)
    h = [hypotesis(theta, x[i,:]') for i in 1:m]
    g = Array(Float64, n, 1)
    for j in 1:n
        g[j] = sum([(h[i] - y[i]) * x[i, j] for i in 1:m])
    end
    g
end
If this gradient is used, it produces wrong results. I can't figure out why; the code seems right.
The full Julia script: in it, the optimal theta is calculated both with my gradient descent implementation and with the built-in Optim package, and the results differ.
The gradient is correct (up to a scalar multiple, as @roygvib points out). The problem is with the gradient descent.
If you look at the values of the cost function during your gradient descent, you will see a lot of NaN, which probably come from the exponential: lowering the step size (e.g., to 1e-5) will avoid the overflow, but you will have to increase the number of iterations a lot (perhaps to 10_000_000).

A better (faster) solution would be to let the step size vary. For instance, one could multiply the step size by 1.1 if the cost function improves after a step (the optimum still looks far away in this direction: we can go faster), and divide it by 2 if it does not (we went too fast and ended up past the minimum). One could also do a line search in the direction of the gradient to find the best step size, but this is time-consuming and can be replaced by approximations, e.g., Armijo's rule (sketched below).

Rescaling the predictive variables also helps.
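A rough illustration of that backtracking (Armijo) idea, as a Python sketch rather than Julia; cost and grad stand in for the question's cost_j and gradient:

import numpy as np

def armijo_step(cost, grad, theta, alpha0=1.0, beta=0.5, c=1e-4):
    # Shrink the step until the sufficient-decrease (Armijo) condition holds:
    # cost(theta - alpha*g) <= cost(theta) - c * alpha * ||g||^2
    g = grad(theta)
    f0 = cost(theta)
    alpha = alpha0
    while cost(theta - alpha * g) > f0 - c * alpha * np.dot(g, g):
        alpha *= beta
    return alpha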
I tried comparing gradient() in the OP's code with a numerical derivative of cost_j() (which is the objective function of the minimization), using the following routine:
function grad_num( theta, x, y )
    g = zeros( 3 )
    eps = 1.0e-6
    disp = zeros( 3 )
    for k = 1:3
        disp[:] = theta[:]
        disp[ k ] = theta[ k ] + eps
        plus = cost_j( disp, x, y )
        disp[ k ] = theta[ k ] - eps
        minus = cost_j( disp, x, y )
        g[ k ] = ( plus - minus ) / ( 2.0 * eps )
    end
    return g
end
But the gradient values obtained from the two routines do not seem to agree very well (at least for the initial stage of minimization)... So I manually derived the gradient of cost_j( theta, x, y ), from which it seems that a division by m is missing:
# OP's code
# g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] )

# modified code
g[j] = sum( [ (h[i] - y[i]) * x[i, j] for i in 1:m ] ) / m
Because I am not very sure that the above code and expression are really correct, could you check them yourself?
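For reference, the standard logistic-regression gradient (as given in the Coursera course) does include the 1/m factor:

\[
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\]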
But in fact, regardless of whether I use the original or the corrected gradient, the program converges to the same minimum value (0.2034977016, almost the same as obtained from Optim), because the two gradients differ only by a multiplicative factor! Because the convergence was very slow, I also modified the step size alpha adaptively, following the suggestion by Vincent (here I used more moderate values for acceleration/deceleration):
function gradient_descent(x, y, theta, alpha, n_iterations)
    ...
    c = cost_j( theta, x, y )
    for i = 1:n_iterations
        c_prev = c
        c = cost_j( theta, x, y )
        if c - c_prev < 0.0
            alpha *= 1.01
        else
            alpha /= 1.05
        end
        theta[:] = theta - alpha * gradient(theta, x, y)
    end
    ...
end
and called this routine as
optimal_theta = gradient_descent( x, y, [0 0 0]', 1.5e-3, 10^7 )[ 1 ]
[Plot: variation of cost_j versus iteration steps]

Make very small, (or large), exponential calculations

The limit of the exponential function on most 32-bit machines is about
exp( +/-700 )
But I would like to do an exponential calculation
res = exp( x ) / exp( d )
When x or d is bigger than 700, I use the fact that
exp( x + y ) = exp( x ) * exp( y )
So my calculation would be something along the lines of
res = ( exp( x - z ) * exp( z ) ) / ( exp( d - z ) * exp( z ) )
or
res = exp( x - z ) / exp( d - z )
where (x - z) < 700.
But this approach is flawed in some cases, for example where x = 6000 and d = 10000
If we use z=5300 then
res = exp( 6000 - 5300 ) / exp( 10000 - 5300 )
res = exp( 700 ) / exp( 47000 )
But exp( 47000 ) overflows on a 32-bit machine.
If I instead use z = 9300, I get the opposite effect:
res = exp( -3300 ) / exp( 700 )
where exp( -3300 ) underflows to zero.
So how could I solve the above equation (the result should be a valid 32-bit number, I think) given the limitations of the computer?
Edit
The reason for doing this is I am using the formula
P( a ) = P(y1) * P(y2) * P(y3) ... P(yN)
In order to prevent overflow I then do
a = log( P(y1) ) + log( P(y2) ) + log (P(y3)) ... log( P(yN) )
b = log( P(z1) ) + log( P(z2) ) + log (P(z3)) ... log( P(zN) )
...
z = log( P(zz1) ) + log( P(zz2) ) + log (P(zz3)) ... log( P(zzN) )
to get a total I do
total = a + b ... z
and to calculate the percentage I do
(exp(a) / exp( total ) ) * 100
but it is possible that "a" and/or "total" are greater than 700
I guess the question could be: how can I calculate the percentage without using the exponential?
The answer should be a valid 32-bit number, even if some of the intermediate steps in the calculation aren't.
For math that goes outside the bounds of an int or long type, you probably need to start using something like GMP.
https://gmplib.org/
I assume that you want to compute this:
p = exp(a) / exp(b)
And since a^b/a^c == a^(b-c) this reduces to
p = exp(a - b)
which can be easily computed if that difference is below that critical exponent.
If it isn't, then your result cannot be represented by a primitive datatype like double (because it is either extremely large or extremely small); you then need some kind of arbitrary-precision numbers, probably provided by a library.
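For instance, a minimal sketch with Python's mpmath (assuming mpmath is installed; any arbitrary-precision library works the same way):

from mpmath import mp, exp

mp.dps = 30                    # work with 30 significant digits
print(exp(6000) / exp(10000))  # e^(-4000) ~ 6.64e-1738: tiny but representable, no overflow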
But if you only need to print the result, or store it somehow, then you can easily compute even extremely large numbers:
For that, you change to base 10 (for displaying), compute the equivalent base-10 exponent (tExponent = eExponent * log10(e)), and bring that value into the allowed range between std::numeric_limits<double>::max_exponent10 and std::numeric_limits<double>::min_exponent10, saving the difference as a scaling factor.
For now, I just have a quick and dirty live example, showing
exp(90000) / exp(100) = 1.18556 scaled by 10^39043
(Check at wolfram alpha)
Note: When I wrote this, it was pretty late in the evening. I'm leaving this here for an "alternative" approach.
Now, generally, there's
a^b = [a^(b/c)]^c
And since
(a/b)^c = (a^c)/(b^c)
holds, too, I guess the easiest approach here is to just divide both exponents as long as one of them is above your critical value, then do the exponentiation, divide the results, and finally use the divisor of the former exponents as exponent for the quotient:
#include <cmath>      // exp, pow, abs
#include <stdexcept>  // out_of_range
using namespace std;

double large_exp_quot(
        double eNum,
        double eDenom,
        unsigned int const critical = 200) {
    if (abs(eNum - eDenom) > critical) {
        throw out_of_range{"That won't work, resulting exponent is too large"};
    }
    // Halve both exponents until each is in the safe range,
    // remembering how often we halved.
    unsigned int eDivisor = 1;
    while (abs(eNum) > critical or abs(eDenom) > critical) {
        eNum /= 2;
        eDenom /= 2;
        eDivisor *= 2;
    }
    // exp(eNum / 2^k) / exp(eDenom / 2^k), raised back to the 2^k-th power
    return pow(exp(eNum) / exp(eDenom), eDivisor);
}
But this will only work if the result of your computation can actually be stored in a C++ primitive datatype, in this case double. The example you gave, with exponents 6000 and 10000, is obviously not representable with a double (it's e^(-4000) and thus incredibly small).
Numerically unstable computation: exp(a) / exp(b)
Equivalent stable computation: exp(a - b)

Numerically unstable computation: p_1 * p_2 * ... * p_n
Equivalent stable computation: exp(log(p_1) + log(p_2) + ... + log(p_n))
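Combining the two rules gives the percentage computation from the edit without any overflow. A minimal Python sketch (the function name percentages is mine):

import math

def percentages(log_ps):
    # log_ps holds the log-probability sums a, b, ..., z from the question.
    m = max(log_ps)                           # shift so the largest term becomes exp(0)
    ws = [math.exp(lp - m) for lp in log_ps]  # each weight is in (0, 1]
    total = sum(ws)
    return [100.0 * w / total for w in ws]

print(percentages([6000.0, 10000.0]))  # -> [~0.0, ~100.0]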

Newton Raphson with SSE2 - can someone explain me these 3 lines

I'm reading this document: http://software.intel.com/en-us/articles/interactive-ray-tracing
and I stumbled upon these three lines of code:
The SIMD version is already quite a bit faster, but we can do better.
Intel has added a fast 1/sqrt(x) function to the SSE2 instruction set.
The only drawback is that its precision is limited. We need the
precision, so we refine it using Newton-Raphson:
__m128 nr = _mm_rsqrt_ps( x );
__m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr );
result = _mm_mul_ps( _mm_mul_ps( half, nr ), _mm_sub_ps( three, muls ) );
This code assumes the existence of a __m128 variable named 'half'
(four times 0.5f) and a variable 'three' (four times 3.0f).
I know how to use Newton-Raphson to calculate a function's zero, and I know how to use it to calculate the square root of a number, but I just can't see how this code performs it.
Can someone explain it to me please?
Given the Newton iteration y_{n+1} = (y_n / 2) * (3 - x * y_n^2), it should be quite straightforward to see this in the source code.
__m128 nr   = _mm_rsqrt_ps( x );                       // nr = y_n, the initial approximation
__m128 muls = _mm_mul_ps( _mm_mul_ps( x, nr ), nr );   // muls = x * nr * nr == x * (y_n)^2
result      = _mm_mul_ps(
                  _mm_sub_ps( three, muls ),           // (3.0 - muls)
                  _mm_mul_ps( half, nr ) );            // multiplied by y_n * 0.5
And to be precise, this algorithm is for the inverse square root.
Note that this still doesn't give a fully accurate result: rsqrtps with one NR iteration gives almost 23 bits of accuracy, versus sqrtps's 24 bits with correct rounding of the last bit.
The limited accuracy is an issue if you want to truncate the result to an integer: (int)4.99999 is 4. Also, watch out for the x == 0.0 case if computing sqrt(x) as x * rsqrt(x), because 0 * +Inf = NaN.
To compute the inverse square root of a, Newton's method is applied to the equation 0=f(x)=a-x^(-2) with derivative f'(x)=2*x^(-3) and thus the iteration step
N(x) = x - f(x)/f'(x) = x - (a*x^3-x)/2
= x/2 * (3 - a*x^2)
This division-free method has, in contrast to the globally convergent Heron's method, a limited region of convergence, so you need an already good approximation of the inverse square root to get a better approximation.
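Stripped of the SIMD plumbing, one refinement step is just this scalar iteration. A minimal Python sketch (refine_rsqrt is my name for it):

def refine_rsqrt(x, y):
    # One Newton-Raphson step for 1/sqrt(x):  y' = (y / 2) * (3 - x * y * y)
    return 0.5 * y * (3.0 - x * y * y)

x = 2.0
y = 0.7                     # rough initial guess for 1/sqrt(2) ~ 0.7071
for _ in range(3):
    y = refine_rsqrt(x, y)
print(y)                    # converges toward 0.70710678...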