Multiplying two float variables in CUDA - C++

I have a really interesting problem. I have been working on it for 3 hours and I just can't figure out what is going on or why it isn't working. I tried googling it, but with no results.
I am writing a program in CUDA. I have this really simple piece of code:
__global__ void calcErrorOutputLayer_kernel(*arguments...*)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float gradient;
    float derivation;
    derivation = pow((2/(pow(euler, neuron_device[startIndex + idx].outputValue) +
                 pow(euler, -neuron_device[startIndex + idx].outputValue))), 2);
    gradient = (backVector_device[idx] - neuron_device[startIndex + idx].outputValue);
    gradient = gradient * derivation; //this line doesn't work
    gradient = gradient * 2.0; //this line works
}
OK, so gradient is calculated correctly, and so is derivation. But when the line comes where these two variables should be multiplied with each other, nothing happens (the value of gradient isn't changed), and on the next line the CUDA debugger tells me: "'derivation' has no value at the target location".
gradient * 2.0 works correctly and doubles the value of gradient.
Can anyone help me please?

a = pow(euler, neuron_device[startIndex + idx].outputValue);
b = pow(euler, -neuron_device[startIndex + idx].outputValue);
derivation = pow(2/(a + b), 2);
pow raises a domain error, setting the global variable errno to the value EDOM, when:
the base is negative and the exponent is not an integral value, or
the base is zero and the exponent is negative.
I guess that you are facing precision problems, and both 'a' and 'b' are 0. You are probably getting derivation = 0 or "inf".
Can you change the floats to doubles?
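As a side note, the derivation expression can be rewritten without pow at all: since cosh(v) = (e^v + e^-v)/2, the term 2/(e^v + e^-v) is just 1/cosh(v). A minimal sketch of that rewrite (my own illustration in plain C++, not the asker's kernel):

#include <cmath>

// sechSquared(v) == pow(2 / (exp(v) + exp(-v)), 2), but avoids pow's
// domain issues; for large v, cosh overflows to inf and the result
// gracefully becomes 0, which is the correct limit.
float sechSquared(float v) {
    float s = 1.0f / std::cosh(v); // == 2 / (exp(v) + exp(-v))
    return s * s;
}

// hypothetical usage in the kernel:
// derivation = sechSquared(neuron_device[startIndex + idx].outputValue);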


GLSL log returning an undefined result

I'm trying to draw the Mandelbrot set. I've created the algorithm on the CPU, and now I want to reproduce it on the GPU, but the code behaves differently.
In the CPU program, at one point, I take the std::abs(z), where z is a complex number, and write the value in the green channel on the screen.
On the GPU, I take the same z and call the following function (Vulkan, GLSL):
double module(dvec2 z) {
    return sqrt(z.x * z.x + z.y * z.y);
}
double color(dvec2 z) {
    return module(z);
}
When I write color(z) into the green channel, I get the same exact picture as I get for the CPU program, so the code works exactly the same, at least up to that point.
Next, I changed the CPU code to instead take std::log(std::abs(z)) / 20 and put that in the green channel. This is the image I get (numbers that are in the Mandelbrot set are coloured white):
You can see that the green is never clipped, so the result for each pixel is somewhere in the range (0, 1).
I then changed the GPU code to this:
double module(dvec2 z) {
    return sqrt(z.x * z.x + z.y * z.y);
}
double color(dvec2 z) {
    return log(module(z));
}
I wrote color(z) / 20 into the green channel. This is the resulting image:
As you can see, the value of color(z) / 20 must be <=0. I tried changing the color function to this:
double color(dvec2 z) {
    return -log(module(z));
}
To see if the value was 0 or negative. I still got the same image, so the value must be 0. To confirm this, I changed the code again, now to this:
double color(dvec2 z) {
    return log(module(z)) + 0.5;
}
and wrote color(z) to the green channel (dropping the division by 20). I expected the result to be a medium green colour.
To my surprise, the image did not change; the pixels were still pitch black.
Perplexed, I reverted the change to the original:
double color(dvec2 z) {
    return log(module(z));
}
but I wrote color(z) + 0.5 into the green channel and I got this:
To summarize, it seems that log(module(z)) is returning some undefined value. If you negate it or try to add anything to it, it remains undefined. When this value is returned from a function that has double as the return type, the value returned is 0, which can then be added to.
Why does this happen? The function module(z) is guaranteed to return a positive number, so the log function should return a valid result. Both std::log and GLSL's log are defined as the natural logarithm of the argument, so the values should be exactly the same (ignoring precision error).
How do I make GLSL log behave properly?
It turns out that the GPU doesn't really like it when you ask it to calculate the log of a very large number. From what I gather, log (actually ln) is implemented as a Taylor series. This is unfortunate, because the n-th term of the series involves the argument raised to the n-th power.
However, if you have a number represented as x = mantissa * 2^exp, you can get ln(x) from the following formula:
ln(x) = exp * ln(2) + ln(mantissa)
Whatever x is, the mantissa will be significantly smaller. Here's a function for the fragment shader:
float ln(float z) {
    // For 32-bit floats: mantissaBits = 23, expBits = 8, and the
    // constant log2 holds ln(2) = 0.69314718.
    int integerValue = floatBitsToInt(z);
    // extract the unbiased exponent
    int exp = ((integerValue >> mantissaBits) & ((1 << expBits) - 1))
              - ((1 << (expBits - 1)) - 1);
    // force the biased exponent field to 127 so the remaining bits
    // encode just the mantissa, as a float in [1, 2)
    integerValue |= ((1 << expBits) - 1) << mantissaBits;
    integerValue &= ~(1 << (mantissaBits + expBits - 1));
    return exp * log2 + log(intBitsToFloat(integerValue));
}
Note that in GLSL this trick only works with floats - there is no 64-bit integral type and thus no doubleBitsToLong or its inverse.
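The same decomposition is easy to verify on the CPU. A minimal C++ sketch of my own, using std::frexp (which returns the mantissa in [0.5, 1) and the matching exponent):

#include <cmath>
#include <cstdio>

float lnViaExponent(float z) {
    int exp;
    float mantissa = std::frexp(z, &exp); // z == mantissa * 2^exp
    return exp * 0.69314718f + std::log(mantissa);
}

int main() {
    // both print roughly 69.0776, i.e. ln(1e30)
    std::printf("%f\n%f\n", lnViaExponent(1e30f), std::log(1e30f));
}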

How can I make MATLAB's precision the same as in C++?

I have a problem with precision. I have to make my C++ code have the same precision as MATLAB. In MATLAB I have a script which does some stuff with numbers, etc. I have code in C++ which does the same as that script, but the output on the same input is different. I found that in my script, when I try 104 >= 104, it returns false. I tried to use format long, but it did not help me find out why it's false. Both numbers are of type double. I thought that maybe MATLAB stores somewhere the real value of 104 and it's really something like 103.9999..., so I raised my precision in C++. That also didn't help, because where MATLAB returns a value of 50.000, in C++ I get a value of 50.050 with high precision. Those two values come from a few calculations like + or *. Is there any way to make my C++ and MATLAB scripts have the same precision?
for i = 1:neighbors
    y = spoints(i,1)+origy;
    x = spoints(i,2)+origx;
    % Calculate floors, ceils and rounds for the x and y.
    fy = floor(y); cy = ceil(y); ry = round(y);
    fx = floor(x); cx = ceil(x); rx = round(x);
    % Check if interpolation is needed.
    if (abs(x - rx) < 1e-6) && (abs(y - ry) < 1e-6)
        % Interpolation is not needed, use original datatypes
        N = image(ry:ry+dy,rx:rx+dx);
        D = N >= C;
    else
        % Interpolation needed, use double type images
        ty = y - fy;
        tx = x - fx;
        % Calculate the interpolation weights.
        w1 = (1 - tx) * (1 - ty);
        w2 = tx * (1 - ty);
        w3 = (1 - tx) * ty;
        w4 = tx * ty;
        % Compute interpolated pixel values
        N = w1*d_image(fy:fy+dy,fx:fx+dx) + w2*d_image(fy:fy+dy,cx:cx+dx) + ...
            w3*d_image(cy:cy+dy,fx:fx+dx) + w4*d_image(cy:cy+dy,cx:cx+dx);
        D = N >= d_C;
    end
end
I have problems in the else branch (line 12 of the snippet). tx and ty equal 0.707106781186547 or 1 - 0.707106781186547. Values from d_image are in the range 0 to 255. N is the 0..255 value obtained by interpolating 4 pixels from the image. d_C is a value in 0..255. I still don't know why MATLAB shows that when I have in N values like: x x x 140.0000 140.0000, and in d_C: x x x 140 x, D gives me 0 in the 4th position, so 140.0000 != 140. I debugged it with more precision, but it still says that it's 140.00000000000000 and it is still not 140.
int Codes::Interpolation( Point_<int> point, Point_<int> center, Mat *mat)
{
    int x = center.x-point.x;
    int y = center.y-point.y;
    Point_<double> my;
    if(x<0)
    {
        if(y<0)
        {
            my.x=center.x+LEN;
            my.y=center.y+LEN;
        }
        else
        {
            my.x=center.x+LEN;
            my.y=center.y-LEN;
        }
    }
    else
    {
        if(y<0)
        {
            my.x=center.x-LEN;
            my.y=center.y+LEN;
        }
        else
        {
            my.x=center.x-LEN;
            my.y=center.y-LEN;
        }
    }
    int a=my.x;
    int b=my.y;
    double tx = my.x - a;
    double ty = my.y - b;
    double wage[4];
    wage[0] = (1 - tx) * (1 - ty);
    wage[1] = tx * (1 - ty);
    wage[2] = (1 - tx) * ty;
    wage[3] = tx * ty;
    int values[4];
    // write the 4 pixels that take part in the interpolation into the array
    for(int i=0;i<4;i++)
    {
        int val = mat->at<uchar>(Point_<int>(a+help[i].x,a+help[i].y));
        values[i]=val;
    }
    double moze = (wage[0]) * (values[0]) + (wage[1]) * (values[1]) + (wage[2]) * (values[2]) + (wage[3]) * (values[3]);
    return moze;
}
LEN = 0.707106781186547. The values in the values array are 100% the same as the MATLAB values.
MATLAB uses double precision. You can use C++'s double type. That should make most things similar, but not 100%.
As someone else noted, this is probably not the source of your problem: either there is a difference in the algorithms, or it might be something like a library function being defined differently in MATLAB and in C++. For example, MATLAB's std() divides by (n-1), while your code may divide by n.
First, as a rule of thumb, it is never a good idea to compare floating point variables directly. For example, instead of if (nr >= 104) you should use if (nr >= 104 - e), where e is a small number, like 0.00001.
However, there must be some serious undersampling or rounding error somewhere in your script, because getting 50050 instead of 50000 is not within the limits of common floating point imprecision. MATLAB's doubles, for example, are accurate to roughly 15 significant digits!
I guess there are some casting problems in your code, for example
int i;
double d;
// ...
d = i/3 * d;
will give a very inaccurate result, because you have an integer division. d = (double)i/3 * d or d = i/3. * d would give a much more accurate result.
The above example would NOT cause any problems in MATLAB, because there everything is a floating-point number by default, so a similar problem might be behind the differences in the results of the C++ and MATLAB code.
Seeing your calculations would help a lot in finding what went wrong.
EDIT:
In C and C++, if you compare a double with an integer of the same value, there is a very high chance that they will not be equal. It's the same with two doubles, but you might get lucky if you perform the exact same computations on them. Even in MATLAB it's dangerous, and maybe you were just lucky that, as both are doubles, both got truncated the same way.
From your recent edit it seems that the problem is where you evaluate your array. You should never use == or != when comparing floats or doubles in C++ (or in any language with floating-point variables). The proper way to do a comparison is to check whether they are within a small distance of each other.
An example: using == or != to compare two doubles is like comparing the weight of two objects by counting the number of atoms in them, and deciding that they are not equal even if there is one single atom difference between them.
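A minimal sketch of such a comparison (my own illustration; the tolerance is an assumption you have to tune to your data):

#include <cmath>

// true if a and b differ by less than the tolerance eps
bool nearlyEqual(double a, double b, double eps = 1e-9) {
    return std::fabs(a - b) < eps;
}

In your case, the comparison D = N >= d_C could be replaced by something like N >= d_C - eps, mirroring the abs(x - rx) < 1e-6 test your MATLAB script already uses.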
MATLAB uses double precision unless you say otherwise. Any differences you see with an identical implementation in C++ will be due to floating-point errors.

Lagrange approximation - C++

I updated the code.
What I am trying to do is to hold all of the values of each Lagrange coefficient in the pointer d (for example, for L1(x), d[0] would be (x-x2)/(x1-x2), d[1] would be ((x-x2)/(x1-x2))*((x-x3)/(x1-x3)), etc.).
My problems are:
1) how to initialize d (I did d[0]=(z-x[i])/(x[k]-x[i]), but I think the "d[0]" is not right);
2) how to initialize L_coeff (I am using L_coeff=new double[0], but I am not sure if that's right).
The exercise is:
Find Lagrange's polynomial approximation for y(x) = cos(π x), x ∈ [-1, 1], using 5 points
(x = -1, -0.5, 0, 0.5, and 1).
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
using namespace std;
const double pi=3.14159265358979323846264338327950288;
// my function
double f(double x){
    return (cos(pi*x));
}
//function to compute lagrange polynomial
double lagrange_polynomial(int N,double *x){
    //N = degree of polynomial
    double z,y;
    double *L_coeff=new double [0];//L_coefficients of every Lagrange L_coefficient
    double *d;//hold the polynomials values for every Lagrange coefficient
    int k,i;
    //computations for finding lagrange polynomial
    //double sum=0;
    for (k=0;k<N+1;k++){
        for ( i=0;i<N+1;i++){
            if (i==0) continue;
            d[0]=(z-x[i])/(x[k]-x[i]);//initialization
            if (i==k) L_coeff[k]=1.0;
            else if (i!=k){
                L_coeff[k]*=d[i];
            }
        }
        cout <<"\nL("<<k<<") = "<<d[i]<<"\t\t\tf(x)= "<<f(x[k])<<endl;
    }
}
int main()
{
    double deg,result;
    double *x;
    cout <<"Give the degree of the polynomial :"<<endl;
    cin >>deg;
    for (int i=0;i<deg+1;i++){
        cout <<"\nGive the points of interpolation : "<<endl;
        cin >> x[i];
    }
    cout <<"\nThe Lagrange L_coefficients are: "<<endl;
    result=lagrange_polynomial(deg,x);
    return 0;
}
Here is an example of a Lagrange polynomial.
As this seems to be homework, I am not going to give you an exhaustive answer, but rather try to send you on the right track.
How do you represent polynomials in computer software? The intuitive version you want to achieve, a symbolic expression like 3x^3+5x^2-4, is very impractical for further computations.
The polynomial is fully defined by saving (and outputting) its coefficients.
What you are doing above is hoping that C++ will do some algebraic manipulation for you and simplify your product with a symbolic variable. This is nothing C++ can do without quite a lot of effort.
You have two options:
Either use a proper computer algebra system that can do symbolic manipulations (Maple and Mathematica are some examples), or,
if you are bound to C++, think a bit more about how the individual coefficients of the polynomial can be computed. Your program's output can only be a list of numbers (which you could, of course, format as a nice-looking string according to a symbolic expression).
Hope this gives you some ideas how to start.
Edit 1
You still have an undefined expression in your code, as you never set any value to y. This leaves prod*=(y-x[i])/(x[k]-x[i]) as an expression that will not return meaningful data. C++ can only work with numbers, and y is not a number for you right now; you think of it as a symbol.
You could evaluate the Lagrange approximation at, say, the value 1 if you set y=1 in your code. This would give you the (as far as I can see right now) correct function value, but no description of the function itself.
Maybe you should take a pen and a piece of paper first and try to write down the expression as precise math. Try to get a real grip on what you want to compute. Once you have done that, maybe come back here and tell us your thoughts. This should help you understand what is going on.
And always remember: C++ needs numbers, not symbols. Whenever an expression on your piece of paper contains a symbol whose value you do not know, you must either find a way to compute that value from the known values or eliminate the need to compute with that symbol.
P.S.: It is not considered good style to post identical questions in multiple discussion boards at once...
Edit 2
Now you evaluate the function at the point y=0.3. This is the way to go if you want to evaluate the polynomial. However, as you stated, you want all the coefficients of the polynomial.
Again, I still feel you have not understood the math behind the problem. Maybe a small example will help. I am going to use the notation of the Wikipedia article.
Suppose we had k=2 and x = -1, 1. Furthermore, let me just call your cos function f, for simplicity. (The notation will get rather ugly without LaTeX...) Then the Lagrange polynomial is defined as
f(x_0) * l_0(x) + f(x_1)*l_1(x)
where (by doing the simplifications again symbolically)
l_0(x)= (x - x_1)/(x_0 - x_1) = -1/2 * (x-1) = -1/2 *x + 1/2
l_1(x)= (x - x_0)/(x_1 - x_0) = 1/2 * (x+1) = 1/2 * x + 1/2
So, your Lagrange polynomial is
f(x_0) * (-1/2 * x + 1/2) + f(x_1) * (1/2 * x + 1/2)
= 1/2 * (f(x_1) - f(x_0)) * x + 1/2 * (f(x_0) + f(x_1))
So, the coefficients you want to compute would be 1/2 * (f(x_1) - f(x_0)) and 1/2 * (f(x_0) + f(x_1)).
Your task is now to find an algorithm that does the simplification I did, but without using symbols. If you know how to compute the coefficients of the l_j, you are basically done, as you can then just add them up, each multiplied by the corresponding value of f.
So, broken down even further, you have to find a way to multiply the quotients in the l_j with each other on a coefficient-by-coefficient basis. Figure out how this is done and you are nearly done.
Edit 3
Okay, lets get a little bit less vague.
We first want to compute the L_i(x). These are just products of linear functions. As said before, we have to represent each polynomial as an array of coefficients. For good style, I will use std::vector instead of a raw array. Then, we could define the data structure holding the coefficients of L_1(x) like this:
std::vector<double> L1(5);
// Let's assume our polynomial then has the form
// L1[0] + L1[1]*x^1 + L1[2]*x^2 + L1[3]*x^3 + L1[4]*x^4
Now we want to fill this polynomial with values.
// First we have to start with the polynomial 1 (which is of degree 0).
// Therefore set L1 accordingly:
L1[0] = 1;
L1[1] = 0; L1[2] = 0; L1[3] = 0; L1[4] = 0;
// Of course you could do this more elegantly (using std::vector's constructor, for example)
for (int i = 0; i < N+1; ++i) {
    if (i==0) continue; // for i=0 there is no polynomial multiplication
    // Otherwise, we have to multiply L1 with the polynomial
    // (x - x[i]) / (x[0] - x[i])
    // First, note that (x[0] - x[i]) is just a scalar; we will save it:
    double c = (x[0] - x[i]);
    // Now we multiply L1 with (x - x[i]). How does this multiplication change our
    // coefficients? Easy enough: the coefficient of x^1, for example, is just
    // L1[0] - L1[1] * x[i]. Other coefficients are done similarly. Furthermore, we
    // have to divide by c, which leaves the coefficient as
    // (L1[0] - L1[1] * x[i])/c. Let's apply this to the vector, from the highest
    // coefficient downwards so every update still sees the old values:
    L1[4] = (L1[3] - L1[4] * x[i])/c;
    L1[3] = (L1[2] - L1[3] * x[i])/c;
    L1[2] = (L1[1] - L1[2] * x[i])/c;
    L1[1] = (L1[0] - L1[1] * x[i])/c;
    L1[0] = (      - L1[0] * x[i])/c;
    // There we are, polynomial updated.
}
This, of course, has to be done for all the L_i. Afterwards, the L_i have to be multiplied by the function values and added up. That is for you to figure out. (Note that I did quite a lot of inefficient stuff up there, but I hope this helps you understand the details better.)
Hopefully this gives you some idea how you could proceed.
The variable y is actually not a variable in your code, but represents the variable of your Lagrange approximation P(y).
Thus, you have to understand the calculations prod*=(y-x[i])/(x[k]-x[i]) and sum+=prod*f not directly but symbolically.
You may get around this by defining your approximation by a series
c[0] * y^0 + c[1] * y^1 + ...
represented by an array c[] within the code. Then you can e.g. implement multiplication
d = c * (y-x[i])/(x[k]-x[i])
coefficient-wise like
d[i] = -c[i]*x[i]/(x[k]-x[i]) + c[i-1]/(x[k]-x[i])
The same way you have to implement addition and assignments on a component basis.
The result will then always be the coefficients of your series representation in the variable y.
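Putting the advice from the answers above together, a minimal self-contained sketch of this coefficient-wise approach might look as follows (my own illustration, not the asker's code; the function name lagrangeCoefficients is made up):

#include <cmath>
#include <iostream>
#include <vector>

// Returns the coefficients (constant term first) of the unique polynomial
// of degree n-1 passing through the points (x[k], y[k]).
std::vector<double> lagrangeCoefficients(const std::vector<double>& x,
                                         const std::vector<double>& y) {
    const std::size_t n = x.size();
    std::vector<double> result(n, 0.0);
    for (std::size_t k = 0; k < n; ++k) {
        // build l_k(x) as a coefficient vector, starting from the constant 1
        std::vector<double> l(n, 0.0);
        l[0] = 1.0;
        std::size_t degree = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i == k) continue;
            const double c = x[k] - x[i];
            // multiply l by (x - x[i]) / c, coefficient by coefficient,
            // from the highest coefficient downwards
            ++degree;
            for (std::size_t j = degree; j > 0; --j)
                l[j] = (l[j - 1] - l[j] * x[i]) / c;
            l[0] = -l[0] * x[i] / c;
        }
        // accumulate f(x_k) * l_k(x) into the result
        for (std::size_t j = 0; j < n; ++j)
            result[j] += y[k] * l[j];
    }
    return result;
}

int main() {
    const double pi = 3.14159265358979323846;
    std::vector<double> x = {-1.0, -0.5, 0.0, 0.5, 1.0};
    std::vector<double> y;
    for (double xi : x) y.push_back(std::cos(pi * xi));
    for (double c : lagrangeCoefficients(x, y)) std::cout << c << ' ';
    std::cout << '\n';
}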
Just a few comments in addition to the existing responses.
The exercise is: Find Lagrange's polynomial approximation for y(x)=cos(π x), x ∈ [-1,1] using 5 points (x = -1, -0.5, 0, 0.5, and 1).
The first thing that your main() does is ask for the degree of the polynomial. You should not be doing that. The degree of the polynomial is fully specified by the number of control points. In this case you should be constructing the unique fourth-degree Lagrange polynomial that passes through the five points (x_i, cos(π x_i)), where the x_i values are the five specified points.
const double pi=3.1415;
This value is not good for a float, let alone a double. You should be using something like const double pi=3.14159265358979323846264338327950288;
Or better yet, don't use pi at all. You should know exactly what the y values are that correspond to the given x values. What are cos(-π), cos(-π/2), cos(0), cos(π/2), and cos(π)?
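(Those values are exact: cos(-π) = -1, cos(-π/2) = 0, cos(0) = 1, cos(π/2) = 0, and cos(π) = -1, so the sample data could simply be hard-coded; a small sketch of my own:)

double x[] = {-1.0, -0.5, 0.0, 0.5, 1.0};
double y[] = {-1.0,  0.0, 1.0, 0.0, -1.0}; // cos(pi*x) at the five points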

sin and cos are slow, is there an alternative?

My game needs to move by a certain angle. To do this I get the direction vector of the angle via sin and cos. Unfortunately, sin and cos are my bottleneck. I'm sure I do not need this much precision. Is there an alternative to C's sin & cos and a look-up table that is decently precise but very fast?
I had found this:
float Skeleton::fastSin( float x )
{
    const float B = 4.0f/pi;
    const float C = -4.0f/(pi*pi);
    float y = B * x + C * x * abs(x);
    const float P = 0.225f;
    return P * (y * abs(y) - y) + y;
}
Unfortunately, this does not seem to work. I get significantly different behavior when I use this sin rather than C's sin.
Thanks
A lookup table is the standard solution. You could also use two lookup tables, one for degrees and one for tenths of a degree, and make use of sin(A + B) = sin(A)cos(B) + cos(A)sin(B).
For your fastSin(), you should check its documentation to see what range it's valid on. The units you're using for your game could be too big or too small and scaling them to fit within that function's expected range could make it work better.
EDIT:
Someone else mentioned getting it into the desired range by subtracting PI, but there is a function called fmod for doing modulus division on floats/doubles, so this should do it:
#include <iostream>
#include <cmath>

float fastSin( float x ){
    x = fmod(x + M_PI, M_PI * 2) - M_PI; // restrict x so that -M_PI < x < M_PI
    const float B = 4.0f/M_PI;
    const float C = -4.0f/(M_PI*M_PI);
    float y = B * x + C * x * std::abs(x);
    const float P = 0.225f;
    return P * (y * std::abs(y) - y) + y;
}

int main() {
    std::cout << fastSin(100.0) << '\n' << std::sin(100.0) << std::endl;
}
I have no idea how expensive fmod is though, so I'm going to try a quick benchmark next.
Benchmark Results
I compiled this with -O2 and ran the result with the Unix time program:
int main() {
    float a = 0;
    for(int i = 0; i < REPETITIONS; i++) {
        a += sin(i); // or fastSin(i);
    }
    std::cout << a << std::endl;
}
The result is that sin is about 1.8x slower (if fastSin takes 5 seconds, sin takes 9). The accuracy also seemed to be pretty good.
If you chose to go this route, make sure to compile with optimization on (-O2 in gcc).
I know this is already an old topic, but for people who have the same question, here is a tip.
A lot of the time in 2D and 3D rotation, all vectors are rotated by a fixed angle. Instead of calling cos() or sin() in every cycle of the loop, create a variable before the loop which already contains the value of cos(angle) or sin(angle), and use this variable in your loop. This way the function only has to be called once.
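A minimal sketch of that idea (my own; the Vec2 struct and points container are made up for illustration):

#include <cmath>
#include <vector>

struct Vec2 { float x, y; };

void rotateAll(std::vector<Vec2>& points, float angle) {
    // hoisted out of the loop: computed once, used for every vector
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    for (Vec2& p : points) {
        const float x = p.x * c - p.y * s;
        p.y = p.x * s + p.y * c;
        p.x = x;
    }
}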
If you rephrase the return in fastSin as
return (1-P) * y + P * (y * abs(y))
and rewrite y as (for x > 0)
y = 4 * x * (pi - x) / (pi * pi)
you can see that y is a parabolic approximation to sin(x), chosen so that it passes through (0,0), (pi/2,1) and (pi,0), and is symmetrical about x = pi/2.
Thus we can only expect our function to be a good approximation from 0 to pi. If we want values outside that range, we can use the 2*pi periodicity of sin(x) and the fact that sin(x+pi) = -sin(x).
The y*abs(y) term is a "correction term" which also passes through those three points. (I'm not sure why y*abs(y) is used rather than just y*y, since y is positive in the 0 to pi range.)
This form of overall approximation guarantees that a linear blend of the two functions y and y*abs(y), (1-P)*y + P*y*abs(y), will also pass through (0,0), (pi/2,1) and (pi,0).
We might expect y to be a decent approximation to sin(x), but the hope is that by picking a good value for P we get a better approximation.
One question is "How was P chosen?". Personally, I'd choose the P that produces the least RMS error over the [0, pi/2] interval, i.e. the P minimizing
E(P) = integral from 0 to pi/2 of ( sin(x) - ((1-P)*y + P*y^2) )^2 dx
(I'm not sure that's how this P was chosen, though.) Minimizing this with respect to P gives
dE/dP = -2 * integral from 0 to pi/2 of (y^2 - y) * ( sin(x) - (1-P)*y - P*y^2 ) dx = 0
This can be rearranged and solved for P.
Wolfram Alpha evaluates the initial integral to the quadratic
E = (16 π^5 p^2 - (96 π^5 + 100800 π^2 - 967680)p + 651 π^5 - 20160 π^2)/(1260 π^4)
which has a minimum of
min(E) = -11612160/π^9 + 2419200/π^7 - 126000/π^5 - 2304/π^4 + 224/π^2 + (169 π)/420
≈ 5.582129689596371e-07
at
p = 3 + 30240/π^5 - 3150/π^3
≈ 0.2248391013559825
Which is pretty close to the specified P=0.225.
You can raise the accuracy of the approximation by adding an additional correction term, giving a form something like return (1-a-b)*y + a*y*abs(y) + b*y*y*abs(y). I would find a and b in the same way as above, this time giving a system of two linear equations in a and b to solve, rather than a single equation in p. I'm not going to do the derivation, as it is tedious and the conversion to LaTeX images is painful... ;)
NOTE: When answering another question I thought of another valid choice for P.
The problem is that using reflection to extend the curve into (-pi,0) leaves a kink in the curve at x=0. However, I suspect we can choose P such that the kink becomes smooth.
To do this take the left and right derivatives at x=0 and ensure they are equal. This gives an equation for P.
You can compute a table S of 256 values, from sin(0) to sin(2*pi). Then, to evaluate sin(x), reduce x into [0, 2*pi] and pick the two table values S[a], S[b] such that a < x < b. From these, with linear interpolation, you should get a fair approximation.
Memory-saving trick: you actually only need to store [0, pi/2] and use the symmetries of sin(x).
Enhancement trick: linear interpolation can be a problem because of non-smooth derivatives; human eyes are good at spotting such glitches in animation and graphics. Use cubic interpolation instead.
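A minimal sketch of the table approach with linear interpolation (my own illustration; the 256-entry size follows the suggestion above):

#include <cmath>

const int TABLE_SIZE = 256;
const float TWO_PI = 6.28318530718f;
float sinTable[TABLE_SIZE + 1]; // one extra entry so interpolation never reads past the end

void initSinTable() {
    for (int i = 0; i <= TABLE_SIZE; ++i)
        sinTable[i] = std::sin(TWO_PI * i / TABLE_SIZE);
}

float tableSin(float x) {
    x = std::fmod(x, TWO_PI);
    if (x < 0) x += TWO_PI;              // reduce x into [0, 2*pi)
    float pos = x / TWO_PI * TABLE_SIZE; // fractional table index
    int a = (int)pos;
    float t = pos - a;
    return sinTable[a] * (1 - t) + sinTable[a + 1] * t; // linear interpolation
}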
What about
x*(0.0174532925199433 - 8.650935142277599*10^-7*x^2)
for degrees and
x*(1 - 0.162716259904269*x^2)
for radians, on [-45, 45] and [-pi/4, pi/4] respectively?
This (i.e. the fastSin function) approximates the sine function using a parabola. I suspect it's only good for values between -π and +π. Fortunately, you can keep adding or subtracting 2π until you get into this range.
You can use this approximation. The solution uses a quadratic curve:
http://www.starming.com/index.php?action=plugin&v=wave&ajax=iframe&iframe=fullviewonepost&mid=56&tid=4825

Why does division yield a vastly different result than multiplication by a fraction in floating point?

I understand why floating point numbers can't be compared exactly, and I know about the mantissa and exponent binary representation, but I'm no expert, and today I came across something I don't get.
Namely, let's say you have something like:
float denominator, numerator, resultone, resulttwo;
resultone = numerator / denominator;
float buff = 1 / denominator;
resulttwo = numerator * buff;
To my knowledge, different sequences of floating-point operations can yield different results, and this is not unusual. But in some edge cases these two results seem to be vastly different. To be more specific, in my GLSL code calculating the Beckmann facet slope distribution for the Cook-Torrance lighting model:
float a = 1 / (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
float b = clampedCosHalfNormal * clampedCosHalfNormal - 1.0;
float c = facetSlopeRMS * facetSlopeRMS * clampedCosHalfNormal * clampedCosHalfNormal;
facetSlopeDistribution = a * exp(b/c);
yields very very different results to
float a = (facetSlopeRMS * facetSlopeRMS * pow(clampedCosHalfNormal, 4));
facetSlopeDistribution = exp(b/c) / a;
Why does it? It is the second form of the expression that is problematic.
If I try to add the second form of the expression to a color, I get black, even though the expression should always evaluate to a positive number. Am I getting an infinity? A NaN? If so, why?
I didn't go through your mathematics in detail, but you must be aware that small errors get amplified easily by all these powers and exponentials. You should try to substitute every variable var with var + e(var) (on paper, yes) and derive an expression for the total error, without simplifying in between steps, because that's where the error comes from!
This is also a very common problem in computational fluid dynamics, where you can observe things like 'numeric diffusion' if your grid isn't properly aligned with the simulated flow.
So get a clear grip on where the biggest errors come from, and rewrite equations where possible to minimize the numeric error.
Edit: to clarify, an example.
Say you have some variable x and an expression y = exp(x). The error in x is denoted e(x) and is small compared to x (say e(x)/x < 0.0001, but note that this depends on the type you are using). Then you could say that
e(y) = y(x+e(x)) - y(x)
e(y) ~ dy/dx * e(x) (for small e(x))
e(y) = exp(x) * e(x)
So there is a magnification of the absolute error by exp(x), meaning that around x=0 there's really no issue (not a surprise, since at that point the slope of exp(x) equals that of x), but for big x you will notice this.
The relative error would then be
e(y)/y = e(y)/exp(x) = e(x)
whilst the relative error in x was
e(x)/x
so you have multiplied the relative error by a factor of x.
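To see the infinity the asker suspects, here is a small C++ sketch of my own (the constants are purely illustrative): when the denominator is tiny, 1/denominator can overflow to +inf even though the direct division is perfectly finite.

#include <cstdio>

int main() {
    float numerator = 1e-10f;
    float denominator = 1e-39f;              // a subnormal float

    float direct = numerator / denominator;  // about 1e29, finite
    float buff = 1.0f / denominator;         // about 1e39 > FLT_MAX, overflows to +inf
    float viaReciprocal = numerator * buff;  // +inf

    std::printf("%g %g\n", direct, viaReciprocal); // prints 1e+29 inf
}

Depending on which intermediate value overflows or underflows first, the two forms can produce inf or NaN at different points, which is how they can disagree so dramatically; a NaN written to a colour channel typically shows up as black.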