Differing floating point behaviour between uniform and constants in GLSL - opengl

I am trying to implement emulated double precision in GLSL, and I observe a strange difference in behaviour that leads to subtle floating point errors.
Consider the following fragment shader, which writes to a 4-float texture so the output can be printed.
layout (location = 0) out vec4 Output;
uniform float s;
void main()
{
float a = 0.1f;
float b = s;
const float split = 8193.0; // = 2^13 + 1
float ca = split * a;
float cb = split * b;
float v1a = ca - (ca - a);
float v1b = cb - (cb - b);
Output = vec4(a,b,v1a,v1b);
}
This is the output I observe
GLSL output with uniform :
a = 0.1 0x3dcccccd
b = 2.86129e-06 0x36400497
v1a = 0.0999756 0x3dccc000
v1b = 2.86129e-06 0x36400497
Now, with the given values of a and b as inputs, the value of v1b does not have the expected result. Or at least it does not have the same result as on the CPU (as seen here):
C++ output :
a = 0.100000 0x3dcccccd
b = 0.000003 0x36400497
v1a = 0.099976 0x3dccc000
v1b = 0.000003 0x36400000
Note the discrepancy for the value of v1b (0x36400497 vs 0x36400000).
So, in an effort to figure out what was happening (and who was right), I redid the computation in GLSL with the uniform replaced by a constant, using a slightly modified shader in which the uniform is substituted by its value.
layout (location = 0) out vec4 Output;
void main()
{
float a = 0.1f;
float b = uintBitsToFloat(0x36400497u);
const float split = 8193.0; // = 2^13 + 1
float ca = split * a;
float cb = split * b;
float v1a = ca - (ca - a);
float v1b = cb - (cb - b);
Output = vec4(a,b,v1a,v1b);
}
This time, I get the same output as the C++ version of the same computation.
GLSL output with constants :
a = 0.1 0x3dcccccd
b = 2.86129e-06 0x36400497
v1a = 0.0999756 0x3dccc000
v1b = 2.86102e-06 0x36400000
My question is: what makes the floating point computation behave differently between a uniform variable and a constant? Is this some kind of behind-the-scenes compiler optimization?
Here are the OpenGL vendor strings from my laptop's Intel GPU, but I also observed the same behaviour on an NVIDIA card.
Renderer : Intel(R) HD Graphics 520
Vendor : Intel
OpenGL : 4.5.0 - Build 23.20.16.4973
GLSL : 4.50 - Build 23.20.16.4973

So, as mentioned by @njuffa in the comments, the problem was solved by using the precise qualifier on the values that depend on rigorous IEEE 754 operations:
layout (location = 0) out vec4 Output;
uniform float s;
void main()
{
float a = 0.1f;
float b = s;
const float split = 8193.0; // = 2^13 + 1
precise float ca = split * a;
precise float cb = split * b;
precise float v1a = ca - (ca - a);
precise float v1b = cb - (cb - b);
Output = vec4(a,b,v1a,v1b);
}
Output :
a = 0.1 0x3dcccccd
b = 2.86129e-06 0x36400497
v1a = 0.0999756 0x3dccc000
v1b = 2.86102e-06 0x36400000
Edit: it is highly probable that only the last precise qualifiers are needed, since they constrain the operations leading to the computation of those values and thus avoid the unwanted optimization.
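For reference, here is what that reduced variant would look like (a minimal, untested sketch of my own; it assumes the precise qualifier on the final results indeed constrains the operations that feed them):
layout (location = 0) out vec4 Output;
uniform float s;
void main()
{
float a = 0.1f;
float b = s;
const float split = 8193.0; // = 2^13 + 1
float ca = split * a;
float cb = split * b;
precise float v1a = ca - (ca - a); // precise only on the final values
precise float v1b = cb - (cb - b); // assumes the qualifier reaches back through ca/cb
Output = vec4(a,b,v1a,v1b);
}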

GPUs do not necessarily have/use IEEE 754; some implementations use a smaller number of bits, so it is no surprise that the results differ. It is the same as comparing float vs. double results on an FPU. However, you can try to enforce precision if your GLSL implementation allows it; see:
In OpenGL ES 2.0 / GLSL, where do you need precision specifiers?
I am not sure whether this also applies to standard GL/GLSL, as I have never used it there.
In the worst case, use double and dvec if your GPU allows it, but beware that there are no 64-bit interpolators yet (at least to my knowledge).
To rule out rounding caused by passing the results through a texture, see:
GLSL debug prints
You can also check the number of mantissa bits on your GPU simply by printing
1.0+1.0/2.0
1.0+1.0/4.0
1.0+1.0/8.0
1.0+1.0/16.0
...
1.0+1.0/2.0^i
The i of the last number that does not print as 1.0 is the number of mantissa bits. That way you can check whether it is 23 or not ...
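A minimal fragment-shader sketch of that probe (my own illustration, not from the original answer; the output variable name and the readback mechanism are assumptions):
layout (location = 0) out vec4 Output;
void main()
{
// Find the largest i for which 1.0 + 2^-i still differs from 1.0.
float eps = 1.0;
int bits = 0;
for (int i = 1; i <= 64; ++i)
{
eps *= 0.5; // eps = 2^-i
if (1.0 + eps != 1.0)
bits = i; // still distinguishable -> at least i mantissa bits
}
// For an IEEE 754 binary32 mantissa this ends up as 23.
// Beware: the compiler may constant-fold the loop; feeding eps in via a uniform avoids that.
Output = vec4(float(bits), 0.0, 0.0, 1.0);
}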

Related

Cuda math vs C++ math

I implemented the same algorithm on the CPU using C++ and on the GPU using CUDA. In this algorithm I have to solve an integral numerically, since there is no analytic answer to it. The function I have to integrate is a weird polynomial of a curve, with an exp function at the end.
In C++
for(int l = 0; l < 200; l++)
{
integral = integral + (a0*(1/(r_int*r_int)) + a1*(1/r_int) + a2 + a3*r_int + a4*r_int*r_int + a5*r_int*r_int*r_int)*exp(-a6*r_int)*step;
r_int = r_int + step;
}
In CUDA
for(int l = 0; l < 200; l++)
{
integral = integral + (a0*(1/(r_int*r_int)) + a1*(1/r_int) + a2 + a3*r_int + a4*r_int*r_int + a5*r_int*r_int*r_int)*__expf(-a6*r_int)*step;
r_int = r_int + step;
}
Output:
CPU: dose_output=0.00165546
GPU: dose_output=0.00142779
I think that the exp function of math.h and the __expf function of CUDA are not calculating the same thing. I tried removing the --use_fast_math compiler flag, thinking that it was the cause, but the two implementations still diverge by around 20%.
I'm using CUDA to accelerate medical physics algorithms, and these kinds of differences are not acceptable, since I have to prove that one of the outputs is "more true" than the other, and an error could obviously be catastrophic for patients.
Does the difference come from the function itself? Otherwise, I'm thinking it might come from the memcpy of the a_i factors or the way I fetch them.
Edit: "Complete" code
float a0 = 5.9991e-04;
float a1 = -1.4694e-02;
float a2 = 1.1588;
float a3 = 4.5675e-01;
float a4 = -3.8617e-03;
float a5 = 3.2066e-03;
float a6 = 4.7050e-01;
float integral = 0.0;
float r_int = 5.0;
float step = 0.1/200;
for(int l = 0; l < 200; l++)
{
integral = integral + (a0*(1/(r_int*r_int)) + a1*(1/r_int) + a2 + a3*r_int + a4*r_int*r_int + a5*r_int*r_int*r_int)*exp(-a6*r_int)*step;
r_int = r_int + step;
}
cout << "Integral=" << integral << endl;
I would suggest running this part on both a GPU and a CPU.
The values are from Carleton's seed database.
You are using the least accurate implementation of exp() from the CUDA API.
Basically, you can use three versions of exp() on the device:
exp(), the most accurate one
expf(), which is a single-precision "equivalent"
__expf(), which is an intrinsic version of the previous one, and the least accurate
You can read more about the different implementations of mathematical functions, including double-precision, single-precision and intrinsic versions, in the Mathematical Functions appendix of the CUDA documentation:
D.2. Intrinsic Functions
The functions from this section can only be used in device code.
Among these functions are the less accurate, but faster versions of
some of the functions of Standard Functions. They have the same name
prefixed with __ (such as __sinf(x)). They are faster as they map to
fewer native instructions.
On the same page you will read that the compiler option you removed just forces every function to be compiled to its intrinsic version. As you explicitly use an intrinsic version of exp(), removing this flag changes nothing for you:
The compiler has an option (-use_fast_math) that forces each function
in Table 8 to compile to its intrinsic counterpart.

surprising behavior of trigonometric functions in webgl fragment shader

In the following shader, m1 and m2 should have the same value because cos(asin(x)) == sqrt(1.0 - x*x).
However, the field produced using m1 shows a black ring in the lower left corner whereas m2 produces the expected smooth field:
precision highp float;
void main() {
float scale = 10000.0;
float p = length(gl_FragCoord.xy / scale);
float m1 = cos(asin(p));
float m2 = sqrt(1.0 - p*p);
float v = asin(m1); // change to m2 to see correct behavior
float c = degrees(v) / 90.0;
gl_FragColor = vec4(vec3(c), 1.0);
}
This behavior is really puzzling. What explains the black ring? I thought it might be a precision issue, but highp produces the same result. Or perhaps the black ring represents NaN results, but NaNs shouldn't occur there.
This replicates on macOS 10.10.5 in Chrome/FF. It does not replicate on Windows 10 or iOS 9.3.3. Could something like this be a driver issue?
(For the curious, these formulas calculate latitude for an orthographic projection centered on the north pole.)
--UPDATE--
Confirmed today that MacOS 10.11.6 does not show the rendering error. This really seems like a driver/OS issue.
According to the spec
asin(x) : Results are undefined if ∣x∣ > 1.
and
sqrt(x) : Results are undefined if x < 0.
Do either of those point out the issue?
Try
float m1 = cos(asin(clamp(p, -1., 1.)));
float m2 = sqrt(abs(1.0 - p*p));
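Folding those guards back into the original shader would look roughly like this (my own assembly of the suggestion, untested; the extra clamp on m1 is also mine, since an imprecise cos/asin can leave m1 a hair above 1.0):
precision highp float;
void main() {
float scale = 10000.0;
float p = length(gl_FragCoord.xy / scale);
// guard the domains: asin() needs |x| <= 1, sqrt() needs x >= 0
float m1 = cos(asin(clamp(p, -1., 1.)));
float m2 = sqrt(abs(1.0 - p*p));
float v = asin(clamp(m1, -1., 1.)); // change to m2 to compare
float c = degrees(v) / 90.0;
gl_FragColor = vec4(vec3(c), 1.0);
}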

glsl (GPU) matrix/vector calculations yielding different results than CPU

I can't find any documentation of different behavior, so this is just a sanity check that I'm not doing anything wrong...
I've created some helper functions in GLSL to output float/vec/mat comparisons as a color:
note: pretty sure there aren't any errors here, just including it so you know exactly what I'm doing...
//returns true or false if floats are eq (within some epsilon)
bool feq(float a, float b)
{
float c = a-b;
return (c > -0.05 && c < 0.05);
}
//returns true or false if vecs are eq
bool veq(vec4 a, vec4 b)
{
return
(
feq(a.x, b.x) &&
feq(a.y, b.y) &&
feq(a.z, b.z) &&
feq(a.w, b.w) &&
true
);
}
//returns color indicating where first diff lies between vecs
//white for "no diff"
vec4 cveq(vec4 a, vec4 b)
{
if(!feq(a.x, b.x)) return vec4(1.,0.,0.,1.);
else if(!feq(a.y, b.y)) return vec4(0.,1.,0.,1.);
else if(!feq(a.z, b.z)) return vec4(0.,0.,1.,1.);
else if(!feq(a.w, b.w)) return vec4(1.,1.,0.,1.);
else return vec4(1.,1.,1.,1.);
}
//returns true or false if mats are eq
bool meq(mat4 a, mat4 b)
{
return
(
veq(a[0],b[0]) &&
veq(a[1],b[1]) &&
veq(a[2],b[2]) &&
veq(a[3],b[3]) &&
true
);
}
//returns color indicating where first diff lies between mats
//white means "no diff"
vec4 cmeq(mat4 a, mat4 b)
{
if(!veq(a[0],b[0])) return vec4(1.,0.,0.,1.);
else if(!veq(a[1],b[1])) return vec4(0.,1.,0.,1.);
else if(!veq(a[2],b[2])) return vec4(0.,0.,1.,1.);
else if(!veq(a[3],b[3])) return vec4(1.,1.,0.,1.);
else return vec4(1.,1.,1.,1.);
}
So I have a model mat, a view mat, and a proj mat. I'm rendering a rectangle on screen (that is correctly projected/transformed...), and setting its color based on how well each step of the calculations matches my on-CPU-calculated equivalents.
uniform mat4 model_mat;
uniform mat4 view_mat;
uniform mat4 proj_mat;
attribute vec4 position;
varying vec4 var_color;
void main()
{
//this code works (at least visually)- the rect is transformed as expected
vec4 model_pos = model_mat * position;
gl_Position = proj_mat * view_mat * model_pos;
//this is the test code that does the same as above, but tests its results against CPU calculated equivalents
mat4 m;
//test proj
//compares the passed in uniform 'proj_mat' against a hardcoded rep of 'proj_mat' as printf'd by the CPU
m[0] = vec4(1.542351,0.000000,0.000000,0.000000);
m[1] = vec4(0.000000,1.542351,0.000000,0.000000);
m[2] = vec4(0.000000,0.000000,-1.020202,-1.000000);
m[3] = vec4(0.000000,0.000000,-2.020202,0.000000);
var_color = cmeq(proj_mat,m); //THIS PASSES (the rect is white)
//view
//compares the passed in uniform 'view_mat' against a hardcoded rep of 'view_mat' as printf'd by the CPU
m[0] = vec4(1.000000,0.000000,-0.000000,0.000000);
m[1] = vec4(-0.000000,0.894427,0.447214,0.000000);
m[2] = vec4(0.000000,-0.447214,0.894427,0.000000);
m[3] = vec4(-0.000000,-0.000000,-22.360680,1.000000);
var_color = cmeq(view_mat,m); //THIS PASSES (the rect is white)
//projview
mat4 pv = proj_mat*view_mat;
//proj_mat*view_mat
//compares the result of GPU computed proj*view against a hardcoded rep of proj*view **<- NOTE ORDER** as printf'd by the CPU
m[0] = vec4(1.542351,0.000000,0.000000,0.000000);
m[1] = vec4(0.000000,1.379521,-0.689760,0.000000);
m[2] = vec4(0.000000,-0.456248,-0.912496,20.792208);
m[3] = vec4(0.000000,-0.447214,-0.894427,22.360680);
var_color = cmeq(pv,m); //THIS FAILS (the rect is green)
//view_mat*proj_mat
//compares the result of GPU computed proj*view against a hardcoded rep of view*proj **<- NOTE ORDER** as printf'd by the CPU
m[0] = vec4(1.542351,0.000000,0.000000,0.000000);
m[1] = vec4(0.000000,1.379521,0.456248,0.903462);
m[2] = vec4(0.000000,0.689760,21.448183,-1.806924);
m[3] = vec4(0.000000,0.000000,-1.000000,0.000000);
var_color = cmeq(pv,m); //THIS FAILS (the rect is green)
//view_mat_t*proj_mat_t
//compares the result of GPU computed proj*view against a hardcoded rep of view_t*proj_t **<- '_t' = transpose, also note order** as printf'd by the CPU
m[0] = vec4(1.542351,0.000000,0.000000,0.000000);
m[1] = vec4(0.000000,1.379521,-0.456248,-0.447214);
m[2] = vec4(0.000000,-0.689760,-0.912496,-0.894427);
m[3] = vec4(0.000000,0.000000,20.792208,22.360680);
var_color = cmeq(pv,m); //THIS PASSES (the rect is white)
}
And here are my CPU vector/matrix calcs (matrices are col-order [m.x is first column, not first row]):
fv4 matmulfv4(fm4 m, fv4 v)
{
return fv4
{ m.x[0]*v.x+m.y[0]*v.y+m.z[0]*v.z+m.w[0]*v.w,
m.x[1]*v.x+m.y[1]*v.y+m.z[1]*v.z+m.w[1]*v.w,
m.x[2]*v.x+m.y[2]*v.y+m.z[2]*v.z+m.w[2]*v.w,
m.x[3]*v.x+m.y[3]*v.y+m.z[3]*v.z+m.w[3]*v.w };
}
fm4 mulfm4(fm4 a, fm4 b)
{
return fm4
{ { a.x[0]*b.x[0]+a.y[0]*b.x[1]+a.z[0]*b.x[2]+a.w[0]*b.x[3], a.x[0]*b.y[0]+a.y[0]*b.y[1]+a.z[0]*b.y[2]+a.w[0]*b.y[3], a.x[0]*b.z[0]+a.y[0]*b.z[1]+a.z[0]*b.z[2]+a.w[0]*b.z[3], a.x[0]*b.w[0]+a.y[0]*b.w[1]+a.z[0]*b.w[2]+a.w[0]*b.w[3] },
{ a.x[1]*b.x[0]+a.y[1]*b.x[1]+a.z[1]*b.x[2]+a.w[1]*b.x[3], a.x[1]*b.y[0]+a.y[1]*b.y[1]+a.z[1]*b.y[2]+a.w[1]*b.y[3], a.x[1]*b.z[0]+a.y[1]*b.z[1]+a.z[1]*b.z[2]+a.w[1]*b.z[3], a.x[1]*b.w[0]+a.y[1]*b.w[1]+a.z[1]*b.w[2]+a.w[1]*b.w[3] },
{ a.x[2]*b.x[0]+a.y[2]*b.x[1]+a.z[2]*b.x[2]+a.w[2]*b.x[3], a.x[2]*b.y[0]+a.y[2]*b.y[1]+a.z[2]*b.y[2]+a.w[2]*b.y[3], a.x[2]*b.z[0]+a.y[2]*b.z[1]+a.z[2]*b.z[2]+a.w[2]*b.z[3], a.x[2]*b.w[0]+a.y[2]*b.w[1]+a.z[2]*b.w[2]+a.w[2]*b.w[3] },
{ a.x[3]*b.x[0]+a.y[3]*b.x[1]+a.z[3]*b.x[2]+a.w[3]*b.x[3], a.x[3]*b.y[0]+a.y[3]*b.y[1]+a.z[3]*b.y[2]+a.w[3]*b.y[3], a.x[3]*b.z[0]+a.y[3]*b.z[1]+a.z[3]*b.z[2]+a.w[3]*b.z[3], a.x[3]*b.w[0]+a.y[3]*b.w[1]+a.z[3]*b.w[2]+a.w[3]*b.w[3] } };
}
A key thing to notice is that the view_mat_t * proj_mat_t on the CPU matched the proj_mat * view_mat on the GPU. Does anyone know why? I've done tests on matrices on the CPU and compared them to results of online matrix multipliers, and they seem correct...
I know that the GPU does things between the vert shader and the frag shader (I think it divides gl_Position by gl_Position.w or something?)... is there something else I'm not taking into account here in just the vert shader? Is something being auto-transposed at some point?
You may wish to consider GLM for CPU-side Matrix instantiation and calculations. It'll help reduce possible sources of errors.
Secondly, GPUs and CPUs do not perform identical calculations. The IEEE 754 standard for floating-point arithmetic sets relatively rigorous requirements for how these calculations have to be performed and to what degree they have to be accurate, but:
It's still possible for numbers to come up different in the least significant bit (and more than that depending on the specific operation/function being used)
Some GPU vendors opt out of ensuring strict IEEE compliance in the first place (Nvidia has been known in the past to prioritize speed over strict IEEE compliance)
I would finally note that your CPU-side computations leave a lot of room for rounding errors, which can add up. The usual advice for these kinds of questions, then, is to include tolerance in your code for small deviations. Usually, code that checks for 'equality' of two floating point numbers presumes that abs(x-y) < 0.000001 means x and y are essentially equal. Naturally, the specific threshold will have to be calibrated for your particular use.
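As a concrete illustration, a GLSL sketch of such a check (my own, not from the question's code; the thresholds are placeholders that would need calibrating as described above):
// approximate float equality with an absolute and a relative tolerance
bool approxEq(float x, float y)
{
float absTol = 1e-6; // governs values near zero
float relTol = 1e-5; // scales with the magnitude of the inputs
return abs(x - y) <= max(absTol, relTol * max(abs(x), abs(y)));
}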
And of course, you'll want to check to make sure that all your matrices/uniforms are being passed in correctly.
Ok. I've found an answer. There is nothing special about matrix operations from within a single shader. There are, however, a couple things you should be aware of:
:1: OpenGL (GLSL) uses column-major matrices. So to construct the matrix that would be visually represented in a mathematical context as this:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
you would, from within GLSL use:
mat4 m = mat4(
vec4( 1, 5, 9,13),
vec4( 2, 6,10,14),
vec4( 3, 7,11,15),
vec4( 4, 8,12,16)
);
:2: If you instead use row-major matrices on the CPU, make sure to set the "transpose" flag to true when uploading the matrix uniforms to the shader, and make sure to set it to false if you're using col-major matrices.
So long as you are aware of these two things, you should be good to go.
My particular problem above was that I was in the middle of switching from row-major to col-major in my CPU implementation and wasn't thorough in ensuring that implementation was taken into account across all my CPU matrix operations.
Specifically, here is my now-correct mat4 multiplication implementation, assuming col-major matrices:
fm4 mulfm4(fm4 a, fm4 b)
{
return fm4
{ { a.x[0]*b.x[0] + a.y[0]*b.x[1] + a.z[0]*b.x[2] + a.w[0]*b.x[3], a.x[1]*b.x[0] + a.y[1]*b.x[1] + a.z[1]*b.x[2] + a.w[1]*b.x[3], a.x[2]*b.x[0] + a.y[2]*b.x[1] + a.z[2]*b.x[2] + a.w[2]*b.x[3], a.x[3]*b.x[0] + a.y[3]*b.x[1] + a.z[3]*b.x[2] + a.w[3]*b.x[3] },
{ a.x[0]*b.y[0] + a.y[0]*b.y[1] + a.z[0]*b.y[2] + a.w[0]*b.y[3], a.x[1]*b.y[0] + a.y[1]*b.y[1] + a.z[1]*b.y[2] + a.w[1]*b.y[3], a.x[2]*b.y[0] + a.y[2]*b.y[1] + a.z[2]*b.y[2] + a.w[2]*b.y[3], a.x[3]*b.y[0] + a.y[3]*b.y[1] + a.z[3]*b.y[2] + a.w[3]*b.y[3] },
{ a.x[0]*b.z[0] + a.y[0]*b.z[1] + a.z[0]*b.z[2] + a.w[0]*b.z[3], a.x[1]*b.z[0] + a.y[1]*b.z[1] + a.z[1]*b.z[2] + a.w[1]*b.z[3], a.x[2]*b.z[0] + a.y[2]*b.z[1] + a.z[2]*b.z[2] + a.w[2]*b.z[3], a.x[3]*b.z[0] + a.y[3]*b.z[1] + a.z[3]*b.z[2] + a.w[3]*b.z[3] },
{ a.x[0]*b.w[0] + a.y[0]*b.w[1] + a.z[0]*b.w[2] + a.w[0]*b.w[3], a.x[1]*b.w[0] + a.y[1]*b.w[1] + a.z[1]*b.w[2] + a.w[1]*b.w[3], a.x[2]*b.w[0] + a.y[2]*b.w[1] + a.z[2]*b.w[2] + a.w[2]*b.w[3], a.x[3]*b.w[0] + a.y[3]*b.w[1] + a.z[3]*b.w[2] + a.w[3]*b.w[3] } };
}
again, the above implementation is for column major matrices. That means that a.x is the first column of the matrix, not the row.
A key thing to notice is that the view_mat_t * proj_mat_t on the CPU
matched the proj_mat * view_mat on the GPU. Does anyone know why?
The reason for this is that for two matrices A, B: A * B = (B' * A')', where ' indicates the transpose operation. As already pointed out by yourself, your math code (as well as popular math libraries such as GLM) uses a row-major representation of matrices, while OpenGL (by default) uses a column-major representation. What this means is that the matrix A,
(a b c)
A = (d e f)
(g h i)
in your CPU math library is stored in memory as [a, b, c, d, e, f, g, h, i], whereas defined in a GLSL shader, it would be stored as [a, d, g, b, e, h, c, f, i]. So if you upload the data [a, b, c, d, e, f, g, h, i] of the GLM matrix with glUniformMatrix3fv with the transpose parameter set to GL_FALSE, then the matrix you will see in GLSL is
(a d g)
A' = (b e h)
(c f i)
which is the transposed original matrix. Having realized that changing the interpretation of the matrix data between row-major and column-major leads to a transposed version of the original matrix, you can now explain why suddenly the matrix multiplication works the other way around. Your view_mat_t and proj_mat_t on the CPU are interpreted as view_mat_t' and proj_mat_t' in your GLSL shader, so uploading the pre-calculated view_mat_t * proj_mat_t to the shader will lead to the same result as uploading both matrices separately and then calculating proj_mat_t * view_mat_t.
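A quick way to convince yourself of the identity A * B = (B' * A')' in GLSL itself (a hypothetical snippet of my own, not part of the answer):
layout (location = 0) out vec4 Output;
void main()
{
mat2 A = mat2(1.0, 3.0, 2.0, 4.0); // column-major: columns are (1,3) and (2,4)
mat2 B = mat2(0.0, 1.0, 1.0, 0.0);
mat2 lhs = A * B;
mat2 rhs = transpose(transpose(B) * transpose(A));
// lhs and rhs are identical for any A and B
Output = vec4(lhs[0] - rhs[0], lhs[1] - rhs[1]); // all zeros
}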

Double's multiplication is less precise than float's one

Suppose we have the equation y = k1 * x + b1 = k2 * x + b2. Let's calculate x in floats. I know that it's a bad choice, but I want to understand the reason for the results I get. Also, let's calculate y using this x, and then do the same but with double(x). Consider this code:
std::cout.precision(20);
float k1, b1, k2, b2;
std::cin >> k1 >> b1 >> k2 >> b2;
float x_f = (b2 - b1) / (k1 - k2);
double x_d = x_f;
printFloat(x_f); // my function which prints the number and its binary representation
printDouble(x_d);
float y_f = x_f * k1 + b1;
double y_d = x_d * k1 + b1;
printFloat(y_f);
printDouble(y_d);
And with k1 = -4653, b1 = 9968, k2 = 520, b2 = -1370 I surprisingly got the following results:
x_f = 2.19176483154296875 01000000000011000100010111100000
x_d = 2.19176483154296875 0100000000000001100010001011110000000000000000000000000000000000
y_f = -230.2822265625 11000011011001100100100001000000
y_d = -230.28176116943359375 1100000001101100110010010000010000110000000000000000000000000000
Whereas the more precise answer (calculated with Python Decimal) is:
x = 2.191764933307558476705973323023390682389
y = -230.28223468006959211289387202783684516
And the float answer is closer than the double one! Why is this happening? I've debugged with gdb (compiled with g++ 4.8.4 on 64-bit Ubuntu 14.04) and viewed the instructions, and they are all OK, so it's due to the multiplication.
It's a coincidence that the rounding cancels out and ends up being closer with float than with double. The origin of the difference is that x_d * k1 is promoted to double whereas x_f * k1 is evaluated as float.
To provide a simpler example of how this kind of rounding can cause a lower-precision type to produce a more accurate answer, consider two new numeric types called sf2 and sf3, each of which store base-10 numbers, with 2 and 3 significant digits, respectively. Then consider the following calculation:
// Calculate (5 / 4) * 8. Expected result: 10
sf2 x_2 = 5.0 / 4.0; // 1.3
sf2 y_2 = x_2 * 8.0; // 10
sf3 x_3 = x_2; // 1.30
sf3 y_3 = x_3 * 8.0; // 10.4
Note that using the above types, even though all sf2 values are representable in the sf3 type, the sf2 calculation is more accurate. This is because the round-up of 1.25 to 1.3 when calculating x_2 is exactly canceled when rounding 10.4 to 10. But when the second calculation is done using the sf3 type, the initial round-up is persisted but the round-down no longer occurs.
This is an example of the many pitfalls you will encounter when dealing with floating point types.

Efficient Bicubic filtering code in GLSL?

I'm wondering if anyone has complete, working, and efficient code to do bicubic texture filtering in GLSL. There is this:
http://www.codeproject.com/Articles/236394/Bi-Cubic-and-Bi-Linear-Interpolation-with-GLSL
or
https://github.com/visionworkbench/visionworkbench/blob/master/src/vw/GPU/Shaders/Interp/interpolation-bicubic.glsl
but both do 16 texture reads where only 4 are necessary:
https://groups.google.com/forum/#!topic/comp.graphics.api.opengl/kqrujgJfTxo
However, the method above uses a "cubic()" function whose definition is missing and whose purpose I don't know; it also takes an unexplained "texscale" parameter.
There is also the NVidia version:
https://developer.nvidia.com/gpugems/gpugems2/part-iii-high-quality-rendering/chapter-20-fast-third-order-texture-filtering
but I believe this uses CUDA, which is specific to NVIDIA's cards. I need GLSL.
I could probably port the NVIDIA version to GLSL, but thought I'd ask first to see if anyone already has a complete, working GLSL bicubic shader.
I found this implementation which can be used as a drop-in replacement for texture() (from http://www.java-gaming.org/index.php?topic=35123.0 (one typo fixed)):
// from http://www.java-gaming.org/index.php?topic=35123.0
vec4 cubic(float v){
vec4 n = vec4(1.0, 2.0, 3.0, 4.0) - v;
vec4 s = n * n * n;
float x = s.x;
float y = s.y - 4.0 * s.x;
float z = s.z - 4.0 * s.y + 6.0 * s.x;
float w = 6.0 - x - y - z;
return vec4(x, y, z, w) * (1.0/6.0);
}
vec4 textureBicubic(sampler2D sampler, vec2 texCoords){
vec2 texSize = textureSize(sampler, 0);
vec2 invTexSize = 1.0 / texSize;
texCoords = texCoords * texSize - 0.5;
vec2 fxy = fract(texCoords);
texCoords -= fxy;
vec4 xcubic = cubic(fxy.x);
vec4 ycubic = cubic(fxy.y);
vec4 c = texCoords.xxyy + vec2 (-0.5, +1.5).xyxy;
vec4 s = vec4(xcubic.xz + xcubic.yw, ycubic.xz + ycubic.yw);
vec4 offset = c + vec4 (xcubic.yw, ycubic.yw) / s;
offset *= invTexSize.xxyy;
vec4 sample0 = texture(sampler, offset.xz);
vec4 sample1 = texture(sampler, offset.yz);
vec4 sample2 = texture(sampler, offset.xw);
vec4 sample3 = texture(sampler, offset.yw);
float sx = s.x / (s.x + s.y);
float sy = s.z / (s.z + s.w);
return mix(
mix(sample3, sample2, sx), mix(sample1, sample0, sx)
, sy);
}
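For completeness, a hypothetical usage sketch (the sampler and varying names are my own, not from the answer; the cubic() and textureBicubic() functions above are assumed to be pasted into the same fragment shader):
uniform sampler2D colorTex;
in vec2 uv;
layout (location = 0) out vec4 FragColor;
void main()
{
// drop-in replacement for texture(colorTex, uv), but bicubically filtered
FragColor = textureBicubic(colorTex, uv);
}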
Example: Nearest, bilinear, bicubic:
The ImageData of this image is
{{{0.698039, 0.996078, 0.262745}, {0., 0.266667, 1.}, {0.00392157, 0.25098, 0.996078}, {1., 0.65098, 0.}},
 {{0.996078, 0.823529, 0.}, {0.498039, 0., 0.00392157}, {0.831373, 0.00392157, 0.00392157}, {0.956863, 0.972549, 0.00784314}},
 {{0.909804, 0.00784314, 0.}, {0.87451, 0.996078, 0.0862745}, {0.196078, 0.992157, 0.760784}, {0.00392157, 0.00392157, 0.498039}},
 {{1., 0.878431, 0.}, {0.588235, 0.00392157, 0.00392157}, {0.00392157, 0.0666667, 0.996078}, {0.996078, 0.517647, 0.}}}
I tried to reproduce this (and many other interpolation techniques), but they use clamped padding, while I have repeating (wrapping) boundaries, so it is not exactly the same.
It seems this bicubic business is not a proper interpolation, i.e. it does not take on the original values at the points where the data is defined.
I decided to take a minute to dig through my old Perforce activity and found the missing cubic() function; enjoy! :)
vec4 cubic(float v)
{
vec4 n = vec4(1.0, 2.0, 3.0, 4.0) - v;
vec4 s = n * n * n;
float x = s.x;
float y = s.y - 4.0 * s.x;
float z = s.z - 4.0 * s.y + 6.0 * s.x;
float w = 6.0 - x - y - z;
return vec4(x, y, z, w);
}
Wow. I recognize the code above (I cannot comment with reputation < 50) as I came up with it in early 2011. The problem I was trying to solve was related to an old IBM T42 (sorry, the exact model number escapes me) laptop and its ATI graphics stack. I developed the code on an NV card, and originally I used 16 texture fetches. That was kind of slow but fast enough for my purposes. When someone reported it did not work on his laptop, it became apparent that it did not support enough texture fetches per fragment. I had to engineer a work-around, and the best I could come up with was to do it with a number of texture fetches that would work.
I thought about it like this: okay, so if I handle each quad (2x2) with linear filter the remaining problem is can the rows and columns share the weights? That was the only problem on my mind when I set out to craft the code. Of course they could be shared; the weights are same for each column and row; perfect!
Now I had four samples. The remaining problem was how to correctly combine the samples. That was the biggest obstacle to overcome. It took about 10 minutes with pencil and paper. With trembling hands I typed the code in and it worked, nice. Then I uploaded the binaries to the guy who promised to check it out on his T42 (?) and he reported it worked. The end. :)
I can assure you that the equations check out and give mathematically identical results to computing the samples individually. FYI: on a CPU it's faster to do the horizontal and vertical scans separately. On a GPU, multiple passes are not that great an idea, especially since it's probably not feasible anyway in the typical use case.
Food for thought: it is possible to use a texture lookup for the cubic() function. Which is faster depends on the GPU, but generally speaking the sampler load is already light, so just doing the arithmetic on the ALU side would balance things out. YMMV.
The missing function cubic() in JAre's answer could look like this:
vec4 cubic(float x)
{
float x2 = x * x;
float x3 = x2 * x;
vec4 w;
w.x = -x3 + 3*x2 - 3*x + 1;
w.y = 3*x3 - 6*x2 + 4;
w.z = -3*x3 + 3*x2 + 3*x + 1;
w.w = x3;
return w / 6.f;
}
It returns the four weights for cubic B-Spline.
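Written out, these are the uniform cubic B-spline basis functions evaluated at x (my own restatement of the same weights):
w0(x) = (1 - x)^3 / 6
w1(x) = (3x^3 - 6x^2 + 4) / 6
w2(x) = (-3x^3 + 3x^2 + 3x + 1) / 6
w3(x) = x^3 / 6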
It is all explained in NVIDIA GPU Gems.
(EDIT)
Cubic() is a cubic spline function.
texscale is a sampling window size coefficient; you can start with a value of 1.0.
Example:
vec4 filter(sampler2D texture, vec2 texcoord, vec2 texscale)
{
float fx = fract(texcoord.x);
float fy = fract(texcoord.y);
texcoord.x -= fx;
texcoord.y -= fy;
vec4 xcubic = cubic(fx);
vec4 ycubic = cubic(fy);
vec4 c = vec4(texcoord.x - 0.5, texcoord.x + 1.5, texcoord.y - 0.5, texcoord.y + 1.5);
vec4 s = vec4(xcubic.x + xcubic.y, xcubic.z + xcubic.w, ycubic.x + ycubic.y, ycubic.z + ycubic.w);
vec4 offset = c + vec4(xcubic.y, xcubic.w, ycubic.y, ycubic.w) / s;
vec4 sample0 = texture2D(texture, vec2(offset.x, offset.z) * texscale);
vec4 sample1 = texture2D(texture, vec2(offset.y, offset.z) * texscale);
vec4 sample2 = texture2D(texture, vec2(offset.x, offset.w) * texscale);
vec4 sample3 = texture2D(texture, vec2(offset.y, offset.w) * texscale);
float sx = s.x / (s.x + s.y);
float sy = s.z / (s.z + s.w);
return mix(
mix(sample3, sample2, sx),
mix(sample1, sample0, sx), sy);
}
Source
For anybody interested in GLSL code to do tri-cubic interpolation, ray-casting code using cubic interpolation can be found in the examples/glCubicRayCast folder in:
http://www.dannyruijters.nl/cubicinterpolation/CI.zip
edit: The cubic interpolation code is now available on github: CUDA version and WebGL version, and GLSL sample.
I've been using @Maf's cubic spline recipe for over a year, and I recommend it, if a cubic B-spline meets your needs.
But I recently realized that, for my particular application, it is important for the intensities to match exactly at the sample points. So I switched to using a Catmull-Rom spline, which uses a slightly different recipe like so:
// Catmull-Rom spline actually passes through control points
vec4 cubic(float x) // cubic_catmullrom(float x)
{
const float s = 0.5; // potentially adjustable parameter
float x2 = x * x;
float x3 = x2 * x;
vec4 w;
w.x = -s*x3 + 2*s*x2 - s*x + 0;
w.y = (2-s)*x3 + (s-3)*x2 + 1;
w.z = (s-2)*x3 + (3-2*s)*x2 + s*x + 0;
w.w = s*x3 - s*x2 + 0;
return w;
}
I found these coefficients, plus those for a number of other flavors of cubic splines, in the lecture notes at:
http://www.cs.cmu.edu/afs/cs/academic/class/15462-s10/www/lec-slides/lec06.pdf
I think it is possible that the Catmull version could be done with 4 texture lookups by (a) arranging the input texture like a chessboard with alternate slots saved as positives and as negatives, and (b) an associated modification of textureBicubic. That would rely on the contributions/weights w.x/w.w always being negative, and the contributions w.y/w.z always being positive. I haven't double-checked if this is true, or exactly how the modified textureBicubic would look.
... I have verified that w contributions do satisfy the +ve -ve rules.