Surprising behavior of trigonometric functions in WebGL fragment shader - GLSL

In the following shader, m1 and m2 should have the same value because cos(asin(x)) == sqrt(1.0 - x*x).
However, the field produced using m1 shows a black ring in the lower left corner whereas m2 produces the expected smooth field:
precision highp float;

void main() {
    float scale = 10000.0;
    float p = length(gl_FragCoord.xy / scale);
    float m1 = cos(asin(p));
    float m2 = sqrt(1.0 - p*p);
    float v = asin(m1); // change to m2 to see correct behavior
    float c = degrees(v) / 90.0;
    gl_FragColor = vec4(vec3(c), 1.0);
}
This behavior is really puzzling. What explains the black ring? I thought it might be a precision issue, but highp produces the same result. Or perhaps the black ring represents NaN results, but NaNs shouldn't occur there.
This reproduces on MacOS 10.10.5 in Chrome and Firefox, but does not reproduce on Windows 10 or iOS 9.3.3. Could something like this be a driver issue?
(For the curious, these formulas calculate latitude for an orthographic projection centered on the north pole.)
--UPDATE--
Confirmed today that MacOS 10.11.6 does not show the rendering error. This really seems like a driver/OS issue.

According to the spec:
asin(x): Results are undefined if |x| > 1.
and
sqrt(x): Results are undefined if x < 0.
Does either of those point to the issue? Try:
float m1 = cos(asin(clamp(p, -1., 1.)));
float m2 = sqrt(abs(1.0 - p*p));
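If you want to test the NaN hypothesis directly, here is a minimal sketch (my addition, not part of the original answer): NaN is the only value that compares unequal to itself, and GLSL ES 1.00 has no isnan(), so v != v is the usual workaround (though an aggressive compiler may fold the test away):

precision highp float;

void main() {
    float scale = 10000.0;
    float p = length(gl_FragCoord.xy / scale);
    float v = asin(cos(asin(p)));
    if (v != v) {
        gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0); // paint NaN pixels red
    } else {
        gl_FragColor = vec4(vec3(degrees(v) / 90.0), 1.0);
    }
}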

Related

Differing floating point behaviour between uniform and constants in GLSL

I am trying to implement emulated double precision in GLSL, and I observe a strange difference in behaviour leading to subtle floating point errors.
Consider the following fragment shader, writing to a 4-float texture to print the output.
layout (location = 0) out vec4 Output;
uniform float s;

void main()
{
    float a = 0.1f;
    float b = s;
    const float split = 8193.0; // = 2^13 + 1
    // Veltkamp-style split: ca - (ca - a) should round away the low-order
    // mantissa bits, leaving the high part of a (likewise cb/b).
    float ca = split * a;
    float cb = split * b;
    float v1a = ca - (ca - a);
    float v1b = cb - (cb - b);
    Output = vec4(a, b, v1a, v1b);
}
This is the output I observe:
GLSL output with uniform :
a = 0.1 0x3dcccccd
b = 2.86129e-06 0x36400497
v1a = 0.0999756 0x3dccc000
v1b = 2.86129e-06 0x36400497
Now, with the given values of a and b as inputs, the value of v1b does not have the expected result. Or at least it does not have the same result as on the CPU (as seen here):
C++ output :
a = 0.100000 0x3dcccccd
b = 0.000003 0x36400497
v1a = 0.099976 0x3dccc000
v1b = 0.000003 0x36400000
Note the discrepancy for the value of v1b (0x36400497 vs 0x36400000).
So, in an effort to figure out what was happening (and who was right), I attempted to redo the computation in GLSL with the uniform replaced by a constant, using a slightly modified shader where I substituted the uniform's value directly:
layout (location = 0) out vec4 Output;

void main()
{
    float a = 0.1f;
    float b = uintBitsToFloat(0x36400497u);
    const float split = 8193.0; // = 2^13 + 1
    float ca = split * a;
    float cb = split * b;
    float v1a = ca - (ca - a);
    float v1b = cb - (cb - b);
    Output = vec4(a, b, v1a, v1b);
}
This time, I get the same output as the C++ version of the same computation.
GLSL output with constants :
a = 0.1 0x3dcccccd
b = 2.86129e-06 0x36400497
v1a = 0.0999756 0x3dccc000
v1b = 2.86102e-06 0x36400000
My question is: what makes the floating point computation behave differently between a uniform variable and a constant? Is this some kind of behind-the-scenes compiler optimization?
Here are my OpenGL vendor strings from my laptop's Intel GPU, but I also observed the same behaviour on an nVidia card.
Renderer : Intel(R) HD Graphics 520
Vendor : Intel
OpenGL : 4.5.0 - Build 23.20.16.4973
GLSL : 4.50 - Build 23.20.16.4973
So, as mentioned by @njuffa in the comments, the problem was solved by using the precise qualifier on the values whose computation must follow rigorous IEEE 754 operation order:
layout (location = 0) out vec4 Output;
uniform float s;

void main()
{
    float a = 0.1f;
    float b = s;
    const float split = 8193.0; // = 2^13 + 1
    precise float ca = split * a;
    precise float cb = split * b;
    precise float v1a = ca - (ca - a);
    precise float v1b = cb - (cb - b);
    Output = vec4(a, b, v1a, v1b);
}
Output :
a = 0.1 0x3dcccccd
b = 2.86129e-06 0x36400497
v1a = 0.0999756 0x3dccc000
v1b = 2.86102e-06 0x36400000
Edit: it is highly probable that only the last two precise qualifiers (on v1a and v1b) are needed, since they constrain the operations leading to their computation and thus avoid the unwanted optimization.
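A sketch of that idea (an untested assumption on my part, relying on precise constraining all the operations that feed a precise result, as the edit above suggests):

float ca = split * a;
float cb = split * b;
precise float v1a = ca - (ca - a); // qualifying only the results should forbid
precise float v1b = cb - (cb - b); // reassociation/FMA contraction in the chains feeding them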
GPUs do not necessarily have/use IEEE 754; some implementations use a smaller number of mantissa bits, so it is no surprise the results differ. It's the same as comparing float vs. double results on an FPU. However, you can try to enforce precision if your GLSL implementation allows it; see:
In OpenGL ES 2.0 / GLSL, where do you need precision specifiers?
I am not sure whether that also applies to standard GL/GLSL, as I never used it.
In the worst case, use double and dvec if your GPU allows it, but beware there are no 64-bit interpolators yet (at least to my knowledge).
To rule out rounding caused by passing the results through a texture, see:
GLSL debug prints
You can also check the number of mantissa bits on your GPU simply by printing
1.0 + 1.0/2.0
1.0 + 1.0/4.0
1.0 + 1.0/8.0
1.0 + 1.0/16.0
...
1.0 + 1.0/2.0^i
The largest i for which the result is still not printed as 1.0 is the number of mantissa bits, so you can check whether it is 23 or not...
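For instance, a fragment-shader sketch of that probe might look like this (my addition; the one uniform is an assumption, there to keep the compiler from constant-folding the whole test):

uniform float one; // set to 1.0 by the application
layout (location = 0) out vec4 Output;

void main()
{
    float eps = one;
    float bits = 0.0;
    for (int i = 0; i < 32; i++) {
        eps *= 0.5;
        if (one + eps > one) bits += 1.0; // one more representable mantissa bit
        else break;
    }
    // IEEE 754 single precision yields 23, i.e. an output of 23/32
    Output = vec4(vec3(bits / 32.0), 1.0);
}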

What is this technique, if not bilinear filtering?

I'm trying to replicate the automatic bilinear filtering algorithm of Unity3D using the following code:
fixed4 GetBilinearFilteredColor(float2 texcoord)
{
    fixed4 s1 = SampleSpriteTexture(texcoord + float2(0.0, _MainTex_TexelSize.y));
    fixed4 s2 = SampleSpriteTexture(texcoord + float2(_MainTex_TexelSize.x, 0.0));
    fixed4 s3 = SampleSpriteTexture(texcoord + float2(_MainTex_TexelSize.x, _MainTex_TexelSize.y));
    fixed4 s4 = SampleSpriteTexture(texcoord);

    float2 TexturePosition = float2(texcoord) * _MainTex_TexelSize.z;
    float fu = frac(TexturePosition.x);
    float fv = frac(TexturePosition.y);

    float4 tmp1 = lerp(s4, s2, fu);
    float4 tmp2 = lerp(s1, s3, fu);
    return lerp(tmp1, tmp2, fv);
}

fixed4 frag(v2f IN) : SV_Target
{
    fixed4 c = GetBilinearFilteredColor(IN.texcoord) * IN.color;
    c.rgb *= c.a;
    return c;
}
I thought I was using the correct algorithm, because it is the only one I have seen out there for bilinear. But I tried it in Unity with the same texture duplicated:
1st texture: Point filtering, using the custom bilinear shader (made from the default sprite shader).
2nd texture: Bilinear filtering, using the default sprite shader.
And this is the result:
You can see that they are different, and there is also some displacement in my custom shader that makes the sprite off-center when rotating around the Z axis.
Any idea what I'm doing wrong?
Any idea what Unity3D does differently?
Is there another algorithm that matches Unity3D's default filtering?
Solution
Updated with the complete code solution, incorporating Nico's fix, for other people who search for it here:
fixed4 GetBilinearFilteredColor(float2 texcoord)
{
    fixed4 s1 = SampleSpriteTexture(texcoord + float2(0.0, _MainTex_TexelSize.y));
    fixed4 s2 = SampleSpriteTexture(texcoord + float2(_MainTex_TexelSize.x, 0.0));
    fixed4 s3 = SampleSpriteTexture(texcoord + float2(_MainTex_TexelSize.x, _MainTex_TexelSize.y));
    fixed4 s4 = SampleSpriteTexture(texcoord);

    float2 TexturePosition = float2(texcoord) * _MainTex_TexelSize.z;
    float fu = frac(TexturePosition.x);
    float fv = frac(TexturePosition.y);

    float4 tmp1 = lerp(s4, s2, fu);
    float4 tmp2 = lerp(s1, s3, fu);
    return lerp(tmp1, tmp2, fv);
}

fixed4 frag(v2f IN) : SV_Target
{
    fixed4 c = GetBilinearFilteredColor(IN.texcoord - 0.498 * _MainTex_TexelSize.xy) * IN.color;
    c.rgb *= c.a;
    return c;
}
And the image test with the result:
Why not subtract exactly 0.5?
If you test it, you will see some edge cases where it jumps to (pixel - 1).
Let's take a closer look at what you are actually doing. I will stick to the 1D case because it is easier to visualize.
You have an array of pixels and a texture position. I assume _MainTex_TexelSize.z is set such that it gives pixel coordinates. This is what you get (the boxes represent pixels, the numbers in the boxes are the pixel numbers, and the numbers below are the pixel-space coordinates):
With your sampling (assuming nearest-point sampling), you will get pixels 2 and 3. However, you can see that the interpolation coordinate for lerp is actually wrong. You will pass the fractional part of the texture position (i.e. 0.8), but it should be 0.3 (= 0.8 - 0.5). The reasoning behind this is quite simple: if you land at the center of a pixel, you want to use that pixel's value. If you land right in the middle between two pixels, you want to use the average of both pixel values (i.e. an interpolation value of 0.5). Right now, you basically have an offset of half a pixel to the left.
When you solve the first problem, there is a second one:
In this case, you actually want to blend between pixel 1 and 2. But because you always go to the right in your sampling, you will blend between 2 and 3. Again, with a wrong interpolation value.
The solution should be quite simple: Subtract half of the pixel width from the texture coordinate before doing anything with it, which is probably just the following (assuming that your variables hold the things I think):
fixed4 c = GetBilinearFilteredColor(IN.texcoord - 0.5 * _MainTex_TexelSize.xy) * IN.color;
Another reason why the results are different could be that Unity actually uses a different filter, e.g. bicubic (but I don't know). Also, the usage of mipmaps could influence the result.
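For reference, here is the corrected scheme as a self-contained GLSL sketch (my translation of the idea above into GLSL ES; the function and parameter names are mine, not Unity's):

// Manual bilinear filtering of a point-sampled texture.
// uv is the normalized texture coordinate, texSize the texture resolution.
vec4 bilinearManual(sampler2D tex, vec2 uv, vec2 texSize) {
    vec2 texelSize = 1.0 / texSize;
    vec2 pos = uv * texSize - 0.5;              // the half-pixel shift discussed above
    vec2 f = fract(pos);                        // interpolation weights
    vec2 base = (floor(pos) + 0.5) * texelSize; // center of the lower-left texel
    vec4 s00 = texture2D(tex, base);
    vec4 s10 = texture2D(tex, base + vec2(texelSize.x, 0.0));
    vec4 s01 = texture2D(tex, base + vec2(0.0, texelSize.y));
    vec4 s11 = texture2D(tex, base + texelSize);
    return mix(mix(s00, s10, f.x), mix(s01, s11, f.x), f.y);
}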

Efficient Bicubic filtering code in GLSL?

I'm wondering if anyone has complete, working, and efficient code to do bicubic texture filtering in GLSL. There is this:
http://www.codeproject.com/Articles/236394/Bi-Cubic-and-Bi-Linear-Interpolation-with-GLSL
or
https://github.com/visionworkbench/visionworkbench/blob/master/src/vw/GPU/Shaders/Interp/interpolation-bicubic.glsl
but both do 16 texture reads where only 4 are necessary:
https://groups.google.com/forum/#!topic/comp.graphics.api.opengl/kqrujgJfTxo
However, the method above relies on a cubic() function whose definition is missing, so I don't know what it is supposed to do, and it also takes an unexplained texscale parameter.
There is also the NVidia version:
https://developer.nvidia.com/gpugems/gpugems2/part-iii-high-quality-rendering/chapter-20-fast-third-order-texture-filtering
but I believe this uses CUDA, which is specific to NVidia cards. I need GLSL.
I could probably port the nvidia version to glsl, but thought I'd ask first to see if anyone already has a complete, working glsl bicubic shader.
I found this implementation, which can be used as a drop-in replacement for texture() (from http://www.java-gaming.org/index.php?topic=35123.0, with one typo fixed):
// from http://www.java-gaming.org/index.php?topic=35123.0
vec4 cubic(float v){
    vec4 n = vec4(1.0, 2.0, 3.0, 4.0) - v;
    vec4 s = n * n * n;
    float x = s.x;
    float y = s.y - 4.0 * s.x;
    float z = s.z - 4.0 * s.y + 6.0 * s.x;
    float w = 6.0 - x - y - z;
    return vec4(x, y, z, w) * (1.0/6.0);
}

vec4 textureBicubic(sampler2D sampler, vec2 texCoords){
    vec2 texSize = textureSize(sampler, 0);
    vec2 invTexSize = 1.0 / texSize;

    texCoords = texCoords * texSize - 0.5;
    vec2 fxy = fract(texCoords);
    texCoords -= fxy;

    vec4 xcubic = cubic(fxy.x);
    vec4 ycubic = cubic(fxy.y);

    vec4 c = texCoords.xxyy + vec2(-0.5, +1.5).xyxy;
    vec4 s = vec4(xcubic.xz + xcubic.yw, ycubic.xz + ycubic.yw);
    vec4 offset = c + vec4(xcubic.yw, ycubic.yw) / s;
    offset *= invTexSize.xxyy;

    vec4 sample0 = texture(sampler, offset.xz);
    vec4 sample1 = texture(sampler, offset.yz);
    vec4 sample2 = texture(sampler, offset.xw);
    vec4 sample3 = texture(sampler, offset.yw);

    float sx = s.x / (s.x + s.y);
    float sy = s.z / (s.z + s.w);

    return mix(mix(sample3, sample2, sx), mix(sample1, sample0, sx), sy);
}
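Usage is then a drop-in swap wherever texture() was used (myTexture and vUV are illustrative names):

vec4 color = textureBicubic(myTexture, vUV); // instead of texture(myTexture, vUV)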
Example: Nearest, bilinear, bicubic:
The ImageData of this image is:
{{{0.698039, 0.996078, 0.262745}, {0., 0.266667, 1.}, {0.00392157, 0.25098, 0.996078}, {1., 0.65098, 0.}},
 {{0.996078, 0.823529, 0.}, {0.498039, 0., 0.00392157}, {0.831373, 0.00392157, 0.00392157}, {0.956863, 0.972549, 0.00784314}},
 {{0.909804, 0.00784314, 0.}, {0.87451, 0.996078, 0.0862745}, {0.196078, 0.992157, 0.760784}, {0.00392157, 0.00392157, 0.498039}},
 {{1., 0.878431, 0.}, {0.588235, 0.00392157, 0.00392157}, {0.00392157, 0.0666667, 0.996078}, {0.996078, 0.517647, 0.}}}
I tried to reproduce this (many other interpolation techniques), but they have clamped padding while I have repeating (wrapping) boundaries, so it is not exactly the same.
It seems this bicubic business is not a proper interpolation, i.e. it does not take on the original values at the points where the data is defined. (As the Catmull-Rom answer further down explains, these are B-spline weights, which smooth rather than interpolate.)
I decided to take a minute to dig through my old Perforce activity and found the missing cubic() function; enjoy! :)
vec4 cubic(float v)
{
    vec4 n = vec4(1.0, 2.0, 3.0, 4.0) - v;
    vec4 s = n * n * n;
    float x = s.x;
    float y = s.y - 4.0 * s.x;
    float z = s.z - 4.0 * s.y + 6.0 * s.x;
    float w = 6.0 - x - y - z;
    // Note: no division by 6 here; the ratios taken in filter() below
    // cancel the common scale factor, so normalization is unnecessary.
    return vec4(x, y, z, w);
}
Wow. I recognize the code above (I cannot comment with reputation < 50), as I came up with it in early 2011. The problem I was trying to solve was related to an old IBM T42 (sorry, the exact model number escapes me) laptop and its ATI graphics stack. I developed the code on an NV card, and originally I used 16 texture fetches. That was kind of slow, but fast enough for my purposes. When someone reported it did not work on his laptop, it became apparent that it did not support enough texture fetches per fragment. I had to engineer a work-around, and the best I could come up with was to do it with a number of texture fetches that would work.
I thought about it like this: okay, so if I handle each quad (2x2) with the linear filter, the remaining problem is: can the rows and columns share the weights? That was the only problem on my mind when I set out to craft the code. Of course they could be shared; the weights are the same for each column and row; perfect!
Now I had four samples. The remaining problem was how to correctly combine the samples. That was the biggest obstacle to overcome. It took about 10 minutes with pencil and paper. With trembling hands I typed the code in, and it worked. Nice. Then I uploaded the binaries to the guy who promised to check it out on his T42 (?) and he reported it worked. The end. :)
I can assure you that the equations check out and give mathematically identical results to computing the samples individually. FYI: on a CPU it's faster to do the horizontal and vertical scans separately. On a GPU, multiple passes are not that great an idea, especially when it's probably not feasible anyway in the typical use case.
Food for thought: it is possible to use a texture lookup for the cubic() function. Which is faster depends on the GPU, but generally speaking the sampler is light on the ALU side; just doing the arithmetic would balance things out. YMMV.
The missing function cubic() in JAre's answer could look like this:
vec4 cubic(float x)
{
    float x2 = x * x;
    float x3 = x2 * x;
    vec4 w;
    w.x = -x3 + 3.0*x2 - 3.0*x + 1.0;
    w.y = 3.0*x3 - 6.0*x2 + 4.0;
    w.z = -3.0*x3 + 3.0*x2 + 3.0*x + 1.0;
    w.w = x3;
    return w / 6.0;
}
It returns the four weights for a cubic B-spline.
It is all explained in NVidia's GPU Gems.
(EDIT)
cubic() is a cubic spline function.
Example:
texscale is a sampling-window size coefficient; you can start with a value of 1.0.
vec4 filter(sampler2D texture, vec2 texcoord, vec2 texscale)
{
    float fx = fract(texcoord.x);
    float fy = fract(texcoord.y);
    texcoord.x -= fx;
    texcoord.y -= fy;

    vec4 xcubic = cubic(fx);
    vec4 ycubic = cubic(fy);

    vec4 c = vec4(texcoord.x - 0.5, texcoord.x + 1.5, texcoord.y - 0.5, texcoord.y + 1.5);
    vec4 s = vec4(xcubic.x + xcubic.y, xcubic.z + xcubic.w, ycubic.x + ycubic.y, ycubic.z + ycubic.w);
    vec4 offset = c + vec4(xcubic.y, xcubic.w, ycubic.y, ycubic.w) / s;

    vec4 sample0 = texture2D(texture, vec2(offset.x, offset.z) * texscale);
    vec4 sample1 = texture2D(texture, vec2(offset.y, offset.z) * texscale);
    vec4 sample2 = texture2D(texture, vec2(offset.x, offset.w) * texscale);
    vec4 sample3 = texture2D(texture, vec2(offset.y, offset.w) * texscale);

    float sx = s.x / (s.x + s.y);
    float sy = s.z / (s.z + s.w);

    return mix(mix(sample3, sample2, sx), mix(sample1, sample0, sx), sy);
}
Source
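As far as I can tell from the code, texcoord is expected in texel units here and texscale converts the computed offsets back to normalized coordinates, so a plausible call for an ordinary texture would be (my reading, not stated in the original post):

vec2 texSize = vec2(256.0, 256.0); // this texture's resolution
vec4 color = filter(myTex, uv * texSize, vec2(1.0) / texSize);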
For anybody interested in GLSL code to do tri-cubic interpolation, ray-casting code using cubic interpolation can be found in the examples/glCubicRayCast folder in:
http://www.dannyruijters.nl/cubicinterpolation/CI.zip
edit: The cubic interpolation code is now available on GitHub: CUDA version and WebGL version, and GLSL sample.
I've been using @Maf's cubic spline recipe for over a year, and I recommend it, if a cubic B-spline meets your needs.
But I recently realized that, for my particular application, it is important for the intensities to match exactly at the sample points. So I switched to using a Catmull-Rom spline, which uses a slightly different recipe like so:
// Catmull-Rom spline actually passes through the control points
vec4 cubic(float x) // cubic_catmullrom(float x)
{
    const float s = 0.5; // potentially adjustable parameter
    float x2 = x * x;
    float x3 = x2 * x;
    vec4 w;
    w.x = -s*x3 + 2.0*s*x2 - s*x + 0.0;
    w.y = (2.0-s)*x3 + (s-3.0)*x2 + 1.0;
    w.z = (s-2.0)*x3 + (3.0-2.0*s)*x2 + s*x + 0.0;
    w.w = s*x3 - s*x2 + 0.0;
    return w;
}
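A quick sanity check (my own arithmetic, not from the lecture notes): these four weights sum to 1 for every x, because the cubic terms give -s + (2-s) + (s-2) + s = 0, the quadratic terms give 2s + (s-3) + (3-2s) - s = 0, the linear terms give -s + s = 0, and the constants sum to 1. So, unlike the B-spline weights above, no final division by 6 is needed.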
I found these coefficients, plus those for a number of other flavors of cubic splines, in the lecture notes at:
http://www.cs.cmu.edu/afs/cs/academic/class/15462-s10/www/lec-slides/lec06.pdf
I think it is possible that the Catmull version could be done with 4 texture lookups by (a) arranging the input texture like a chessboard, with alternate slots saved as positives and as negatives, and (b) an associated modification of textureBicubic. That would rely on the contributions/weights w.x/w.w always being negative and the contributions w.y/w.z always being positive. I haven't double-checked if this is true, or exactly how the modified textureBicubic would look.
... I have verified that the w contributions do satisfy the positive/negative rules above.

GLSL gl_FragCoord.z Calculation and Setting gl_FragDepth

So, I've got an imposter (the real geometry is a cube, possibly clipped, and the imposter geometry is a Menger sponge) and I need to calculate its depth.
I can calculate the amount to offset in world space fairly easily. Unfortunately, I've spent hours failing to perturb the depth with it.
The only correct results I can get are when I go:
gl_FragDepth = gl_FragCoord.z;
Basically, I need to know how gl_FragCoord.z is calculated so that I can:
1. Take the inverse transformation from gl_FragCoord.z to eye space,
2. Add the depth perturbation, and
3. Transform this perturbed depth back into the same space as the original gl_FragCoord.z.
I apologize if this seems like a duplicate question; there are a number of other posts here that address similar things. However, after implementing all of them, none work correctly. Rather than trying to pick one to get help with, at this point I'm asking for complete code that does it. It should just be a few lines.
For future reference, the key code is:
float far = gl_DepthRange.far;
float near = gl_DepthRange.near;

vec4 eye_space_pos = gl_ModelViewMatrix * /*something*/;
vec4 clip_space_pos = gl_ProjectionMatrix * eye_space_pos;

float ndc_depth = clip_space_pos.z / clip_space_pos.w;
float depth = (((far - near) * ndc_depth) + near + far) / 2.0;
gl_FragDepth = depth;
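To connect this back to the question, a sketch of where the perturbation would go (object_pos and depth_offset are my placeholder names, not from the original post):

vec4 eye_space_pos = gl_ModelViewMatrix * vec4(object_pos, 1.0);
eye_space_pos.z -= depth_offset; // eye space looks down -z, so subtracting pushes the fragment away
vec4 clip_space_pos = gl_ProjectionMatrix * eye_space_pos;
float ndc_depth = clip_space_pos.z / clip_space_pos.w;
gl_FragDepth = ((gl_DepthRange.far - gl_DepthRange.near) * ndc_depth
                + gl_DepthRange.near + gl_DepthRange.far) / 2.0;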
For another future reference, this is the same formula as given by imallett, which was working for me in an OpenGL 4.0 application:
vec4 v_clip_coord = modelview_projection * vec4(v_position, 1.0);
float f_ndc_depth = v_clip_coord.z / v_clip_coord.w;
gl_FragDepth = (1.0 - 0.0) * 0.5 * f_ndc_depth + (1.0 + 0.0) * 0.5;
Here, modelview_projection is the 4x4 modelview-projection matrix and v_position is the object-space position of the pixel being rendered (in my case calculated by a raymarcher).
The equation comes from the window coordinates section of this manual. Note that in my code, near is 0.0 and far is 1.0, which are the default values of gl_DepthRange. Note that gl_DepthRange is not the same thing as the near/far distances in the formula for the perspective projection matrix! The only trick is using the 0.0 and 1.0 (or gl_DepthRange, in case you actually need to change it); I had been struggling for an hour with the other depth range, but that is already "baked" into my (perspective) projection matrix.
Note that this way, the equation really contains just a single multiply by a constant ((far - near) / 2) and a single addition of another constant ((far + near) / 2). Compare that to the multiply, add, and divide (possibly converted to a multiply by an optimizing compiler) required by the code of imallett.

gluProject on NDS?

I've been struggling with this for a good while now. I'm trying to determine the screen coordinates of the vertices of a model on the screen of my NDS using devKitPro. The library seems to implement some functionality of OpenGL, but in particular the gluProject function is missing, which (I assume) would let me do exactly that, easily.
I've been trying for a good while now to calculate the screen coordinates manually using the projection matricies that are stored in the DS's registers, but I haven't been having much luck, even when trying to build the projection matrix from scratch based on OpenGL's documentation. Here is the code I'm trying to use:
void get2DPoint(v16 x, v16 y, v16 z, float &result_x, float &result_y)
{
    //Wait for the graphics engine to be ready
    /*while (*(int*)(0x04000600) & BIT(27))
        continue;*/

    //Read in the matrix that we're currently transforming with
    //(fills the 4x4 row by row through the pointer to its first element)
    double currentMatrix[4][4];
    int i;
    for (i = 0; i < 16; i++)
        currentMatrix[0][i] = (double(((int*)0x04000640)[i])) / (double(1 << 12));

    //Now this hurts -- take that matrix, and multiply it by the projection
    //matrix, so we obtain proper screen coordinates.
    double f = 1.0 / tan(70.0 / 2.0);
    double aspect = 256.0 / 192.0;
    double zNear = 0.1;
    double zFar = 40.0;
    double projectionMatrix[4][4] =
    {
        { (f / aspect), 0.0, 0.0, 0.0 },
        { 0.0, f, 0.0, 0.0 },
        { 0.0, 0.0, ((zFar + zNear) / (zNear - zFar)), ((2 * zFar * zNear) / (zNear - zFar)) },
        { 0.0, 0.0, -1.0, 0.0 },
    };

    double finalMatrix[4][4];

    //Ugh...
    int mx = 0; int my = 0;
    for (my = 0; my < 4; my++)
        for (mx = 0; mx < 4; mx++)
            finalMatrix[mx][my] =
                currentMatrix[my][0] * projectionMatrix[0][mx] +
                currentMatrix[my][1] * projectionMatrix[1][mx] +
                currentMatrix[my][2] * projectionMatrix[2][mx] +
                currentMatrix[my][3] * projectionMatrix[3][mx];

    double dx = ((double)x) / (double(1 << 12));
    double dy = ((double)y) / (double(1 << 12));
    double dz = ((double)z) / (double(1 << 12));

    result_x = dx*finalMatrix[0][0] + dy*finalMatrix[0][1] + dz*finalMatrix[0][2] + finalMatrix[0][3];
    result_y = dx*finalMatrix[1][0] + dy*finalMatrix[1][1] + dz*finalMatrix[1][2] + finalMatrix[1][3];

    result_x = ((result_x * 1.0) + 4.0) * 32.0;
    result_y = ((result_y * 1.0) + 4.0) * 32.0;

    printf("Result: %f, %f\n", result_x, result_y);
}
There are lots of shifts involved; the DS works internally using fixed-point notation, and I need to convert that to doubles to work with. What I'm getting seems somewhat correct: the pixels are translated perfectly if I'm using a flat quad that's facing the screen, but the rotation is wonky. Also, since I'm going by the projection matrix (which accounts for the screen width/height?), the last steps I'm using don't seem right at all. Shouldn't the projection matrix accomplish the step up to screen resolution for me?
I'm rather new to all of this. I've got a fair grasp of matrix math, but I'm not as skilled as I would like to be in 3D graphics. Does anyone here know a way, given the 3D, non-transformed coordinates of a model's vertices, and also given the matrices which will be applied to them, to actually come up with the screen coordinates, without using OpenGL's gluProject function? Can you see something blatantly obvious that I'm missing in my code? (I'll clarify when possible; I know it's rough. This is a prototype I'm working on; cleanliness isn't a high priority.)
Thanks a bunch!
PS: As I understand it, currentMatrix, which I pull from the DS's registers, should be giving me the combined projection, translation, and rotation matrix, as it should be the exact matrix that's going to be used for the transformation by the DS's own hardware, at least according to the specs at GBATEK. In practice, it doesn't seem to actually have the projection applied to it, which I suppose has something to do with my issues. But I'm not sure, as calculating the projection myself isn't generating different results.
That is almost correct.
The correct steps are:
1. Multiply the modelview matrix with the projection matrix (as you've already done).
2. Extend your 3D vertex to a homogeneous coordinate by adding a w-component with value 1, e.g. your (x, y, z) vector becomes (x, y, z, w) with w = 1.
3. Multiply this vector with the matrix product. Your matrix should be 4x4 and your vector of size 4. The result will be a vector of size 4 as well (don't drop w yet!). The result of this multiplication is your vector in clip space. FYI: you can already do a couple of very useful things here with this vector, e.g. test whether the point is on the screen. The six conditions are:
x < -w : point is outside the screen (left of the viewport)
x >  w : point is outside the screen (right of the viewport)
y < -w : point is outside the screen (above the viewport)
y >  w : point is outside the screen (below the viewport)
z < -w : point is outside the screen (beyond znear)
z >  w : point is outside the screen (beyond zfar)
4. Project your point into 2D space. To do this, divide x and y by w:
x' = x / w;
y' = y / w;
If you're interested in the depth value (e.g. what gets written to the z-buffer), you can project z as well:
z' = z / w;
Note that the previous step won't work if w is zero. This case happens if your point is equal to the camera position. The best you can do then is to set x' and y' to zero (which will move the point to the center of the screen in the next step).
5. Final step: get the OpenGL viewport coordinates and apply them:
x_screen = viewport_left + (x' + 1) * viewport_width * 0.5;
y_screen = viewport_top + (y' + 1) * viewport_height * 0.5;
Important: the y coordinate of your screen may be upside down. Contrary to most other graphics APIs, in OpenGL y = 0 denotes the bottom of the screen.
That's all.
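Putting those steps together as code, a GLSL-style sketch of the same pipeline (mvp and viewport are my placeholder names):

uniform mat4 mvp;      // projection * modelview, per step 1
uniform vec4 viewport; // (left, bottom, width, height)

vec2 project(vec3 p) {
    vec4 clip = mvp * vec4(p, 1.0);                            // steps 2-3
    if (clip.w == 0.0) return viewport.xy + 0.5 * viewport.zw; // degenerate case
    vec2 ndc = clip.xy / clip.w;                               // step 4
    return viewport.xy + (ndc + 1.0) * viewport.zw * 0.5;      // step 5 (mind the y flip)
}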
I'll add some more thoughts to Nils' thorough answer.
Don't use doubles. I'm not familiar with the NDS, but I doubt it has any hardware for double math.
I also doubt that modelview and projection are stored separately if you are reading the hardware registers; I have yet to see a hardware platform that does not use the full MVP in the registers directly.
The matrix storage in the registers may or may not be in the same order as OpenGL's. If it is not, the matrix-vector multiplication needs to be done in the other order.