COLLADA: Inverse bind pose in the wrong space? - c++

I'm working on writing my own COLLADA importer. I've gotten pretty far, loading meshes and materials and such. But I've hit a snag on animation, specifically: joint rotations.
The formula I'm using for skinning my meshes is straightforward:
vec4 weighted = vec4(0); // accumulated result; assuming a 4-component vector type
for (int i = 0; i < joint_influences; i++)
{
    weighted +=
        joint[joint_index[i]]->parent->local_matrix *
        joint[joint_index[i]]->local_matrix *
        skin->inverse_bind_pose[joint_index[i]] *
        position *
        skin->weight[i]; // weight for influence i
}
position = weighted;
And as far as the literature is concerned, this is the correct formula. Now, COLLADA specifies two types of rotations for the joints: local and global. You have to concatenate the rotations together to get the local transformation for the joint.
What the COLLADA documentation does not make clear is the difference between the joint's local rotation and its global rotation. But in most of the models I've seen, rotations have an id of either rotate (global) or jointOrient (local).
When I disregard the global rotations and only use the local ones, I get the bind pose for the model. But when I add the global rotations to the joint's local transformation, strange things start to happen.
This is without using global rotations:
And this is with global rotations:
In both screenshots I'm drawing the skeleton using lines, but in the first it's invisible because the joints are inside the mesh. In the second screenshot the vertices are all over the place!
For comparison, this is what the second screenshot should look like:
It's hard to make out, but the joints are in the correct position in the second screenshot.
But now the weird thing. If I disregard the inverse bind pose as specified by COLLADA and instead take the inverse of the joint's parent local transform times the joint's local transform, I get the following:
In this screenshot I'm drawing a line from each vertex to the joints that have influence. The fact that I get the bind pose is not so strange, because the formula now becomes:
world_matrix * inverse_world_matrix * position * weight
But it leads me to suspect that COLLADA's inverse bind pose is in the wrong space.
So my question is: in what space does COLLADA specify its inverse bind pose? And how can I transform the inverse bind pose into the space I need?

I started by comparing my values to the ones I read from Assimp (an open source model loader). Stepping through the code I looked at where they built their bind matrices and their inverse bind matrices.
Eventually I ended up in SceneAnimator::GetBoneMatrices, which contains the following:
// Bone matrices transform from mesh coordinates in bind pose to mesh coordinates in skinned pose
// Therefore the formula is offsetMatrix * currentGlobalTransform * inverseCurrentMeshTransform
for (size_t a = 0; a < mesh->mNumBones; ++a)
{
    const aiBone* bone = mesh->mBones[a];
    const aiMatrix4x4& currentGlobalTransform =
        GetGlobalTransform(mBoneNodesByName[bone->mName.data]);
    mTransforms[a] = globalInverseMeshTransform * currentGlobalTransform * bone->mOffsetMatrix;
}
globalInverseMeshTransform is always identity, because the mesh doesn't transform anything. currentGlobalTransform is the bind matrix, the joint's parent's local matrices concatenated with the joint's local matrix. And mOffsetMatrix is the inverse bind matrix, which comes directly from the skin.
I checked the values of these matrices against my own (oh yes, I compared them in a watch window) and they were exactly the same, off by maybe 0.0001%, which is insignificant. So why does Assimp's version work and mine doesn't, even though the formula is the same?
Here's what I got:
When Assimp finally uploads the matrices to the skinning shader, they do the following:
helper->piEffect->SetMatrixTransposeArray( "gBoneMatrix", (D3DXMATRIX*)matrices, 60);
Waaaaait a second. They upload them transposed? It couldn't be that easy. No way.
Yup.
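For what it's worth, the OpenGL-side equivalent is to let glUniformMatrix4fv do the transpose on upload. This is only a hedged sketch: program, boneMatrices, boneCount and the gBoneMatrix uniform name are assumptions for illustration, not taken from Assimp or the code above.
// Requires an OpenGL loader that exposes GL 2.0+ (glad, GLEW, ...).
// Assumed: 'program' declares "uniform mat4 gBoneMatrix[60];" and
// 'boneMatrices' points at 'boneCount' contiguous row-major 4x4 matrices.
void uploadBoneMatrices(GLuint program, const float* boneMatrices, GLsizei boneCount)
{
    GLint loc = glGetUniformLocation(program, "gBoneMatrix");
    // GL_TRUE asks OpenGL to transpose each matrix on upload, mirroring
    // the SetMatrixTransposeArray call on the D3D side.
    glUniformMatrix4fv(loc, boneCount, GL_TRUE, boneMatrices);
}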
Something else I was doing wrong: I was converting the coordinates to the right system (centimeters to meters) before applying the skinning matrices. That results in completely distorted models, because the matrices are designed for the original coordinate system.
FUTURE GOOGLERS
Read all the node transforms (rotate, translation, scale, etc.) in the order you receive them.
Concatenate them to a joint's local matrix.
Take the joint's parent and multiply it with the local matrix.
Store that as the bind matrix.
Read the skin information.
Store the joint's inverse bind pose matrix.
Store the joint weights for each vertex.
Multiply the bind matrix with the inverse bind pose matrix and transpose it, call it the skinning matrix.
Multiply the skinning matrix with the position times the joint weight and add it to the weighted position.
Use the weighted position to render.
Done!
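For reference, here is a minimal sketch of those steps with GLM. The Joint struct and its field names are assumptions for illustration; the transpose mentioned above is only needed for row-major matrices and is omitted here because GLM is column-major.
#include <glm/glm.hpp>
#include <vector>

// Hypothetical joint record; field names are illustrative only.
struct Joint {
    glm::mat4 localMatrix;      // concatenated node transforms (rotate, translate, scale, ...)
    glm::mat4 bindMatrix;       // parent's bindMatrix * localMatrix
    glm::mat4 inverseBindPose;  // read from the skin
    int       parent = -1;      // index of the parent joint, -1 for the root
};

// Build each joint's bind matrix from the node hierarchy
// (assumes joints are ordered parent-before-child).
void buildBindMatrices(std::vector<Joint>& joints)
{
    for (auto& j : joints)
        j.bindMatrix = (j.parent >= 0 ? joints[j.parent].bindMatrix : glm::mat4(1.0f))
                       * j.localMatrix;
}

// Skin one vertex from its (joint index, weight) influences.
glm::vec4 skinVertex(const std::vector<Joint>& joints,
                     const glm::vec4& position,
                     const std::vector<std::pair<int, float>>& influences)
{
    glm::vec4 weighted(0.0f);
    for (const auto& [jointIndex, weight] : influences) {
        glm::mat4 skinningMatrix = joints[jointIndex].bindMatrix
                                 * joints[jointIndex].inverseBindPose;
        weighted += skinningMatrix * position * weight;
    }
    return weighted;
}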

BTW, if you transpose the matrices when loading them, rather than transposing the final matrix at the end (which can be problematic when animating), you need to perform your multiplications differently. The method you use above appears to be for skinning in DirectX with OpenGL-friendly matrices, hence the transpose.
In DirectX I transpose matrices when they are loaded from the file and then I use (in the example below I am simply applying the bind pose for the sake of simplicity):
XMMATRIX l_oWorldMatrix = XMMatrixMultiply( l_oBindPose, in_oParentWorldMatrix );
XMMATRIX l_oMatrixPallette = XMMatrixMultiply( l_oInverseBindPose, l_oWorldMatrix );
XMMATRIX l_oFinalMatrix = XMMatrixMultiply( l_oBindShapeMatrix, l_oMatrixPallette );

Related

OpenGL lookat matrix to camera extrinsic matrix

I am trying to render a 3D point cloud from depth data which I saved from an OpenGL framebuffer. Basically, I took depth samples from n different viewpoints (which are already known) for the rendered model centered at (0, 0, 0). I successfully saved the depth maps, but now I want to extract x, y, z coordinates from them. For this, I am back-projecting points from image space to world space. To get world coordinates I use the equation P = K_inv * [R|t]_inv * p.
To calculate the image intrinsics matrix I used the parameters of the OpenGL camera matrix, glm::perspective(fov, aspect, near_plane, far_plane): the intrinsic matrix K is built from the focal length derived from the field of view and the viewport size, with the principal point at the image center.
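A hedged sketch of that construction with GLM, assuming square pixels and a principal point at the image center:
#include <glm/glm.hpp>
#include <cmath>

// Pinhole intrinsics from the same parameters passed to glm::perspective
// (fovY is the vertical field of view in radians; width/height in pixels).
glm::mat3 intrinsicsFromPerspective(float fovY, float width, float height)
{
    float fy = 0.5f * height / std::tan(0.5f * fovY); // focal length in pixels
    float fx = fy;                                    // square pixels
    return glm::mat3(fx,           0.0f,          0.0f,   // column 0
                     0.0f,         fy,            0.0f,   // column 1
                     0.5f * width, 0.5f * height, 1.0f);  // column 2: principal point
}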
If I keep the coordinates in the camera frame (i.e., no extrinsic transformation [R|t]), I get a 3D model for a single image. To fuse multiple depth maps, I also need the extrinsic transformation, which I am calculating from the OpenGL lookat matrix glm::lookAt(eye=n_viewpoint_coordinates, center=(0, 0, 0), up=(0, 1, 0)), following http://ksimek.github.io/2012/08/22/extrinsic/.
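For reference, a sketch of that step with GLM: the view matrix returned by glm::lookAt is already the world-to-camera transform, i.e. [R|t] in homogeneous form (eye, center and up here mirror the call in the question).
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// World-to-camera transform for a viewpoint looking at the origin; the
// upper-left 3x3 block is R and the last column holds t.
glm::mat4 extrinsicsForViewpoint(const glm::vec3& eye)
{
    return glm::lookAt(eye,
                       glm::vec3(0.0f),              // center: model at the origin
                       glm::vec3(0.0f, 1.0f, 0.0f)); // up
}
Keep in mind that OpenGL cameras look down -Z while the usual computer-vision convention looks down +Z, so depending on how the depth values were generated, a flip of the Y and Z axes may still be needed on top of this.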
But when I fuse two depth images they are misaligned. I think the extrinsic matrix is not correct. I also tried to use the glm::lookat matrix directly, but that does not work either. The fused model snapshot is shown below.
Can someone suggest what is wrong with my approach? Is it the extrinsic matrix that is wrong (which I am damn sure it is)?
Finally, I managed to solve this myself. Instead of doing the transformation inside OpenGL, I did it outside of OpenGL. Basically, I kept the camera constant at some distance from the model and applied the rotation transformation to the model, and then finally rendered the model without a lookat matrix (or just a 4x4 identity matrix). I don't know why using the lookat matrix did not give me the result, or maybe it is due to something I was missing. To backproject the model into world coordinates I would just take the inverse of the exact transformation I applied initially before feeding the model to OpenGL.

How to compute vector transformations on the CPU?

I've been trying to transform my vertices outside the main graphics pipeline; I need them on the CPU. As simple as it seemed at first, I have spent a significant amount of time trying to implement this and simply failed. I have been trying to find the error in my method, but it seems fine to me.
I have my world, camera and projection matrices (the same ones I use in the main graphics pipeline to render objects) working. I use these matrices to transform the vectors with XMVector4Transform(). I set the w component of my vector to 1, but when I transform my vertices, instead of getting normalized outputs (between -1 and 1 while the 3D model is inside the screen space), I get values that are outside the screen, even though the same matrix transformations in the shader render the model inside the screen space.
After some digging I found the function XMVector4Normalize() to normalize the coordinates. After using it the results were normalized, but there is still a major offset between the CPU-computed vertices and those computed in the shader, and the offset increases as I move the objects toward the edges.
https://i.stack.imgur.com/bhnpB.png
In the above screenshot, the wireframe is rendered using the CPU-computed vertices and the solidly shaded version is rendered in the main pipeline. The offset I mentioned can be clearly observed.
PS : I am rendering the CPU computed verts just to test...
DirectX::XMVECTOR v1;
v1.m128_f32[0] = pMesh[i].GetVertices()[j].x;
v1.m128_f32[1] = pMesh[i].GetVertices()[j].y;
v1.m128_f32[2] = pMesh[i].GetVertices()[j].z;
v1.m128_f32[3] = 1.f;
projectedVectors[i].verts.emplace_back(XMVECTOR());
v1 = XMVector4Transform(XMVector4Transform(v1, *mView), *mProj);
v1 = XMVector4Normalize(v1);
The result you get from this transform is in so-called clip space. This is an artificial 4D-space, where the w-component denotes the common denominator for all other components. Therefore, instead of normalizing, you want to divide by w (so-called perspective divide).
Btw, instead of assigning the components of the vector to the registers yourself, you could also use XMLoadFloat4(). This takes care of the loading for you, using whatever instruction set is available.
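A minimal sketch of both suggestions with DirectXMath, assuming the question's mView/mProj matrices: load the vertex with XMLoadFloat4, transform to clip space, then divide by w instead of normalizing.
#include <DirectXMath.h>
using namespace DirectX;

// 'view' and 'proj' stand in for the question's *mView and *mProj;
// the vertex is expected with w = 1.
XMVECTOR XM_CALLCONV projectToNDC(const XMFLOAT4& pos, FXMMATRIX view, CXMMATRIX proj)
{
    XMVECTOR v = XMLoadFloat4(&pos);                          // loads x, y, z, w
    v = XMVector4Transform(v, XMMatrixMultiply(view, proj));  // clip space
    return XMVectorDivide(v, XMVectorSplatW(v));              // perspective divide -> NDC
}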

Matrix calculations for gpu skinning

I'm trying to do skeletal animation in OpenGL using Assimp as my model import library.
What exactly do I need to do with the bones' offsetMatrix variable? What do I need to multiply it by?
Let's take for instance this code, which I used to animate characters in a game I worked on. I used Assimp too to load bone information, and I read the OGL tutorial already pointed out by Nico myself.
glm::mat4 getParentTransform()
{
    if (this->parent)
        return parent->nodeTransform;
    else
        return glm::mat4(1.0f);
}

void updateSkeleton(Bone* bone = NULL)
{
    bone->nodeTransform = bone->getParentTransform() // this retrieves the transformation one level above in the tree
                        * bone->transform            // bone->transform is the assimp matrix assimp_node->mTransformation
                        * bone->localTransform;      // this is your T * R matrix

    bone->finalTransform = inverseGlobal        // which is scene->mRootNode->mTransformation from assimp
                         * bone->nodeTransform  // defined above
                         * bone->boneOffset;    // which is ai_mesh->mBones[i]->mOffsetMatrix

    for (size_t i = 0; i < bone->children.size(); i++) {
        updateSkeleton(&bone->children[i]);
    }
}
Essentially the GlobalTransform, as it is referred to in the tutorial Skeletal Animation with Assimp, or more properly the transform of the root node scene->mRootNode->mTransformation, is the transformation from local space to global space. To give you an example, when in a 3D modeler (let's pick Blender for instance) you create your mesh or load your character, it is usually positioned (by default) at the origin of the coordinate system and its rotation is set to the identity quaternion.
However, you can translate/rotate your mesh/character from the origin (0,0,0) to somewhere else, and a single scene can even contain multiple meshes with different positions. When you load them, especially if you do skeletal animation, it is mandatory to bring them back into local space (i.e., back to the origin (0,0,0)), and this is the reason why you have to multiply everything by the InverseGlobal (which brings your mesh back to local space).
After that you need to multiply it by the node transform, which is the product of the parentTransform (the transformation one level up in the tree; this is the overall transform), the transform (formerly assimp_node->mTransformation, which is just the transformation of the bone relative to its parent node), and your local transformation (any T * R you want to apply to do forward kinematics, inverse kinematics or key-frame interpolation).
Eventually there is the boneOffset (ai_mesh->mBones[i]->mOffsetMatrix) that transforms from mesh space to bone space in bind pose as stated in the documentation.
Here there is a link to GitHub if you want to look at the whole code for my Skeleton class.
Hope it helps.
The offset matrix defines the transform (translation, scale, rotation) that takes a vertex in mesh space and converts it to "bone" space. As an example, consider the following vertex and a bone with the following properties:
Vertex Position<0, 1, 2>
Bone Position<10, 2, 4>
Bone Rotation<0,0,0,1> // Note - no rotation
Bone Scale<1, 1, 1>
If we multiply a vertex by the offset matrix in this case we would get a vertex position of <-10, -1, -2>.
How do we use this? You have two options on how to use this matrix, which comes down to how we store the vertex data in the vertex buffers. The options are:
1) Store the mesh vertices in mesh space
2) Store the mesh vertices in bone space
In the case of #1, we would take the offsetMatrix and apply it to the vertices that are influenced by the bone as we build the vertex buffer. And then when we animate the mesh, we later apply the animated matrix for that bone.
In the case of #2, we would use the offsetMatrix in combination with the animation matrix for that bone when transforming the vertices stored in the vertex buffer. So it would be something like (Note: you may have to switch the matrix concatenations around here);
anim_vertex = (offset_matrix * anim_matrix) * mesh_vertex
Does this help?
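As a quick numeric check of the example above (a sketch, assuming the bone has no rotation or scale, so its offset matrix is just a translation by the negated bone position):
#include <cstdio>
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

int main()
{
    glm::vec4 vertex(0.0f, 1.0f, 2.0f, 1.0f);                           // Vertex Position<0, 1, 2>
    glm::mat4 offset = glm::translate(glm::mat4(1.0f),
                                      glm::vec3(-10.0f, -2.0f, -4.0f)); // inverse of Bone Position<10, 2, 4>
    glm::vec4 inBoneSpace = offset * vertex;
    std::printf("%g %g %g\n", inBoneSpace.x, inBoneSpace.y, inBoneSpace.z); // prints -10 -1 -2
    return 0;
}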
As I already assumed, the mOffsetMatrix is the inverse bind pose matrix. This tutorial states the correct transformations that you need for linear blend skinning:
You first need to evaluate your animation state. This will give you a system transform from animated bone space to world space for every bone (GlobalTransformation in the tutorial). The mOffsetMatrix is the system transform from world space to bind pose bone space. Therefore, what you do for skinning is the following (assuming that a specific vertex is influenced by a single bone): Transform the vertex to bone space with mOffsetMatrix. Now assume an animated bone and transform the intermediate result back from animated bone space to world space. So:
boneMatrix[i] = animationMatrix[i] * mOffsetMatrix[i]
If the vertex is influenced by multiple bones, LBS simply averages the results. That's where the weights come into play. Skinning is usually implemented in a vertex shader:
vec4 result = vec4(0.0);
for (int i = 0; i < MAX_BONES_PER_VERTEX; ++i)   // for each influencing bone i
    result += weight[i] * (boneMatrix[i] * vertexPos);
Usually, the maximum number of influencing bones is fixed and you can unroll the for loop.
The tutorial uses an additional m_GlobalInverseTransform for the boneMatrix. However, I have no clue why they do that. Basically, this undoes the overall transformation of the entire scene. Probably it is used to center the model in the view.

OpenGL : Bone Animation, Why Do I Need Inverse of Bind Pose When Working with GPU?

I implemented an MD5 loader with software skinning. The bind pose in MD5 is stored as final, absolute positions and rotations; you just need to do the computations for the weights, which are joint dependent.
I tried to implement GPU skinning but I am stuck at a point. Since these coordinates are final, why can't I just convert my 3D vectors and quaternions into matrices and upload them to the shader? As I have read here: http://3dgep.com/?p=1356, I need to multiply my skeleton with the inverse of the bind pose. But I don't understand this part, because I always thought the only thing I need to do is upload the final matrices to the GPU and calculate the rest there (sum of weights etc.).
Can you explain the behavior of the inverse bind pose to me?
As the original author of that article, I will try to explain what multiplying by the inverse bind pose does:
The "inverse bind pose" basically "undoes" any transformation that has already been applied to your model in its bind pose.
Consider it like this:
If you apply the identity matrix to every joint in the model then what you will get is your model in the bind pose (you can try this by sending a skeleton frame filled with identity matrices. If what results is the bind pose, then you are doing it right).
If you apply the bind pose matrices (uninverted) to every joint in the model then what you will get is spaghetti because you would be applying the bind pose twice!
So to fix the spaghetti model, you simply multiply the resulting joint transformations by the inverse bind pose to "undo" the transformation that have already been applied to your model.
I hope this clears it up a bit...
Honestly, the article is a bit much to completely work through. It seems that the inverse bind pose matrices are used to transform vertices to the bones' local coordinate systems.
This is necessary, because the bones' transformations are local (relative to their parent joints). So in order to animate a vertex, you have to transform it to a bone's local coordinate system, calculate the bone's local transforms and transform it back to the world system.
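A small sketch of the idea in both answers (the names are illustrative; animatedGlobal is the joint's animated model-space transform):
#include <glm/glm.hpp>

// Move the vertex from mesh/bind space into the joint's local space with the
// inverse bind pose, then back out with the joint's animated transform. In
// the bind pose animatedGlobal equals the bind-pose transform, so the product
// is the identity and the mesh stays exactly where it was modeled.
glm::mat4 skinningMatrix(const glm::mat4& animatedGlobal,
                         const glm::mat4& inverseBindPose)
{
    return animatedGlobal * inverseBindPose;
}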

Confused about OpenGL transformations

In opengl there is one world coordinate system with origin (0,0,0).
What confuses me is what all the transformations like glTranslate, glRotate, etc. do. Do they move objects in world coordinates, or do they move the camera? As you know, the same movement can be achieved by either moving the objects or moving the camera.
I am guessing that glTranslate and glRotate change the objects, and gluLookAt changes the camera?
In opengl there is one world coordinate system with origin (0,0,0).
Well, technically no.
What confuses me is what all the transformations like glTranslate, glRotate, etc. do? Do they move objects in world coordinates, or do they move the camera?
Neither. OpenGL doesn't know objects, OpenGL doesn't know a camera, OpenGL doesn't know a world. All that OpenGL cares about are primitives, points, lines or triangles, per vertex attributes, normalized device coordinates (NDC) and a viewport, to which the NDC are mapped to.
When you tell OpenGL to draw a primitive, each vertex is processed according to its attributes. The position is one of the attributes and is usually a vector with 1 to 4 scalar elements within a local "object" coordinate system. The task at hand is to somehow transform the local vertex position attribute into a position on the viewport. In modern OpenGL this happens within a small program, running on the GPU, called a vertex shader. The vertex shader may process the position in an arbitrary way. But the usual approach is to apply a number of nonsingular, linear transformations.
Such transformations can be expressed in terms of homogeneous transformation matrices. For a 3-dimensional vector, the homogeneous representation is a vector with 4 elements, where the 4th element is 1.
In computer graphics a 3-fold transformation pipeline has become sort of the standard way of doing things. First the object-local coordinates are transformed into coordinates relative to the virtual "eye", hence into eye space. In OpenGL this transformation used to be called the modelview transformation. With the vertex positions in eye space, several calculations, like illumination, can be expressed in a generalized way, hence those calculations happen in eye space. Next the eye space coordinates are transformed into the so-called clip space. This transformation maps some volume in eye space to a specific volume with certain boundaries, to which the geometry is clipped. Since this transformation effectively applies a projection, in OpenGL it used to be called the projection transformation.
After clip space, the positions get "normalized" by their homogeneous component, yielding normalized device coordinates, which are then plainly mapped to the viewport.
To recapitulate:
A vertex position is transformed from local to clip space by
vpos_eye = MV · vpos_local
eyespace_calculations(vpos_eye);
vpos_clip = P · vpos_eye
·: matrix * column vector multiplication
Then to reach NDC
vpos_ndc = vpos_clip / vpos_clip.w
and finally to the viewport (NDC coordinates are in the range [-1, 1])
vpos_viewport = (vpos_ndc + (1,1,1,1)) * (viewport.width, viewport.height) / 2 + (viewport.x, viewport.y)
*: vector component wise multiplication
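Here is what that pipeline looks like as code (a sketch with GLM; MV, P and the viewport values are placeholders for whatever your application actually uses):
#include <glm/glm.hpp>

glm::vec2 localToViewport(const glm::vec3& vposLocal,
                          const glm::mat4& MV, const glm::mat4& P,
                          float vpX, float vpY, float vpWidth, float vpHeight)
{
    glm::vec4 vposEye  = MV * glm::vec4(vposLocal, 1.0f);  // modelview
    glm::vec4 vposClip = P * vposEye;                      // projection
    glm::vec3 vposNdc  = glm::vec3(vposClip) / vposClip.w; // perspective divide
    return glm::vec2((vposNdc.x + 1.0f) * 0.5f * vpWidth  + vpX,
                     (vposNdc.y + 1.0f) * 0.5f * vpHeight + vpY);
}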
The OpenGL functions glRotate, glTranslate and glScale merely manipulate the transformation matrices. OpenGL used to have four transformation matrices:
modelview
projection
texture
color
Which of them the matrix manipulation functions act on can be set using glMatrixMode. Each of the matrix manipulating functions composes a new matrix by multiplying the transformation it describes onto the selected matrix, thereby replacing it. The function glLoadIdentity replaces the current matrix with the identity, glLoadMatrix replaces it with a user-defined matrix, and glMultMatrix multiplies a user-defined matrix onto it.
So how does the modelview matrix then emulate both object placement and a camera? Well, as you already stated:
As you know, the same movement can be achieved by either moving objects or camera.
You can not really discern between them. The usual approach is by splitting the object local to eye transformation into two steps:
Object to world – OpenGL calls this the "model transform"
World to eye – OpenGL calls this the "view transform"
Together they form the model-view, in fixed function OpenGL described by the modelview matrix. Now since the order of transformations is
local to world, Model matrix vpos_world = M · vpos_local
world to eye, View matrix vpos_eye = V · vpos_world
we can substitute by
vpos_eye = V · ( M · vpos_local ) = V · M · vpos_local
replacing V · M by the ModelView matrix =: MV
vpos_eye = MV · vpos_local
Thus you can see that what's V and what's M of the compound matrix MV is only determined by the order of operations in which you multiply onto the modelview matrix, and at which step you decide to "call it the model transform from here on".
I.e. right after a
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
the view is defined. But at some point you'll start applying model transformations and everything after is model.
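For example, in fixed-function code the split typically looks like this (a sketch; the numbers are arbitrary):
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();                        // MV = I
gluLookAt(0.0, 0.0, 5.0,                 // eye
          0.0, 0.0, 0.0,                 // center
          0.0, 1.0, 0.0);                // up      -> MV = V (the "view" part)
glTranslatef(1.0f, 0.0f, 0.0f);          // MV = V * T   (from here on, "model")
glRotatef(45.0f, 0.0f, 1.0f, 0.0f);      // MV = V * T * R
// draw the object here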
Note that in modern OpenGL all the matrix manipulation functions have been removed. OpenGL's matrix stack never was feature complete and no serious application actually used it. Most programs just glLoadMatrix-ed their self-calculated matrices and didn't bother with OpenGL's built-in matrix manipulation routines.
And ever since shaders were introduced, the whole OpenGL matrix stack got awkward to use, to say it nicely.
The verdict: If you plan on using OpenGL the modern way, don't bother with the built-in functions. But keep in mind what I wrote, because what your shaders do will be very similar to what OpenGL's fixed function pipeline did.
OpenGL is a low-level API; there are no higher-level concepts like an "object" or a "camera" in a "scene", so there are only two matrix modes that matter here: MODELVIEW (a multiplication of the "camera" matrix by the "object" transformation) and PROJECTION (the projective transformation from eye space to post-perspective space).
Distinction between "Model" and "View" (object and camera) matrices is up to you. glRotate/glTranslate functions just multiply the currently selected matrix by the given one (without even distinguishing between ModelView and Projection).
Those functions multiply (transform) the current matrix set by glMatrixMode() so it depends on the matrix you're working on. OpenGL has 4 different types of matrices; GL_MODELVIEW, GL_PROJECTION, GL_TEXTURE, and GL_COLOR, any one of those functions can change any of those matrices. So, basically, you don't transform objects you just manipulate different matrices to "fake" that effect.
Note that gluLookAt() is just a convenience function equivalent to a translation followed by some rotations; there's nothing special about it.
All transformations are transformations on objects. Even gluLookAt is just a transformation to transform the objects as if the camera was where you tell it to be. Technically they are transformations on the vertices, but that's just semantics.
That's true: glTranslate and glRotate change the object coordinates before rendering, and gluLookAt changes the camera coordinates.