How can I calculate camera position by comparing two photographs? - computer-vision

I'm trying to calculate the camera's position for an image. I have 2 images of a Rubik's cube. The first image is considered to be the base image and the next image is the image after the camera has moved. So for the first image I assume that the camera is at (0,0,0). On this image I then identify the 4 corners of the front face of the Rubik's cube as shown here (the 4 corners are identified by the 4 blue circles).
Then for the next image (after camera movement), I identify the same face of the Rubik's cube as shown here.
So, taking the first image as the base image, does anyone know if/how I can calculate how much the camera has moved for image 2, as shown here:

I would suggest you use OpenCV for this. I also think this question would be better suited to Stack Overflow.
The textbook on this subject is "Multiple View Geometry in Computer Vision" by Hartley and Zisserman: http://www.robots.ox.ac.uk/~vgg/hzbook/ (there is a sample chapter on the fundamental matrix on that website).
Basically: first find the fundamental matrix, then, knowing the intrinsic parameters of the camera, find a solution for the camera position.
Fundamental Matrix: http://en.wikipedia.org/wiki/Fundamental_matrix_%28computer_vision%29
Intrinsic Parameters: stuff like the focal length and where the principal point is on the image plane. If you have F, then E = K^T * F * K, where K is the intrinsic matrix and is the same for both images.
How to find a solution to the camera position: http://en.wikipedia.org/wiki/Essential_matrix#Determining_R_and_t_from_E
Algorithm
This is how I would do it in OpenCV. I have done this before, so it ought to work; a rough Python sketch follows the steps below.
1. Run a feature detector and descriptor extractor on both images.
2. Match the features.
3. Use F = cv::findFundamentalMat with RANSAC.
4. E = K.t() * F * K. // K needs to be found beforehand.
5. Compute the singular value decomposition of E such that E = U * S * V.t().
6. R = U * W.inv() * V.t() // W = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
7. Tx = V * Z * V.t() // Z = [[0, -1, 0], [1, 0, 0], [0, 0, 0]]
8. Get t from Tx (the matrix version of the cross product).
9. Find the correct solution. R.t() and -t are also possibilities.
10. Recover the overall scale by knowing the side length of the Rubik's cube.
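If it helps, here is a rough Python/OpenCV sketch of steps 1-9. The image file names and the intrinsic matrix K are placeholders that you would replace with your own data and calibration; cv2.recoverPose is used as a shortcut for steps 5-9.

import cv2
import numpy as np

# Placeholder intrinsics -- replace with your calibrated K.
K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])

img1 = cv2.imread("base.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("moved.png", cv2.IMREAD_GRAYSCALE)

# Steps 1-2: detect features, extract descriptors, and match them.
orb = cv2.ORB_create(2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# Step 3: fundamental matrix with RANSAC.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)

# Step 4: essential matrix from F and K.
E = K.T @ F @ K

# Steps 5-9: decompose E and keep the solution that puts the points in
# front of both cameras (cv2.recoverPose performs this cheirality check).
inliers = mask.ravel() == 1
_, R, t, _ = cv2.recoverPose(E, pts1[inliers], pts2[inliers], K)
print("R =", R)
print("t (up to scale) =", t.ravel())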
Alternative Solutions
I am certain that a more straightforward approach can also work. The benefit of this approach is that no human input is needed (it is unsupervised). That is not true for the optional step 10 (determining the scale).
A different solution would exploit knowledge of the geometry of the Rubik's cube. For example, six (5.5) points are needed to estimate the position of the camera if the points' 3D positions are known.
Unfortunately, I am not aware of any software that does this for you automatically.
So here is the alternative algorithm:
Write down the coordinates of the corners of the cube as (X_i, Y_i, Z_i), and possibly also points with other knowable positions.
Mark the corresponding points u_i = (x_i, y_i).
For every correspondence create two rows in a matrix A:
(X_i, Y_i, Z_i, 1, 0, 0, 0, 0, -x_i*X_i, -x_i*Y_i, -x_i*Z_i, -x_i)
(0, 0, 0, 0, X_i, Y_i, Z_i, 1, -y_i*X_i, -y_i*Y_i, -y_i*Z_i, -y_i)
Then find p such that Ap = 0, i.e. p is the right null space of A, or the least-squares solution to Ap = 0.
De-flatten p to create a 3x4 matrix P.
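A minimal numpy sketch of that DLT solve, assuming the 3D points and their measured image points are already available (the function name is just for illustration):

import numpy as np

def dlt_projection_matrix(points_3d, points_2d):
    # Build two rows of A per 3D-2D correspondence, as described above.
    A = []
    for (X, Y, Z), (x, y) in zip(points_3d, points_2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -x*X, -x*Y, -x*Z, -x])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -y*X, -y*Y, -y*Z, -y])
    A = np.asarray(A, dtype=float)
    # The least-squares solution to A p = 0 (with ||p|| = 1) is the right
    # singular vector of A belonging to the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)

With K known, the resulting P can then be decomposed into K[R|t], e.g. with cv2.decomposeProjectionMatrix.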

Related

kitti dataset camera projection matrix

I am looking at the KITTI dataset, and in particular at how to convert a world point into image coordinates. I looked at the README, and it says (quoted below) that I need to transform to camera coordinates first and then multiply by the projection matrix. I have 2 questions, coming from a non-computer-vision background:
I looked at the numbers from calib.txt, and in particular the matrix is 3x4 with non-zero values in the last column. I always thought this matrix = K[I|0], where K is the camera's intrinsic matrix. So why is the last column non-zero, and what does it mean? E.g. P2 is
array([[7.070912e+02, 0.000000e+00, 6.018873e+02, 4.688783e+01],
[0.000000e+00, 7.070912e+02, 1.831104e+02, 1.178601e-01],
[0.000000e+00, 0.000000e+00, 1.000000e+00, 6.203223e-03]])
After applying the projection to get [u,v,w] and dividing u,v by w, are these values with respect to an origin at the center of the image, or at the top left of the image?
README:
calib.txt: Calibration data for the cameras: P0/P1 are the 3x4 projection matrices after rectification. Here P0 denotes the left and P1 denotes the right camera. Tr transforms a point from velodyne coordinates into the left rectified camera coordinate system. In order to map a point X from the velodyne scanner to a point x in the i'th image plane, you thus have to transform it like:
x = Pi * Tr * X
Refs:
How to understand the KITTI camera calibration files?
Format of parameters in KITTI's calibration file
http://www.cvlibs.net/publications/Geiger2013IJRR.pdf
Answer:
I strongly recommend you read those references above. They may solve most, if not all, of your questions.
For question 2: the projected points in the image are given with respect to an origin at the top left. See refs 2 and 3: the image coordinates of a very distant 3D point (far along the viewing axis) converge to (center_x, center_y), whose values are provided in the P_rect matrices. You can also verify this with some simple code:
import numpy as np
p = np.array([[7.070912e+02, 0.000000e+00, 6.018873e+02, 4.688783e+01],
[0.000000e+00, 7.070912e+02, 1.831104e+02, 1.178601e-01],
[0.000000e+00, 0.000000e+00, 1.000000e+00, 6.203223e-03]])
x = [0, 0, 1E8, 1] # A far 3D point
y = np.dot(p, x)
y[0] /= y[2]
y[1] /= y[2]
y = y[:2]
print(y)
You will see output close to:
[601.8873 183.1104]
which is quite near (p[0, 2], p[1, 2]), a.k.a. (center_x, center_y).
For all the P matrices (3x4), they represent:
P(i)rect = [[fu 0 cx -fu*bx],
[0 fv cy -fv*by],
[0 0 1 0]]
The last column encodes the baselines, in meters, with respect to the reference camera 0 (scaled by the focal lengths, as shown above). You can see that P0 has all zeros in its last column because it is the reference camera.
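If it is useful, here is a hedged sketch of the full chain from the README (x = Pi * Tr * X), assuming Tr has already been parsed from calib.txt as a 3x4 matrix; the function name is just for illustration:

import numpy as np

def velodyne_to_image(X_velo, P, Tr):
    # Project a 3D point in velodyne coordinates to pixel coordinates (top-left origin).
    Tr_h = np.vstack([Tr, [0.0, 0.0, 0.0, 1.0]])      # extend Tr to a 4x4 homogeneous transform
    X_h = np.append(np.asarray(X_velo, float), 1.0)   # homogeneous 3D point
    u, v, w = P @ Tr_h @ X_h                          # x = Pi * Tr * X
    return u / w, v / w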
This post has more details:
How Kitti calibration matrix was calculated?

Get camera matrix from OpenGL

I render a 3D mesh model using OpenGL with perspective camera – gluPerspective(fov, aspect, near, far).
Then I use rendered image in a computer vision algorithm.
At some point that algorithm requires camera matrix K (along with several vertices on the model and their corresponding projections) in order to estimate camera position: rotation matrix R and translation vector t. I can estimate R and t by using any algorithm which solves Perspective-n-Point problem.
I construct K from the OpenGL projection matrix (see how here)
K = [fX, 0, pX | 0, fY, pY | 0, 0, 1]
If I want to project a model point 'by hand' I can compute:
X_proj = K*(R*X_model + t)
x_pixel = X_proj[1] / X_proj[3]
y_pixel = X_proj[2] / X_proj[3]
Anyway, I pass this camera matrix in a PnP algorithm and it works just fine.
But then I had to change perspective projection to orthographic one.
As far as I understand when using orthographic projection the camera matrix becomes:
K = [1, 0, 0 | 0, 1, 0 | 0, 0, 0]
So I changed gluPerspective to glOrtho. Following the same approach, I constructed K from the OpenGL projection matrix, and it turned out that fX and fY are not ones but 0.0037371. Is this a scaled orthographic projection or what?
Moreover, in order to project model vertices 'by hand' I managed to do the following:
X_proj = K*(R*X_model + t)
x_pixel = X_proj[1] + width / 2
y_pixel = X_proj[2] + height / 2
Which is not what I expected (that plus width and height divided by 2 seems strange...). I tried to pass this camera matrix to the POSIT algorithm to estimate R and t, and it doesn't converge. :(
So here are my questions:
How do I get an orthographic camera matrix from OpenGL?
If the way I did it is correct, is it truly orthographic? And why doesn't POSIT work?
Orthographic projection will not use the depth to scale down farther points. However, it will scale the points to fit inside the NDC (normalized device coordinates), which means it will scale the values to fit inside the range [-1, 1].
The orthographic projection matrix shown on Wikipedia makes this explicit.
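For reference, this should be the matrix that glOrtho(left, right, bottom, top, near, far) builds; the diagonal entries are the scale factors that map the view volume into [-1, 1]:
[2/(right-left), 0, 0, -(right+left)/(right-left)]
[0, 2/(top-bottom), 0, -(top+bottom)/(top-bottom)]
[0, 0, -2/(far-near), -(far+near)/(far-near)]
[0, 0, 0, 1]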
So, it is correct to have numbers other than 1.
As for your way of computing by hand, I believe it is not scaling back to screen coordinates, and that makes it wrong. As I said, the output of the projection matrices will be in the range [-1, 1], and if you want to get pixel coordinates, I believe you should do something similar to this:
X_proj = K*(R*X_model + t)
x_pixel = X_proj[1]*width/2 + width / 2
y_pixel = X_proj[2]*height/2 + height / 2
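A minimal Python sketch of that conversion (using 0-based indexing; K, R, t and the viewport size are assumed to be known, and all names here are placeholders):

import numpy as np

def project_to_pixels(X_model, K, R, t, width, height):
    # Project the model point; with an orthographic K the result is already roughly in [-1, 1].
    X_proj = K @ (R @ X_model + t)
    # Map from [-1, 1] to pixel coordinates.
    x_pixel = X_proj[0] * width / 2 + width / 2
    y_pixel = X_proj[1] * height / 2 + height / 2
    return x_pixel, y_pixel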
Anyway, I think you would be better off using modern OpenGL with a library like GLM. In that case, you have the exact projection matrices used right at hand.

Negative focal length in Camera Calibration Matrix

While trying to work with the ICL-NUIM dataset I have run into some issues. The camera calibration provided on the site has the following values:
481.20, 0, 319.50
0, -480.00, 239.50
0, 0, 1
where:
fx = 481.20
fy = -480.00
cu = 319.50
cv = 239.50
I am struggling to intuitively understand how the fy can have a negative value.
The mathematical effect is simply that the image is vertically inverted.
This is equivalent to the image as it appears in the classical pinhole model, on the back plane of the camera box. In that case both fx and fy would be negative, but you get the idea.
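A quick numerical check of that vertical inversion, using the K from the question (the test point is arbitrary):

import numpy as np

K = np.array([[481.20,    0.00, 319.50],
              [  0.00, -480.00, 239.50],
              [  0.00,    0.00,   1.00]])

X = np.array([0.1, 0.1, 1.0])   # a point above and to the right of the optical axis
u, v, w = K @ X
print(u / w, v / w)             # u > cu (367.62), but v < cv (191.5): the y axis is flipped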

How to find an Equivalent point in a Scaled down image?

I would like to calculate the corner points or contours of the star in a larger image. To do that, I scale the image down to a smaller size, where I am able to get these points clearly. Now, how do I map these points back to the original image? I'm using OpenCV in C++.
Consider a trivial example: the image size is reduced exactly by half.
So the cartesian coordinate (x, y) in the original image becomes coordinate (x/2, y/2) in the reduced image, and coordinate (x', y') in the reduced image corresponds to coordinate (x'*2, y'*2) in the original image.
Of course, fractional coordinates typically get rounded off in the reduced-scale image, so the exact mapping is only possible for even-numbered coordinates in this example's original image.
Generalizing this, if the image is scaled by a factor of w horizontally and h vertically, coordinate (x, y) becomes coordinate (x*w, y*h), rounded off. In the example I gave, both w and h are 1/2, or 0.5.
You should be able to figure out the values of w and h yourself and map the coordinates trivially. Of course, because of the rounding off, you will not always be able to recover the exact coordinates in the original image.
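For instance, a minimal sketch of that mapping (the image sizes here are made-up placeholders):

orig_w, orig_h = 1280, 960      # original image size (placeholder)
small_w, small_h = 640, 480     # reduced image size (placeholder)

def to_original(x_small, y_small):
    # Scale factors from the reduced image back to the original (1/w and 1/h above).
    sx = orig_w / small_w
    sy = orig_h / small_h
    return round(x_small * sx), round(y_small * sy)

print(to_original(100, 50))     # -> (200, 100)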
I realize this is an old question. I just wanted to add to Sam's answer above, to deal with "rounding off", in case other readers are wondering the same thing I faced.
This rounding off becomes obvious for an even number of pixels along a coordinate axis. For instance, along a 1-D axis, a point demarcating the 2nd quartile gets mapped to an inaccurate value:
axis_prev = [0, 1, 2, 3]
axis_new = [0, 1, 2, 3, 4, 5, 6, 7]
w_prev = len(axis_prev) # This is an axis of length 4
w_new = len(axis_new) # This is an axis of length 8
x_prev = 2
x_new = x_prev * w_new / w_prev
print(x_new)
>>> 4.0
### x_new should be 5
In Python, one strategy would be to linearly interpolate values from one axis resolution to another axis resolution. Say for the above, we wish to map a point from the smaller image to its corresponding point of the star in the larger image:
import numpy as np
from scipy.interpolate import interp1d

# Both axes must have the same number of samples for interp1d; this maps
# positions on a 640-pixel-wide axis onto a 768-pixel-wide axis.
x_old = np.linspace(0, 640, 641)
x_new = np.linspace(0, 768, 641)
f = interp1d(x_old, x_new)
x = 35
x_prime = f(x)
print(x_prime)  # -> 42.0

Z Value after Perspective Divide is always less than -1

So I'm writing my own custom 3D transformation pipeline in order to gain a better understanding of how it all works. I can get everything rendering to the screen properly and I'm now about to go back and look at clipping.
From my understanding, I should be clipping a vertex point if the x or y value after the perspective divide is outside the bounds of [-1, 1] and in my case if the z value is outside the bounds of [0, 1].
When I implement that, however, my z value is always -1.xxxxxxxxxxx, where xxxxxxx is a very small number.
This is a bit long, and I apologize, but I wanted to make sure I gave all the information I could.
First conventions:
I'm using a left-handed system where a Matrix looks like this:
[m00, m01, m02, m03]
[m10, m11, m12, m13]
[m20, m21, m22, m23]
[m30, m31, m32, m33]
And my vectors are columns looking like this:
[x]
[y]
[z]
[w]
My camera is set up with:
A vertical FOV in radians of PI/4.
An aspect ratio of 1 (square viewport).
A near clip value of 1.
A far clip value of 1000.
An initial world x position of 0.
An initial world y position of 0.
An initial world z position of -500.
The camera is looking down the positive Z axis (0, 0, 1).
Given a vertex, the pipeline works like this:
Step 1: Multiply the vertex by the camera matrix.
Step 2: Multiply the vertex by the projection matrix.
Projection matrix is:
[2.41421, 0, 0, 0]
[0, 2.41421, 0, 0]
[0, 0, 1.001001, 1]
[0, 0, -1.001001, 0]
Step 3: Multiply the x, y and z components by 1/w.
Step 4: [This is where the problem is] Clip the vertex if outside bounds.
Step 5: Convert to screen coordinates.
An example vertex that I have is
(-100, -100, 0, 1)
After multiplying by the camera matrix I get:
(-100, -100, 500, 1)
Which makes sense because relative to the camera, that vertex is 100 units to the left and down and 500 units ahead. It is also between the near clip of 1 and the far clip of 1000. W is still 1.
After multiplying by the projection matrix I get:
(-241.42135, -241.42135, 601.600600, -600.600600)
I'm not sure this makes sense. The x and y seem to be correct, but I'm iffy about the z and w, since the next step, the perspective divide, gives odd results.
After the perspective divide I get:
(0.401966, 0.401966, -1.001665, 1)
Again the x and y make sense; they are both within the bounds of [-1, 1]. But the z value is clearly outside the bounds, even though I believe it should still be within the frustum. W is back to 1, which again makes sense.
Again apologies for the novel, but I'm hoping someone can help me figure out what I'm doing incorrectly.
Thanks!
OK, it looks like I figured out what the problem was.
My projection matrix was:
[2.41421, 0, 0, 0]
[0, 2.41421, 0, 0]
[0, 0, 1.001001, 1]
[0, 0, -1.001001, 0]
But it really should be transposed and be:
[2.41421, 0, 0, 0]
[0, 2.41421, 0, 0]
[0, 0, 1.001001, -1.001001]
[0, 0, 1, 0]
When using this matrix, my x and y values stay the same as expected, and now my z values are constrained to [0, 1], only exceeding that range if they are outside the near or far clip plane.
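A quick numpy check of the corrected matrix, using the camera-space point from above, (-100, -100, 500, 1):

import numpy as np

P = np.array([[2.41421, 0.0,      0.0,       0.0],
              [0.0,     2.41421,  0.0,       0.0],
              [0.0,     0.0,      1.001001, -1.001001],
              [0.0,     0.0,      1.0,       0.0]])

v = np.array([-100.0, -100.0, 500.0, 1.0])   # vertex in camera space
clip = P @ v                                  # column-vector convention
ndc = clip[:3] / clip[3]                      # perspective divide
print(ndc)                                    # approx [-0.4828, -0.4828, 0.999]: z lands inside [0, 1]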
The only issue now is that I'm quite confused as to whether I'm using a right- or left-handed system.
All I know is that now it works...
I may be out of my league here, but I thought that the purpose of the projection matrix and perspective divide was to discover the 2D position of that point on the screen. In that case, the left-over z value would not necessarily have any meaning any more, since the math is all geared towards finding those two x and y values.
Update: I think I have it figured out. Your math is all correct. The camera and frustum you describe have a near clipping plane at Z=1, so your example point at (-100, -100, 0) is actually outside of the clipping plane, so that z-buffer value of just below -1 makes perfect sense.
Try a sample point with a z-coordinate inside your frustum, say with a z-coordinate of 2.