I am trying to understand the basic principles of 3D reconstruction, and have chosen to play around with OpenMVG. However, I have seen evidence that the following concepts I'm asking about apply to all/most SfM/MVS tools, not just OpenMVG. As such, I suspect any Computer Vision engineer should be able to answer these questions, even if they have no direct OpenMVG experience.
I'm trying to fully understand intrinsic camera parameters, or as they seem to be called, "camera intrinsics" or "intrinsic parameters". According to OpenMVG's documentation, camera intrinsics depend on the type of camera used to take the pictures (i.e., the camera model), and OpenMVG supports five models:
Pinhole: 3 intrinsic parameters (focal, principal point x, principal point y)
Pinhole Radial 1: 4 intrinsic params (focal, principal point x, principal point y, one radial distortion factor)
Pinhole Radial 3: 6 params (focal, principal point x, principal point y, 3 radial distortion factors)
Pinhole Brown: 8 params (focal, principal point x, principal point y, 5 distortion factors (3 radial + 2 tangential))
Pinhole w/ Fish-Eye Distortion: 7 params (focal, principal point x, principal point y, 4 distortion factors)
This is all explained on their wiki page that explains their camera model, which is the subject of my question.
On that page there are several core concepts that I need clarification on:
focal plane: What it is and how does it differ from the image plane (as shown in the diagram at the top of that page)?
focal distance/length: What is it?
principal point: What is it, and why should it ideally be the center of the image?
scale factor: Is this just an estimate of how far the camera is from the image plane?
distortion: What is it and what's the difference between its various subtypes:
radial
tangential
fish-eye
Thanks in advance for any clarification/correction here!
I am unsure about the focal plane, so I will come back to it after I write about the other concepts you mention. Suppose you have a pinhole camera model with rectangular pixels, and let P=[X Y Z]^T be a point in camera space, with ^T denoting the transpose. In that case (assuming Z is the camera axis), this point can be projected as p=KP where K (the calibration matrix) is
f_x 0 c_x
0 f_y c_y
0 0 1
(of course, you will want to divide p by its third coordinate after that).
The focal length, which I will denote f, is the distance between the camera center and the image plane. The variables
f_x=s_x*f
f_y=s_y*f
in the matrix above express this value in units of pixel width and pixel height, respectively. The variables s_x and s_y are the scale factors that are mentioned on the page you cite. The scale factor is the number of pixels (measured along the width or the height) that fit into one unit of camera space. So, for example, if your pixel widths are half the size of the units you use on the x axis of camera space, you will have s_x=2.
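The projection p=KP described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `project_pinhole` and the sample values (f, s_x, s_y, c_x, c_y and the test point) are made up for the example, skew and lens distortion are ignored, and the point is assumed to lie in front of the camera.

```python
def project_pinhole(point, f, s_x, s_y, c_x, c_y):
    """Project a camera-space point (X, Y, Z) to pixel coordinates via p = K P."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    # Focal length expressed in pixel widths and pixel heights.
    f_x = s_x * f
    f_y = s_y * f
    # Rows of K applied to P; the third row gives the homogeneous coordinate Z.
    u_h = f_x * X + c_x * Z
    v_h = f_y * Y + c_y * Z
    w = Z
    # Divide by the third coordinate to get pixel coordinates.
    return (u_h / w, v_h / w)


# Hypothetical numbers: f = 2 units, 500 pixels per unit, principal point (320, 240).
pixel = project_pinhole((0.1, -0.05, 2.0), f=2.0, s_x=500.0, s_y=500.0,
                        c_x=320.0, c_y=240.0)
```

With these numbers f_x = f_y = 1000, so the point lands 50 pixels to the right of and 25 pixels above the principal point (taking y as pointing down).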
I have seen people use the term principal point to refer to different things. While some people define it as the intersection between the camera axis and the image plane (Wikipedia seems to do this), others define it as the point given by [c_x c_y]^T. For clarity's sake, let's separate the whole projection process (with p' denoting p after division by its third coordinate, Z):
p' = [f_x*X/Z f_y*Y/Z 0]^T + [c_x c_y 1]^T
The two terms on the right hand side of the equation do different things. The first one scales the point and puts it into the image plane. The second term (i.e. [c_x c_y 1]^T) shifts the result from the other term. So, [-c_x ,-c_y]^T is the center of the image's coordinate system.
As for the difference between tangential/radial distortion: usually when correcting distortion, we assume that the center of the image o remains undistorted. A pixel p will have "moved away" from its true position q under the effect of distortion. If that movement is along the vector q-o then the distortion is radial, but if that movement has a component in a different direction, it is said to (also) have tangential distortion.
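As a rough illustration of that distinction, the displacement p-q can be split into a component along q-o and a component perpendicular to it. The sketch below assumes 2D pixel coordinates given as tuples; the function name `split_distortion` is hypothetical and not taken from any library.

```python
import math


def split_distortion(o, q, p):
    """Split the displacement p - q into radial and tangential parts.

    o: undistorted image center, q: true position, p: observed position.
    Returns signed lengths (radial, tangential) in pixels.
    """
    rx, ry = q[0] - o[0], q[1] - o[1]
    r = math.hypot(rx, ry)
    if r == 0.0:
        raise ValueError("q coincides with the center; radial direction is undefined")
    ux, uy = rx / r, ry / r            # unit vector along q - o
    dx, dy = p[0] - q[0], p[1] - q[1]  # how far the pixel moved
    radial = dx * ux + dy * uy         # component along q - o
    tangential = -dx * uy + dy * ux    # component perpendicular to q - o
    return radial, tangential


# A pixel pushed straight outwards from the center: purely radial.
outward = split_distortion((0.0, 0.0), (3.0, 4.0), (3.6, 4.8))
# A pixel pushed sideways: purely tangential.
sideways = split_distortion((0.0, 0.0), (3.0, 4.0), (2.2, 4.6))
```

A non-zero second value means the movement has a component that is not along q-o, i.e. the distortion is (also) tangential in the sense used above.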
As I said I'm a bit unsure about what the focal plane they show in their figure means, but I think the term usually refers to the plane on which the upside-down image would form in a physical pinhole camera. A point P on the image plane (expressed in world coordinates) would just be -P on the focal plane.
I want to perform 360° panorama stitching for 6 fish-eye cameras.
In order to find the relation among cameras I need to compute the Homography Matrix. The latter is usually computed by finding features in the images and matching them.
However, for my camera setup I already know:
The intrinsic camera matrix K, which I computed through camera calibration.
Extrinsic camera parameters R and t. The camera orientation is fixed and does not change at any point. The cameras are located on a circle of known diameter d, with each camera shifted by 60° with respect to its neighbour on the circle.
Therefore, I think I could manually compute the Homography Matrix, which I am assuming would result in a more accurate approach than performing feature matching.
In the literature I found the following formula to compute the homography Matrix which relates image 2 to image 1:
H_2_1 = (K_2) * (R_2)^-1 * R_1 * K_1
This formula only takes into account a rotation angle among the cameras but not the translation vector that exists in my case.
How could I plug the translation t of each camera in the computation of H?
I have already tried to compute H without considering the translation, but as d>1 meter, the images are not accurately aligned in the panorama picture.
EDIT:
Based on Francesco's answer below, I got the following questions:
After calibrating the fisheye lenses, I got a matrix K with focal length f=620 for an image of size 1024 x 768. Is that considered to be a big or small focal length?
My cameras are located on a circle with a diameter of 1 meter. The explanation below makes it clear to me that, due to this "big" translation among the cameras, I get noticeable ghosting effects with objects that are relatively close to them. Therefore, if the Homography model cannot fully represent the position of the cameras, is it possible to use another model like the Fundamental/Essential Matrix for image stitching?
You cannot "plug" the translation in: its presence along with a nontrivial rotation mathematically implies that the relationship between images is not a homography.
However, if the imaged scene is and appears "far enough" from the camera, i.e. if the translations between cameras are small compared to the distances of the scene objects from the cameras, and the cameras' focal lengths are small enough, then you may use the homography induced by a pure rotation as an approximation.
Your equation is wrong. The correct formula is obtained as follows:
Take a pixel in camera 1: p_1 = (x, y, 1) in homogeneous coordinates
Back project it into a ray in 3D space: P_1 = inv(K_1) * p_1
Decompose the ray in the coordinates of camera 2: P_2 = R_2_1 * P_1
Project the ray into a pixel in camera 2: p_2 = K_2 * P_2
Put the equations together: p_2 = [K_2 * R_2_1 * inv(K_1)] * p_1
The product H = K_2 * R_2_1 * inv(K_1) is the homography induced by the pure rotation R_2_1. The rotation transforms points into frame 2 from frame 1. It is represented by a 3x3 matrix whose columns are the components of the x, y, z axes of frame 1 decomposed in frame 2. If your setup gives you the rotations of all the cameras with respect to a common frame 0, i.e. as R_i_0, then it is R_2_1 = R_2_0 * transpose(R_1_0).
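The derivation above can be sketched in plain Python as follows. Treat it as a minimal illustration under assumptions rather than a drop-in implementation: matrices are nested lists, K is assumed skew-free (so its inverse has a closed form), the principal point (512, 384) is assumed to sit at the center of the 1024 x 768 image, and the 60 degree rotation about the y axis is a made-up stand-in for the real R_i_0. All helper names are hypothetical.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def camera_matrix(f_x, f_y, c_x, c_y):
    return [[f_x, 0.0, c_x], [0.0, f_y, c_y], [0.0, 0.0, 1.0]]


def inverse_camera_matrix(k):
    """Closed-form inverse of a skew-free calibration matrix."""
    f_x, f_y, c_x, c_y = k[0][0], k[1][1], k[0][2], k[1][2]
    return [[1.0 / f_x, 0.0, -c_x / f_x],
            [0.0, 1.0 / f_y, -c_y / f_y],
            [0.0, 0.0, 1.0]]


def rotation_about_y(angle_deg):
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotation_homography(k_1, k_2, r_1_0, r_2_0):
    """H = K_2 * R_2_1 * inv(K_1), with R_2_1 = R_2_0 * transpose(R_1_0)."""
    r_2_1 = matmul3(r_2_0, transpose3(r_1_0))
    return matmul3(matmul3(k_2, r_2_1), inverse_camera_matrix(k_1))


def apply_homography(h, pixel):
    """Map a pixel of camera 1 into camera 2 (homogeneous divide included)."""
    x, y = pixel
    u, v, w = (h[i][0] * x + h[i][1] * y + h[i][2] for i in range(3))
    return (u / w, v / w)


K = camera_matrix(620.0, 620.0, 512.0, 384.0)
H = rotation_homography(K, K, rotation_about_y(0.0), rotation_about_y(60.0))
mapped = apply_homography(H, (512.0, 384.0))
```

Here `mapped` is where the principal point of camera 1 appears in camera 2 under the pure-rotation approximation. As noted below, such an H is best used as a starting estimate and refined with matched points.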
Generally speaking, you should use the above homography as an initial estimation, to be refined by matching points and optimizing. This is because (a) the homography model itself is only an approximation (since it ignores the translation), and (b) the rotations given by the mechanical setup (even a calibrated one) are affected by errors. Using matched pixels to optimize the transformation will minimize the errors where it matters, on the image, rather than in an abstract rotation space.
I'm working on an Eye Tracking system with two cameras mounted on some kind of glasses. There are optical lenses so that the screen is perceived at around 420 mm from the eye.
From a few dozen pupil samples, we compute two eye models (one for each camera), located in their respective camera coordinates system. This is based on the works here, but modified so that an estimation of the eye center is found using some kind of brute-force approach to minimize the ellipse projection error on the model given its center position in camera space.
Theoretically, an approximation of the camera parameters would be symmetrical to the lenses on the Y axis. So every camera should be at the coordinates (around 17.5mm or -17.5, 0, 3.3) with respect to the lens coordinate system, with a rotation of around 42.5 degrees on the Y axis.
However, with these values, there is an offset in the result. See below:
The red point is the gaze center estimated by the left eye tracker, the white one is the right eye tracker, in screen coordinates
The screen limits are represented by the white lines.
The green line is the gaze vector, in camera coordinates (projected in 2D for visualization)
The two camera centers found, projected in 2D, are in the middle of the eye (the blue circle).
The pupil samples and current pupils are represented by the ellipses with matching colors.
The offset on x isn't constant, which means the rotation on Y is not exact, and the camera positions aren't precise either. In order to fix it, we used: this to calibrate and then this to get the rotation parameters from the rotation matrix.
We added a camera in the middle of the lenses (close to the theoretical 0,0,0 point?) to get the extrinsic and intrinsic parameters of the cameras, relative to our lens center. However, with about 50 checkerboard captures from different positions, the results given by OpenCV don't seem correct.
For example, it gives for a camera a position of about (-14,0,10) in lens coordinates for the translation and something like (-2.38, 49, -2.83) as rotation angles in degrees.
The previous screenshots were taken with these parameters. The theoretical ones are a bit further apart, but are more likely to reach the screen borders, unlike the OpenCV values.
This is probably because the test camera is in front of the optic, not behind, where our real 0,0,0 would be located (we just add the distance at which the screen is perceived on the Z axis afterwards, which is 420mm).
However, we have no way to put the camera in (0, 0, 0).
As the system is compact (everything is captured within a few cm^2), each degree or millimeter can change the result drastically, so without precise values for the cameras, we're a bit stuck.
Our objective here is to find an accurate way to get the extrinsic and intrinsic parameters of each camera, so that we can compute a precise position of the center of the eye of the person wearing the glasses, without any calibration procedure other than looking around (so no fixation points).
Right now, the system is precise enough to give a global indication of where someone is looking on the screen, but there is a divergence between the right and left cameras, so it's not precise enough. Any advice or hint that could help us is welcome :)
I am working on building a 3D point cloud from feature matching using OpenCV 3.1 and OpenGL.
I have implemented 1) Camera Calibration (Hence I am having Intrinsic Matrix of the camera) 2) Feature extraction( Hence I have 2D points in Pixel Coordinates).
I went through a few websites, but they generally describe the flow for converting 3D object points to pixel points, whereas I am doing the reverse: a backward projection. Here is the ppt that explains it well.
I have computed film coordinates (u,v) from pixel coordinates (x,y) with the help of the intrinsic matrix. Can anyone shed light on how I can recover the "Z" of the camera coordinate (X,Y,Z) from the film coordinates?
Please guide me on how I can use OpenCV functions like solvePnP, recoverPose, findFundamentalMat, findEssentialMat for this goal.
With single camera and rotating object on fixed rotation platform I would implement something like this:
Each camera has resolution xs,ys and field of view FOV defined by two angles FOVx,FOVy so either check your camera data sheet or measure it. From that and perpendicular distance (z) you can convert any pixel position (x,y) to 3D coordinate relative to camera (x',y',z'). So first convert pixel position to angles:
ax = (x - (xs/2)) * FOVx / xs
ay = (y - (ys/2)) * FOVy / ys
and then compute cartesian position in 3D:
x' = distance * tan(ax)
y' = distance * tan(ay)
z' = distance
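The two steps above (pixel to angles, then angles to a 3D position) can be sketched as below. This is a minimal example under assumptions: angles are supplied in degrees, the linear pixel-to-angle mapping from the formulas above is used as-is (so lens distortion is ignored), and the function name and sample resolution/FOV values are invented for illustration.

```python
import math


def pixel_to_camera_xyz(x, y, xs, ys, fov_x_deg, fov_y_deg, distance):
    """Convert pixel (x, y) at a known perpendicular distance to (x', y', z')."""
    # Pixel position -> viewing angles (linear approximation across the FOV).
    ax = math.radians((x - xs / 2.0) * fov_x_deg / xs)
    ay = math.radians((y - ys / 2.0) * fov_y_deg / ys)
    # Viewing angles -> cartesian position relative to the camera.
    return (distance * math.tan(ax), distance * math.tan(ay), distance)


# Hypothetical camera: 640x480 pixels, 60 x 45 degree field of view.
center = pixel_to_camera_xyz(320, 240, 640, 480, 60.0, 45.0, distance=2.0)
right_edge = pixel_to_camera_xyz(640, 240, 640, 480, 60.0, 45.0, distance=2.0)
```

The center pixel maps to a point straight ahead of the camera, while the right edge maps to a point at half the horizontal FOV (30 degrees here) off axis.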
That is nice, but in a typical image we do not know the distance. Luckily, in such a setup, if we turn the object then any convex edge will reach a maximum ax angle on the sides when it crosses the plane perpendicular to the camera. So check a few frames, and if a maximal ax is detected you can assume it is an edge (or convex bump) of the object positioned at distance.
If you also know the rotation angle ang of your platform (relative to your camera), then you can compute the un-rotated position by using the rotation formula around the y axis (Ay matrix in the link) and the known platform center position relative to the camera (just a subtraction before the un-rotation)... As I mentioned, all this is just simple geometry.
In a nutshell:
obtain calibration data
FOVx,FOVy,xs,ys,distance. Some camera datasheets list only FOVx, but if the pixels are square you can estimate FOVy from the resolution as
FOVx/FOVy = xs/ys
Beware with Multi resolution camera modes the FOV can be different for each resolution !!!
extract the silhouette of your object in the video for each frame
you can subtract the background image to ease up the detection
obtain platform angle for each frame
so either use IRC data or place known markers on the rotation disc and detect/interpolate...
detect ax maximum
just inspect the x coordinate of the silhouette (for each y line of the image separately) and if a peak is detected add its 3D position to your model. Let's assume a rotating rectangular box. Some of its frames could look like this:
So inspect one horizontal line on all frames and find the maximal ax. To improve accuracy you can use closed-loop regulation, turning the platform until the peak is found "exactly". Do this for all horizontal lines separately.
btw. if you detect no ax change over a few frames, that means a circular shape with the same radius ... so you can handle each such frame as an ax maximum.
Easy as pie resulting in 3D point cloud. Which you can sort by platform angle to ease up conversion to mesh ... That angle can be also used as texture coordinate ...
But do not forget that you will lose some concave details that are hidden in the silhouette !!!
If this approach is not enough you can use this same setup for stereoscopic 3D reconstruction. Because each rotation behaves as new (known) camera position.
You can't, if all you have is 2D images from that single camera location.
In theory you could use heuristics to infer a Z stacking. But mathematically your problem is underdetermined, and there are literally infinitely many different Z coordinates that would satisfy your constraints. You have to supply some extra information. For example you could move your camera around over several frames (Google "structure from motion"), or you could use multiple cameras, or use a camera that has a depth sensor and gives you complete XYZ tuples (Kinect or similar).
Update due to comment:
For every pixel in a 2D image there is an infinite number of points that are projected to it. The technical term for that set of points is a ray. If you have two 2D images of about the same volume of space, each image's set of rays (one for each pixel) intersects with the set of rays corresponding to the other image. Which is to say, if you determine the ray for a pixel in image #1, this maps to a line of pixels covered by that ray in image #2. Selecting a particular pixel along that line in image #2 will give you the XYZ tuple for that point.
Since you're rotating the object by a certain angle θ about a certain axis a between images, you actually have a lot of images to work with. All you have to do is derive the camera location by an additional transformation (inverse(translate(-a)·rotate(θ)·translate(a))).
Then do the following: Select an image to start with. For the particular pixel you're interested in, determine the ray it corresponds to. For that, simply assume two Z values for the pixel; 0 and 1 work just fine. Transform them back into the space of your object, then project them into the view space of the next camera you chose to use; the result will be two points in the image plane (possibly outside the limits of the actual image, but that's not a problem). These two points define a line within that second image. Find the pixel along that line that matches the pixel on the first image you selected and project that back into space as done with the first image. Due to numerical round-off errors you're not going to get a perfect intersection of the rays in 3D space, so find the point where the rays are closest to each other (this amounts to minimizing a quadratic function, which reduces to a small linear system and is trivial).
To select which pixel you want to match between images you can use some feature motion tracking algorithm, as used in video compression or similar. The basic idea is that, for every pixel, a correlation of its surroundings is performed with the same region in the previous image. Where the correlation peaks is where the pixel most likely moved from.
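The last step, finding where two nearly intersecting rays come closest, can be sketched as follows. This is a minimal illustration assuming each ray is given as an origin and a direction in the same 3D frame; the function name `closest_point_between_rays` is hypothetical, and the midpoint of the shortest connecting segment is used as the reconstructed point (one common choice, not the only one).

```python
def closest_point_between_rays(p1, d1, p2, d2, eps=1e-12):
    """Return the midpoint of the shortest segment between two 3D lines.

    Line 1 is p1 + t*d1, line 2 is p2 + s*d2 (tuples of 3 floats).
    """
    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    w0 = tuple(p1[i] - p2[i] for i in range(3))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; no unique closest point")
    # Solve the 2x2 linear system that minimizes the squared distance.
    t = (b * e - c * d) / denom
    s = (a * e - b * d) / denom
    q1 = tuple(p1[i] + t * d1[i] for i in range(3))
    q2 = tuple(p2[i] + s * d2[i] for i in range(3))
    return tuple((q1[i] + q2[i]) / 2.0 for i in range(3))


# Two skew lines that pass within 1 unit of each other.
point = closest_point_between_rays((0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                                   (2.0, -3.0, 1.0), (0.0, 1.0, 0.0))
```

In this example the closest points are (2, 0, 0) on the first line and (2, 0, 1) on the second, so the returned midpoint lies halfway between them.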
With this pixel tracking in place you can then derive the structure of the object. This is essentially what structure from motion does.
I'm working with openGL but this is basically a math question.
I'm trying to calculate the projection matrix, I have a point on the view plane R(x,y,z) and the Normal vector of that plane N(n1,n2,n3).
I also know that the eye is at (0,0,0) which I guess in technical terms its the Perspective Reference Point.
How can I arrive at the perspective projection from this data? I know how to do it the regular way, where you are given the FOV, aspect ratio and near and far planes.
I think you created a bit of confusion by putting this question under the "opengl" tag. The problem is that in computer graphics, the term projection is not understood in a strictly mathematical sense.
In maths, a projection is defined (and the following is not the exact mathematical definition, but just my own paraphrasing) as something which doesn't further change the results when applied twice. Think about it. When you project a point in 3d space to a 2d plane (which is still in that 3d space), each point's projection will end up on that plane. But points which already are on this plane aren't moving at all any more, so you can apply this as many times as you want without changing the outcome any further.
The classic "projection" matrices in computer graphics don't do this. They transform the space in a way that a general frustum is mapped to a cube (or cuboid). For that, you basically need all the parameters to describe the frustum, which typically are aspect ratio, field of view angle, and distances to the near and far planes, as well as the projection direction and the center point (the latter two are typically implicitly defined by convention). For the general case, there are also the horizontal and vertical asymmetry components (think of it like "lens shift" with projectors). And all of that is what the typical projection matrix in computer graphics represents.
To construct such a matrix from the parameters you have given is not really possible, because you are lacking lots of parameters. Also - and I think this is kind of revealing - you have given a view plane. But the projection matrices discussed so far do not define a view plane - any plane parallel to the near or far plane and in front of the camera can be imagined as the viewing plane (behind the camera would also work, but the image would be mirrored), if you should need one. But in the strict sense, it would only be a "view plane" if all of the projected points also ended up on that plane - which the computer graphics perspective matrix explicitly doesn't do. It instead keeps their 3d distance information - which also means that the operation is invertible, while a classical mathematical projection typically isn't.
From all of that, I simply guess that what you are looking for is a perspective projection from 3D space onto a 2D plane, as opposed to a perspective transformation used for computer graphics. And all the parameters you need for that are just the view point and a plane. Note that this is exactly what you have given: the projection center shall be the origin, and R and N define the plane.
Such a projection can also be expressed in terms of a 4x4 homogeneous matrix. There is one thing that is not defined in your question: the orientation of the normal. I'm assuming standard maths convention again and assume that the view plane is defined as <N,x> + d = 0. Substituting R into that equation gives d = -N_x*R_x - N_y*R_y - N_z*R_z. So the projection matrix is just
( 1 0 0 0 )
( 0 1 0 0 )
( 0 0 1 0 )
(-N_x/d -N_y/d -N_z/d 0 )
There are a few properties of this matrix. There is a zero column, so it is not invertible. Also note that for every point (s*x, s*y, s*z, 1) you apply this to, the result (after division by the resulting w, of course) is just the same no matter what s is - so every point on a line between the origin and (x,y,z) will result in the same projected point - which is what a perspective projection is supposed to do. And finally note that w=(N_x*x + N_y*y + N_z*z)/-d, so for every point fulfilling the above plane equation, w = -d/-d = 1 will result. In combination with the identity transform for the other dimensions, this just means that such a point is unchanged.
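These properties are easy to check numerically. The sketch below builds the matrix above from R and N and applies it to a few points. It assumes the eye is at the origin and the plane does not pass through it (d != 0); the function names and the sample plane z = 2 are invented for the example.

```python
def plane_projection_matrix(r, n):
    """4x4 perspective projection onto the plane <N,x> + d = 0, eye at the origin."""
    d = -(n[0] * r[0] + n[1] * r[1] + n[2] * r[2])
    if d == 0:
        raise ValueError("the plane passes through the eye; projection is undefined")
    return [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [-n[0] / d, -n[1] / d, -n[2] / d, 0.0]]


def project(m, point):
    """Apply the matrix to (x, y, z, 1) and divide by the resulting w."""
    v = (point[0], point[1], point[2], 1.0)
    out = [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]
    if out[3] == 0:
        raise ValueError("point lies in the plane through the eye parallel to the view plane")
    return (out[0] / out[3], out[1] / out[3], out[2] / out[3])


# Hypothetical view plane z = 2: R = (0, 0, 2), N = (0, 0, 1).
M = plane_projection_matrix((0.0, 0.0, 2.0), (0.0, 0.0, 1.0))
a = project(M, (1.0, 1.0, 4.0))   # a point behind the plane
b = project(M, (2.0, 2.0, 8.0))   # another point on the same line through the eye
c = project(M, a)                 # projecting again changes nothing
```

All three results coincide and lie on the plane z = 2, which matches the "same result for every s" and "points on the plane are unchanged" observations.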
Projection matrix must be at (0,0,0) and viewing in Z+ or Z- direction
this is a must because many things in OpenGL depend on it, like fog, lighting ... So if your direction or position is different, you need to move that part into the camera matrix. Let's assume your focal point is (0,0,0) as you stated and the normal vector is (0,0,+/-1)
Z near
is the distance between focal point and projection plane so znear is perpendicular distance of plane and (0,0,0). If assumption is correct then
znear=R.z
otherwise you need to compute that. I think you got everything you need for it
cast line from R with direction N
find closest point to focal point (0,0,0)
and then the z near is the distance of that point to R
Z far
is determined by the depth buffer bit width and z near
zfar=znear*(1<<(cDepthBits-1))
this is the maximal usable zfar (for my purposes). If you need more precision then lower it a bit. Do not forget that precision is higher near znear and much, much worse near zfar. The zfar is usually set to the max view distance and znear computed from it or set to the min focus range.
view angle
I use mostly 60 degree view. zang=60.0 [deg]
Common males in my region can see up to 90 degrees, but that includes peripheral vision; the 60 degree view is more comfortable to look at.
Females have a slightly wider view ... but I have never heard any complaints from them about 60 degree views, so let's assume it is comfortable for them too...
Aspect
aspect ratio is determined by your OpenGL window dimensions xs,ys
aspect=(xs/ys)
This is how I set the projection matrix:
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(zang/aspect,aspect,znear,zfar);
// gluPerspective has an inaccurate tangent, so correct the perspective matrix like this:
const double deg=M_PI/180.0; // degrees -> radians (M_PI from <math.h>)
double perspective[16];
glGetDoublev(GL_PROJECTION_MATRIX,perspective);
perspective[ 0]= 1.0/tan(0.5*zang*deg);
perspective[ 5]=aspect/tan(0.5*zang*deg);
glLoadMatrixd(perspective);
Here deg = M_PI/180.0 converts degrees to radians, and perspective is a copy of the projection matrix that I also use for mouse position conversions etc ...
If you do not correct the matrix then you will be off when using advanced things like overlapping several frustums to get a high precision depth range. I use this to obtain a <0.1m,1000AU> frustum with a 24bit depth buffer, and the inaccuracy would cause the images not to fit together perfectly ...
[Notes]
if the focal point is not really (0,0,0) or you are not viewing along the Z axis (for example, you have no camera matrix and use the projection matrix for that instead), then with basic scenes/techniques you will see no problem. The problems start with the use of advanced graphics. If you use GLSL you can handle this without problems, but the fixed-function OpenGL pipeline cannot handle it properly. This is also called PROJECTION_MATRIX abuse
[edit1] few links
If your view is a standard frustum then write the matrix yourself (see gluPerspective); otherwise look here: Projections, for some ideas on how to construct it
[edit2]
From your comment I see it like this:
f is your viewing point (the axes are the global world axes)
f' is the viewing point if R were the center of the screen
so create a projection matrix for the f' position (as explained above), then create a transform matrix to transform f' to f. The transformed f must have the same Z axis as f'; the other axes can be obtained by cross product. Use that as the camera matrix, or multiply both together and use the result as an abused Projection matrix
How to construct the matrix is explained in the Understanding transform matrices link from my earlier comments
I'd like to know how I can go about calculating the angle of some pixel in a photo relative to the webcam that I'm using. I'm new to this sort of thing and I'm using a webcam. Essentially, I take a photo, process it, and I end up with a pixel value in the image that is what I'm looking for. I then need to somehow turn that pixel value into some meaningful quantity---I need to find a line/vector that passes through the pixel and the camera. I don't need magnitude, just phase.
How does one go about doing this? Is camera calibration necessary? I've been reading a bit about it but am unsure.
Thanks
You don't need to know the distance to the object, only the resolution and angle of view of the camera.
Computing the angle requires only simple linear interpolation. For example, let's assume a camera with a resolution of 1920x1080 that covers a 45 degree angle of view across the diagonal.
In this case, sqrt(1920^2 + 1080^2) gives about 2202.91 pixels along the diagonal. That means each pixel represents roughly 45/2202.91 = 0.0204 degrees.
So, compute the distance from the center (in pixels), multiply by 0.0204, and you have its angle from the center (for that camera -- for yours, you'll obviously have to use its resolution and angle of view).
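As a small sketch of that linear approximation (assuming negligible lens distortion and a field of view specified across the diagonal, as in the example above), the computation could look like this in Python. The function name is hypothetical.

```python
import math


def approx_angle_from_center(x, y, width, height, diagonal_fov_deg):
    """Approximate angle (degrees) between pixel (x, y) and the optical axis."""
    diagonal_pixels = math.hypot(width, height)
    degrees_per_pixel = diagonal_fov_deg / diagonal_pixels
    # Distance of the pixel from the image center, in pixels.
    r = math.hypot(x - width / 2.0, y - height / 2.0)
    return r * degrees_per_pixel


at_center = approx_angle_from_center(960, 540, 1920, 1080, 45.0)
at_corner = approx_angle_from_center(0, 0, 1920, 1080, 45.0)
```

A corner pixel comes out at half the diagonal angle of view, and the image center at zero, which is a quick sanity check on the per-pixel factor.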
Of course, this is somewhat approximate -- its accuracy will depend on how much distortion the lens has. With a zoom lens (especially wider angle) you can generally count on that being fairly high. With a fixed focal length lens (especially if it doesn't cover an angle wider than 90 degrees or so) it'll usually be pretty low.
If you want to improve accuracy, you can start by taking a picture of a flat rectangle with straight lines just inside the angle of view of the camera, then compute the distortion based on the deviation from perfectly straight in the resulting picture. If you're working with an extremely wide angle lens, this may be nearly essential. With a lens covering a narrower angle of view (especially, as already mentioned, if it's fixed focal length) it's rarely likely to be worthwhile (such lenses often have only a fraction of a percent of distortion).
Recipe:
1 - Calibrate the camera, obtaining the camera matrix K and distortion parameters D. In OpenCV this is done as described in this tutorial.
2 - Remove the nonlinear distortion from the pixel positions of interest. In OpenCV this is done using the undistortPoints routine, without passing arguments R and P.
3 - Back-project the pixels of interest into rays (unit vectors with the tail at the camera center) in camera 3D coordinates, by multiplying their pixel positions in homogeneous coordinates times the inverse of the camera matrix.
4 - The angle you want is the angle between the above vectors and (0, 0, 1), the vector associated to the camera's focal axis.
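Steps 3 and 4 of the recipe can be sketched without any library, assuming the pixel has already been undistorted (step 2) and that K is skew-free, so that multiplying by its inverse reduces to subtracting the principal point and dividing by the focal lengths. The function name and the calibration values below are made up for illustration; real values would come from step 1.

```python
import math


def pixel_angle_deg(u, v, f_x, f_y, c_x, c_y):
    """Angle (degrees) between the ray through pixel (u, v) and the focal axis."""
    # Step 3: back-project, i.e. inv(K) * (u, v, 1) for a skew-free K.
    x = (u - c_x) / f_x
    y = (v - c_y) / f_y
    z = 1.0
    # Step 4: angle between (x, y, z) and (0, 0, 1); atan2 is stable near 0.
    return math.degrees(math.atan2(math.hypot(x, y), z))


# Hypothetical calibration: f_x = f_y = 800 pixels, principal point (320, 240).
on_axis = pixel_angle_deg(320.0, 240.0, 800.0, 800.0, 320.0, 240.0)
off_axis = pixel_angle_deg(1120.0, 240.0, 800.0, 800.0, 320.0, 240.0)
```

The principal point gives 0 degrees, and a pixel exactly one focal length (in pixels) away from it gives 45 degrees.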