Unstable values in ArUco pose estimation - c++

I'm trying to find the orientation of the camera using Aruco marker. Euler angles extracted from the rotation matrix are unstable beyond a certain point.
As the distance of the camera increases from the marker, the yaw angle values of camera is just unstable. The "Z" axis on the marker flips.
The euler angles are jittery, not the same in every frame and take time to stabilize. How do I obtain some reliable values of the yaw angle and distance between the camera and marker?
I am trying to find the pose of moving camera w.r.t a static marker.
I implemented solvePnP and solvePnPRansac both yielding in unstable results.
The rotation matrix obtained after converting rotation vectors from estimatePoseSingleMarker seems alright up to a certain point but loses stability.
How do I go about this?
Thank you

In general, you won't get accurate camera pose estimation from a single marker. The solution is to add more markers. You could use either a marker board, or a more sparse pattern of markers.
As a single marker gets further from the camera, several factors work to reduce the accuracy of the marker pose estimate.
the projected size of the marker becomes smaller and more quantized by the pixel grid. Distance is estimated by inverse perspective division, so it becomes less accurate as distance increases.
perspective distortion reduces, approaching a parallel projection. In a parallel projection the marker has two equally viable orientations, which may be returned alternately (see https://en.wikipedia.org/wiki/Necker_cube). The orientation of the marker relative to the camera is also significant - in more perpendicular views of the marker (orthographic projection), pitch and yaw of the marker are ambiguous, compared to oblique views. Reduced perspective distortion with distance makes this effect worse, and will cause the calculated camera pose to yaw, pitch, and move laterally.
given the smaller number of pixels in the marker, small scale effects such as sensor noise and quantization become more significant, reducing stability from frame to frame and causing jitter.
As you have discovered, pose estimation works OK in close-up, oblique views of a single marker, because the projected points given to solvePnP() are far apart and have large perspective distortion. By adding more markers, you always have ideal projected points for solvePnP().

Related

Invalid cameras calibration for an head mounter Eye Tracking system

I'm working on an Eye Tracking system with two cameras mounted on some kind of glasses. There are optical lenses so that the screen is perceived at around 420 mm from the eye.
From a few dozen pupil samples, we compute two eye models (one for each camera), located in their respective camera coordinates system. This is based on the works here, but modified so that an estimation of the eye center is found using some kind of brute-force approach to minimize the ellipse projection error on the model given its center position in camera space.
Theorically, an approximation of the cameras parameters would be symetrical to the lenses on the Y axis. So every camera should be at the coordinates (around 17.5mm or -17.5, 0, 3.3) with respect to the lenses coordinates system, a rotation of around 42.5 degrees on the Y axis.
With the However, with these values, there is an offset in the result. See below:
The red point is the gaze center estimated by the left eye tracker, the white one is the right eye tracker, in screen coordinates
The screen limits are represented by the white lines.
The green line is the gaze vector, in camera coordinates (projected in 2D for visualization)
The two camera centers found, projected in 2D, are in the middle of the eye (the blue circle).
The pupil samples and current pupils are represented by the ellipses with matching colors.
The offset on x isn't constant which mean the rotation on Y is not exact. and the position of the camera aren't precise too. In order to fix it, we used: this to calibrate and then this to get the rotation parameters from the rotation matrix.
We added a camera on the middle of the lenses (Close to the theorical 0,0,0 point ?) to get the extrinsics and intrinsic parameters of the cameras, relative to our lens center. However, with about 50 checkerboard captures from different positions, the results given by OpenCV doesn't seems correct.
For example, it gives for a camera a position of about (-14,0,10) in lens coordinates for the translation and something like (-2.38, 49, -2.83) as rotation angles in degrees.
The previous screenshots are taken with theses parameters. The theorical ones are a bit further apart, but are more likely to reach the screen borders, unlike the opencv value.
This is probably because the test camera is in front of the optic, not behind, where our real 0,0,0 would be located (we just add the distance at which the screen is perceived on the Z axis afterwards, which is 420mm).
However, we have no way to put the camera in (0, 0, 0).
As the system is compact (everything is captured within a few cm^2), each degree or millimeter can change the result drastically so without the precise value the cameras, we're a bit stuck.
Our objective here is to find an accurate way to get the extrinsic and intrisic parameters of each cameras, so that we can compute a precise position of the center of the eye of the person wearing the glasses, without other calibration procedure than looking around (so no fixation points)
Right now, the system is precise enough so that we get a global indication on where someone is looking on the screen,but there is a divergence between the right and left camera, it's not precise enough. Any advice or hint that could help us is welcome :)

Film coordinate to world coordinate

I am working on building 3D point cloud from features matching using OpenCV3.1 and OpenGL.
I have implemented 1) Camera Calibration (Hence I am having Intrinsic Matrix of the camera) 2) Feature extraction( Hence I have 2D points in Pixel Coordinates).
I was going through few websites but generally all have suggested the flow for converting 3D object points to pixel points but I am doing completely backword projection. Here is the ppt that explains it well.
I have implemented film coordinates(u,v) from pixel coordinates(x,y)(With the help of intrisic matrix). Can anyone shed the light on how I can render "Z" of camera coordinate(X,Y,Z) from the film coordinate(x,y).
Please guide me on how I can utilize functions for the desired goal in OpenCV like solvePnP, recoverPose, findFundamentalMat, findEssentialMat.
With single camera and rotating object on fixed rotation platform I would implement something like this:
Each camera has resolution xs,ys and field of view FOV defined by two angles FOVx,FOVy so either check your camera data sheet or measure it. From that and perpendicular distance (z) you can convert any pixel position (x,y) to 3D coordinate relative to camera (x',y',z'). So first convert pixel position to angles:
ax = (x - (xs/2)) * FOVx / xs
ay = (y - (ys/2)) * FOVy / ys
and then compute cartesian position in 3D:
x' = distance * tan(ax)
y' = distance * tan(ay)
z' = distance
That is nice but on common image we do not know the distance. Luckily on such setup if we turn our object than any convex edge will make an maximum ax angle on the sides if crossing the perpendicular plane to camera. So check few frames and if maximal ax detected you can assume its an edge (or convex bump) of object positioned at distance.
If you also know the rotation angle ang of your platform (relative to your camera) Then you can compute the un-rotated position by using rotation formula around y axis (Ay matrix in the link) and known platform center position relative to camera (just subbstraction befor the un-rotation)... As I mention all this is just simple geometry.
In an nutshell:
obtain calibration data
FOVx,FOVy,xs,ys,distance. Some camera datasheets have only FOVx but if the pixels are rectangular you can compute the FOVy from resolution as
FOVx/FOVy = xs/ys
Beware with Multi resolution camera modes the FOV can be different for each resolution !!!
extract the silhouette of your object in the video for each frame
you can subbstract the background image to ease up the detection
obtain platform angle for each frame
so either use IRC data or place known markers on the rotation disc and detect/interpolate...
detect ax maximum
just inspect the x coordinate of the silhouette (for each y line of image separately) and if peak detected add its 3D position to your model. Let assume rotating rectangular box. Some of its frames could look like this:
So inspect one horizontal line on all frames and found the maximal ax. To improve accuracy you can do a close loop regulation loop by turning the platform until peak is found "exactly". Do this for all horizontal lines separately.
btw. if you detect no ax change over few frames that means circular shape with the same radius ... so you can handle each of such frame as ax maximum.
Easy as pie resulting in 3D point cloud. Which you can sort by platform angle to ease up conversion to mesh ... That angle can be also used as texture coordinate ...
But do not forget that you will lose some concave details that are hidden in the silhouette !!!
If this approach is not enough you can use this same setup for stereoscopic 3D reconstruction. Because each rotation behaves as new (known) camera position.
You can't, if all you have is 2D images from that single camera location.
In theory you could use heuristics to infer a Z stacking. But mathematically your problem is under defined and there's literally infinitely many different Z coordinates that would evaluate your constraints. You have to supply some extra information. For example you could move your camera around over several frames (Google "structure from motion") or you could use multiple cameras or use a camera that has a depth sensor and gives you complete XYZ tuples (Kinect or similar).
Update due to comment:
For every pixel in a 2D image there is an infinite number of points that is projected to it. The technical term for that is called a ray. If you have two 2D images of about the same volume of space each image's set of ray (one for each pixel) intersects with the set of rays corresponding to the other image. Which is to say, that if you determine the ray for a pixel in image #1 this maps to a line of pixels covered by that ray in image #2. Selecting a particular pixel along that line in image #2 will give you the XYZ tuple for that point.
Since you're rotating the object by a certain angle θ along a certain axis a between images, you actually have a lot of images to work with. All you have to do is deriving the camera location by an additional transformation (inverse(translate(-a)·rotate(θ)·translate(a)).
Then do the following: Select a image to start with. For the particular pixel you're interested in determine the ray it corresponds to. For that simply assume two Z values for the pixel. 0 and 1 work just fine. Transform them back into the space of your object, then project them into the view space of the next camera you chose to use; the result will be two points in the image plane (possibly outside the limits of the actual image, but that's not a problem). These two points define a line within that second image. Find the pixel along that line that matches the pixel on the first image you selected and project that back into the space as done with the first image. Due to numerical round-off errors you're not going to get a perfect intersection of the rays in 3D space, so find the point where the ray are the closest with each other (this involves solving a quadratic polynomial, which is trivial).
To select which pixel you want to match between images you can use some feature motion tracking algorithm, as used in video compression or similar. The basic idea is, that for every pixel a correlation of its surroundings is performed with the same region in the previous image. Where the correlation peaks is, where it likely was moved from into.
With this pixel tracking in place you can then derive the structure of the object. This is essentially what structure from motion does.

Calculate Camera Position from Homography Decomposition

I have a reference image A with a known position and I want to calculate the relative position of the camera at image B (i.e. tx, ty, tz in meters). The images are taken with the same camera so the camera matrix stays the same. I'm using SIFT to detect and compute the keypoints and descriptors in both images and match them with FLANN. From there I can get the homography matrix which I decompose with cv::decomposeHomography(..). This function is based on this paper: PDF.
In this paper it is stated, that the translation matrix is normalized by d*, which is the plane depth.
In order to get the correct translation I need to know the plane depth. Is there a way to get this without knowing the size of an object found in the image?
The 3D translation computed using homography decomposition is only computable up to an unknown scale factor. This is a classical problem with computing 3D geometry from monocular images using only apparent motion in the images. Typically 3D reconstructions from monocular images are called metric reconstructions for this reason (rather than Euclidean reconstructions where scale is resolved). To resolve the scale factor some more information is needed, such as knowing the depth of a point on the plane or the distance moved by the camera between images.

Matching top view human detections with floor projection on interactive floor project

I'm building an interactive floor. The main idea is to match the detections made with a Xtion camera with objects I draw in a floor projection and have them following the person.
I also detect the projection area on the floor which translates to a polygon. the camera can detect outside the "screen" area.
The problem is that the algorithm detects the the top most part of the person under it using depth data and because of the angle between that point and the camera that point isn't directly above the person's feet.
I know the distance to the floor and the height of the person detected. And I know that the camera is not perpendicular to the floor but I don't know the camera's tilt angle.
My question is how can I project that 3D point onto the polygon on the floor?
I'm hoping someone can point me in the right direction. I've been reading about camera projections but I'm not seeing how to use it in this particular problem.
Thanks in advance
UPDATE:
With the awnser from Diego O.d.L I was able to get an almost perfect detection. I'll write the steps I used for those who might be looking for the same solution (I won't get into much detail on how detection is made):
Step 1 : Calibration
Here I get some color and depth frames from the camera, using openNI, with the projection area cleared.
The projection area is detected on the color frames.
I then convert the detection points to real world coordinates (using OpenNI's CoordinateConverter). With the new real world detection points I look for the plane that better fits them.
Step 2: Detection
I use the detection algorithm to get new person detections and to track them using the depth frames.
These detection points are converted to real world coordinates and projected to the plane previously computed. This corrects the offset between the person's height and the floor.
The points are mapped to screen coordinates using a perspective transform.
Hope this helps. Thank you again for the awnsers.
Work with the camera coordinate system initially. I'm assuming you don't have problems converting from (row,column,distance) to a real world system aligned with the camera axis (x,y,z):
calculate the plane with three or more points (for robustness) with
the camera projection (x,y,z). (choose your favorite algorithm,
i.e
Then Find the projection of your head point to the floor plane
(example)
Finally, you can convert it to the floor coordinate system or just
keep it in the camera system
From the description of your intended application, it is probably more useful for you to recover the image coordinates, I guess.
This type of problems usually benefits from clearly defining the variables.
In this case, you have a head at physical position {x,y,z} and you want the ground projection {x,y,0}. That's trivial, but your camera gives you {u,v,d} (d being depth) and you need to transform that to {x,y,z}.
The easiest solution to find the transform for a given camera positioning may be to simply put known markers on the floor at {0,0,0}, {1,0,0}, {0,1,0} and see where they pop up in your camera.

OpenCV: Calculate angle between camera and pixel

I'd like to know how I can go about calculating the angle of some pixel in a photo relative to the webcam that I'm using. I'm new to this sort of thing and I'm using a webcam. Essentially, I take a photo, process it, and I end up with a pixel value in the image that is what I'm looking for. I then need to somehow turn that pixel value into some meaningful quantity---I need to find a line/vector that passes through the pixel and the camera. I don't need magnitude, just phase.
How does one go about doing this? Is camera calibration necessary? I've been reading a bit about it but am unsure.
Thanks
You don't need to know the distance to the object, only the resolution and angle of view of the camera.
Computing the angle requires only simple linear interpolation. For example, let's assume a camera with a resolution of 1920x1080 that covers a 45 degree angle of view across the diagonal.
In this case, sqrt(19202 + 10802) gives 2292.19 pixels along the diagonal. That means each pixel represents 45/2292.19 = .0153994 degrees.
So, compute the distance from the center (in pixels), multiply by .0153994, and you have its angle from the center (for that camera -- for yours, you'll obviously have to use its resolution and angle of view).
Of course, this is somewhat approximate -- its accuracy will depend on how much distortion the lens has. With a zoom lens (especially wider angle) you can generally count on that being fairly high. With a fixed focal length lens (especially if it doesn't cover an angle wider than 90 degrees or so) it'll usually be pretty low.
If you want to improve accuracy, you can start by taking a picture of a flat rectangle with straight lines just inside the angle of view of the camera, then compute the distortion based on the deviation from perfectly straight in the resulting picture. If you're working with an extremely wide angle lens, this may be nearly essential. With a lens covering a narrower angle of view (especially, as already mentioned, if it's fixed focal length) it's rarely likely to be worthwhile (such lenses often have only a fraction of a percent of distortion).
Recipe:
1 - Calibrate the camera, obtaining the camera matrix K and distortion parameters D. In OpenCV this is done as described in this tutorial.
2 - Remove the nonlinear distortion from the pixel positions of interest. In OpenCV is done using the undistortPoints routine, without passing arguments R and P.
3 - Back-project the pixels of interest into rays (unit vectors with the tail at the camera center) in camera 3D coordinates, by multiplying their pixel positions in homogeneous coordinates times the inverse of the camera matrix.
4 - The angle you want is the angle between the above vectors and (0, 0, 1), the vector associated to the camera's focal axis.