I'm writing to ask about homography and perspective projection.
I'm trying to write a piece of code that will "warp" my image so that its corners align with 4 reference points in 3D space. The game engine I'm running it in already lets me get their screen positions, so I already have the screen-space coordinates of both xi,yi and ui,vi, normalized to values between 0 and 1.
I have to mention that I don't have a degree in mathematics, which seems to be a requirement in the posts I've seen on this topic so far, but I'm hoping there is actually a solution to this problem that one can comprehend. I never had a chance to take classes in Computer Vision.
The reason I came here is that in all the posts I've seen online, the simple explanation that I came across is that each point must be put into a 1x3 matrix and multiplied by a 3x3 homography, which consists of 9 components h1,h2,h3...h9, and this transformation matrix will transform each point to the correct perspective. And that's where I'm hitting a brick wall - how do I calculate the transformation matrix? It feels like it should be a relatively simple algebraic task, but apparently it's not.
At this point I've spent days reading on the topic, and the solutions I've come across are either based on MATLAB (which has a ton of mathematical functions built in), or include elaborations and discussions that don't really explain much. Sometimes they suggest tons of different parameters and simplifications but rarely explain why or what their purpose is, or they reference books and studies that have since been removed from the web, and I found myself more confused than I was in the beginning. Most of the resources I managed to find online were also written for a different context - image stitching and 3D engine development.
I also want to mention that I need to run this code each frame on the CPU, and I'm fairly concerned about the effect of having to run too many matrix transformations and solving a ton of linear algebra equations.
I apologize for not asking about any specific code, but my general question is - can anyone point me in the right direction with this issue?
Limit the problem you deal with.
For example, if you always warp the entire rectangular image, you can treat the image corners as having the coordinates {(0,0), (1,0), (0,1), (1,1)}.
This simplifies the equations enough that you can solve them by hand.
Then you can implement the result directly.
Note: A homography is only defined up to scale, so you can reduce the degrees of freedom to 8 (e.g. you can solve the equations with h9=1).
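To make the simplification above concrete, here is a minimal sketch of the closed-form solution for the unit-square case with h9 = 1. The function names `square_to_quad` and `apply_homography` are made up for this illustration, and the corner order follows the list above: (0,0), (1,0), (0,1), (1,1). It maps image coordinates to screen coordinates; if you need the opposite direction you would invert the 3x3 matrix.

```python
def square_to_quad(p00, p10, p01, p11):
    """Homography (3x3, row-major, h9 = 1) mapping the unit-square corners
    (0,0), (1,0), (0,1), (1,1) to the points p00, p10, p01, p11."""
    x0, y0 = p00
    x1, y1 = p10
    x3, y3 = p01
    x2, y2 = p11

    # Differences relative to the corner opposite (0,0), plus the "sum" term
    # that is zero when the quad is a parallelogram (pure affine case).
    dx1, dy1 = x1 - x2, y1 - y2
    dx2, dy2 = x3 - x2, y3 - y2
    sx, sy = x0 - x1 + x2 - x3, y0 - y1 + y2 - y3

    det = dx1 * dy2 - dx2 * dy1
    if abs(det) < 1e-12:
        raise ValueError("degenerate quad: three corners are collinear")

    # The two perspective terms, found by Cramer's rule on a 2x2 system.
    g = (sx * dy2 - dx2 * sy) / det
    h = (dx1 * sy - sx * dy1) / det

    return [
        [x1 - x0 + g * x1, x3 - x0 + h * x3, x0],
        [y1 - y0 + g * y1, y3 - y0 + h * y3, y0],
        [g, h, 1.0],
    ]


def apply_homography(H, u, v):
    """Map (u, v) through H, including the perspective divide."""
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    x = (H[0][0] * u + H[0][1] * v + H[0][2]) / w
    y = (H[1][0] * u + H[1][1] * v + H[1][2]) / w
    return x, y
```

This is only a handful of multiplications and two divisions per matrix, so recomputing it every frame on the CPU should not be a concern.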
Best advice I can give: read a good book on the subject, for example "Multiple View Geometry in Computer Vision" by Hartley and Zisserman.
I realize there are many cans of worms related to what I'm asking, but I have to start somewhere. Basically, what I'm asking is:
Given two photos of a scene, taken with unknown cameras, to what extent can I determine the (relative) warping between the photos?
Below are two images of the 1904 World's Fair. They were taken at different levels on the wireless telegraph tower, so the cameras are more or less vertically in line. My goal is to create a model of the area (in Blender, if it matters) from these and other photos. I'm not looking for a fully automated solution, e.g., I have no problem with manually picking points and features.
Over the past month, I've taught myself what I can about projective transformations and epipolar geometry. For some pairs of photos, I can do pretty well by finding the fundamental matrix F from point correspondences. But the two below are causing me problems. I suspect that there's some sort of warping - maybe just an aspect ratio change, maybe more than that.
My process is as follows:
I find correspondences between the two photos (the red jagged lines seen below).
I run the point pairs through Matlab (actually Octave) to find the epipoles. Currently, I'm using Peter Kovesi's "Peter's Functions for Computer Vision".
In Blender, I set up two cameras with the images overlaid. I orient the first camera based on the vanishing points. I also determine the focal lengths from the vanishing points. I orient the second camera relative to the first using the epipoles and one of the point pairs (below, the point at the top of the bandstand).
For each point pair, I project a ray from each camera through its sample point, and mark the closest convergence of the pair (in light yellow below). I realize that this leaves out information from the fundamental matrix - see below.
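The "closest convergence" step in the last item can be sketched as below: a minimal example that treats both rays as infinite lines and returns the midpoint of the shortest segment between them, plus the length of that segment. The function name and argument layout are assumptions made for this illustration, not taken from Blender or any library.

```python
import math


def closest_point_between_rays(o1, d1, o2, d2):
    """Midpoint of the shortest segment between two 3D lines, and its length.

    o1, o2 are the camera centres; d1, d2 are the ray directions
    (they do not need to be normalised).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [p - q for p, q in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)

    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique closest point")

    # Parameters along each line where the connecting segment is shortest.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom

    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    mid = [(x + y) / 2 for x, y in zip(p1, p2)]
    gap = math.sqrt(sum((x - y) ** 2 for x, y in zip(p1, p2)))
    return mid, gap
```

The returned gap is a handy per-correspondence measure of how badly the two rays miss each other, which makes the spread described in the next paragraph easy to quantify.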
As you can see, the points don't converge very well. The ones from the left spread out the further you go horizontally from the bandstand point. I'm guessing that this shows differences in the camera intrinsics. Unfortunately, I can't find a way to find the intrinsics from an F derived from point correspondences.
In the end, I don't think I care about the individual intrinsics per se. What I really need is a way to apply the intrinsics to "correct" the images so that I can use them as overlays to manually refine the model.
Is this possible? Do I need other information? Obviously, I have little hope of finding anything about the camera intrinsics. There is some obvious structural info though, such as which features are orthogonal. I saw a hint somewhere that the vanishing points can be used to further refine or upgrade the transformations, but I couldn't find anything specific.
Update 1
I may have found a solution, but I'd like someone with some knowledge of the subject to weigh in before I post it as an answer. It turns out that Peter's Functions for Computer Vision has a function for doing a RANSAC estimate of the homography from the sample points. Using m2 = H*m1, I should be able to plot the mapping of m1 -> m2 over top of the actual m2 points on the second image.
The only problem is, I'm not sure I believe what I'm seeing. Even on an image pair that lines up pretty well using the epipoles from F, the mapping from the homography looks pretty bad.
I'll try to capture an understandable image, but is there anything wrong with my reasoning?
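For reference, the m2 = H*m1 check described in this update can be sketched as follows. This is a minimal illustration with made-up function names; it assumes H is a 3x3 row-major nested list and the points are plain (x, y) pixel coordinates. The important detail is the division by the third homogeneous coordinate before comparing against the measured m2.

```python
import math


def map_point(H, p):
    """Apply a 3x3 homography to a 2D point, with the perspective divide."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def transfer_errors(H, pts1, pts2):
    """Distance between H*m1 and the measured m2 for each correspondence."""
    errors = []
    for m1, m2 in zip(pts1, pts2):
        px, py = map_point(H, m1)
        errors.append(math.hypot(px - m2[0], py - m2[1]))
    return errors
```

Large errors here do not necessarily mean the code is wrong; they can also mean that a single homography is simply not a valid model for the point set.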
A couple answers and suggestions (in no particular order):
A homography will only correctly map between point correspondences when either (a) the camera undergoes a pure rotation (no translation) or (b) the corresponding points are all co-planar.
The fundamental matrix only relates uncalibrated cameras. The process of recovering a camera's calibration parameters (intrinsics) from unknown scenes, known as "auto-calibration", is a rather difficult problem. You'd need these parameters (focal length, principal point) to correctly reconstruct the scene.
If you have (many) more images of this scene, you could try using a system such as Visual SFM: http://ccwu.me/vsfm/ It will attempt to automatically solve the Structure From Motion problem, including point matching, auto-calibration and sparse 3D reconstruction.
To give you an idea of where I'm coming from, this started as a teaching exercise to get a 12-year-old video game addict into coding. The 2D games, I did in SDL with him and that was fine because I wasn't planning on going into 3D. Yeah, right! So now I'm in at the deep end in OpenGL and mainly trying to figure out exactly what it can and cannot do. I understand the theory (still working on beziers and nurbs if the truth be told) and could code the whole thing by hand in calculated triangular vertices but I'd hate to spend days on that only to be told that there's a built in function/library that does the whole thing faster and easier.
Quadrics seem to be extremely powerful but not terribly flexible. Consider the human head - roughly speaking a 3x4x3 sphere - or a torso as a truncated cone that's taller than it is wide than it is thick. Again, a quadric shape with independent x, y and z radii. Since only one radius is provided, am I right in thinking that I would have to generate it around the origin and then apply a scaling matrix to adjust them? Furthermore, if this is so, am I also correct in thinking that saving the results into a vertex array rather than a frame list results in the system neither knowing nor caring how they got there?
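The "generate around the origin, then scale" idea can be sketched outside of OpenGL like this. It is only an illustration under assumptions: the function name, the stack/slice layout and the plain tuple output are made up here, and it builds the vertices by hand rather than calling any quadric routine.

```python
import math


def ellipsoid_vertices(rx, ry, rz, stacks=8, slices=16):
    """Vertices of a unit sphere centred on the origin, scaled per axis."""
    verts = []
    for i in range(stacks + 1):
        phi = math.pi * i / stacks              # 0 at the top pole, pi at the bottom
        for j in range(slices):
            theta = 2.0 * math.pi * j / slices  # angle around the Y axis
            # Point on the unit sphere ...
            x = math.sin(phi) * math.cos(theta)
            y = math.cos(phi)
            z = math.sin(phi) * math.sin(theta)
            # ... then the non-uniform scale, same effect as a scaling matrix.
            verts.append((rx * x, ry * y, rz * z))
    return verts
```

Once the scaled positions are stored in an array, nothing downstream depends on how they were produced. One thing to watch: under a non-uniform scale the normals are no longer the normalised positions, so they need to be recomputed (or transformed with the inverse transpose of the scale) if lighting is used.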
Transitions: I'm familiar with the basic transitions but, again, consider the torso. It can achieve, maybe, a 45 degree twist from the hips to the shoulders that is distributed linearly across the entire length, or even a sideways lean. This is applied around the Y or Z axis respectively, but I've obviously missed something about applying transformations that are based on an independent value (e.g. rot = dist x (max_rot/max_dist)). Again, I could do this by hand (and will probably have to in order to apply the correct physics) but does OpenGL have this functionality built in somewhere?
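As a sketch of the rot = dist x (max_rot/max_dist) idea done by hand: each vertex gets its own rotation about the Y axis, with the angle growing linearly with height. The function name and parameters are made up for this illustration. As far as I know, a fixed-function transform applies a single matrix to every vertex it draws, so a per-vertex angle like this has to be computed on the CPU or in a vertex shader.

```python
import math


def twist_about_y(vertices, max_angle_deg, y_min, y_max):
    """Rotate each vertex about the Y axis by an angle proportional to its height.

    Vertices at y_min are left untouched; vertices at y_max are rotated
    by the full max_angle_deg.
    """
    twisted = []
    for x, y, z in vertices:
        frac = (y - y_min) / (y_max - y_min)      # dist / max_dist
        angle = math.radians(max_angle_deg) * frac  # rot
        c, s = math.cos(angle), math.sin(angle)
        # Standard rotation about Y: x' = x*c + z*s, z' = -x*s + z*c
        twisted.append((c * x + s * z, y, -s * x + c * z))
    return twisted
```

A sideways lean would follow the same pattern with a rotation about the Z axis instead.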
Any other areas of research I need to put in would be appreciated in the notes.
OK, I am posting my conundrums of life to Stack Overflow after 4 days of mindless programming when nothing seems to get things right, or at least close to right. Sorry for being a little dramatic, but I feel like a lousy programmer today.
Anyway, my problem is:
To obtain Fundamental matrix using RANSAC (N>8).
I have two images with a wide baseline but sufficient overlap, so that an adequate number of SURF keypoints (~308) are matched correctly (I plot them).
Now here lies the problem. I pass the 2D points to cv::findFundamentalMat but I get completely baseless results. The function returns:
FundMat=[2.05148e-13 3.72341 -2.03671e+10
1.6701e+26 -4.17712 4.59533e+29
3.32414e+18 2.8843 1.91069e-26]
To circumvent the large dynamic range of the matrix, Hartley suggested normalising the data points (in Euclidean space, not the projective-space normalisation)... Even after doing that, the result is almost the same (10^-9 to 10^9).
I understand that FundMat is accurate only up to scale, but a range of 10^-9 to 10^+9 is too much.
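For readers unfamiliar with the normalisation mentioned above, here is a minimal sketch of it as it is usually described: translate the points so their centroid is at the origin, then scale them so their mean distance from the origin is sqrt(2). The function name and the returned matrix layout are assumptions made for this illustration.

```python
import math


def hartley_normalize(points):
    """Return (normalised points, 3x3 transform T) for a list of (x, y) points."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    mean_dist = sum(math.hypot(x - cx, y - cy) for x, y in points) / n
    if mean_dist == 0:
        raise ValueError("all points coincide; cannot normalise")

    s = math.sqrt(2.0) / mean_dist
    # T maps homogeneous (x, y, 1) to the normalised point.
    T = [[s, 0.0, -s * cx],
         [0.0, s, -s * cy],
         [0.0, 0.0, 1.0]]
    normed = [(s * (x - cx), s * (y - cy)) for x, y in points]
    return normed, T
```

Each image gets its own transform (T1 and T2). If F_hat is estimated from the normalised points, the matrix for the original pixel coordinates is recovered as F = T2^T * F_hat * T1.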
I referred to other questions here but I don't seem to get any leads:
findfundamentalmatrix-doesnt-find-fundamental-matrix
how-to-calculate-the-fundamental-matrix-for-stereo-vision
Any ideas would be great. This is a very important step when considering uncalibrated images for the rest of the software pipeline.
In case the code is helpful (it's not indented or colored though; there's too little space here):
https://sites.google.com/site/3drecon124/
It's solved... silly human error. There was a data type conversion from double to float, and it caused data to be fetched from incorrect locations in memory. Now it's smooth and the epipolar constraint is satisfied up to scale.
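A quick sanity check for a result like this is to evaluate the epipolar constraint x2^T * F * x1 for each matched pair; it should be close to zero for good matches. Below is a minimal sketch, assuming F is a 3x3 row-major nested list and the points are (x, y) pixel coordinates. The function name is made up for this illustration.

```python
def epipolar_residuals(F, pts1, pts2):
    """Return x2^T * F * x1 for each correspondence (homogeneous w = 1)."""
    residuals = []
    for (x1, y1), (x2, y2) in zip(pts1, pts2):
        p1 = (x1, y1, 1.0)
        p2 = (x2, y2, 1.0)
        # Epipolar line in image 2 for the point in image 1: l = F * x1
        line = [sum(F[r][c] * p1[c] for c in range(3)) for r in range(3)]
        residuals.append(sum(p2[r] * line[r] for r in range(3)))
    return residuals
```

Because F is only defined up to scale, the absolute size of these residuals depends on how F is scaled, so it is more meaningful to compare them against each other (or to convert them to point-to-line distances) than to look at a single raw value.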
I'm working with eigenfaces for a facial recognition program I am writing. I have a couple questions about how eigenfaces are actually generated:
Are they generated from a lot of pictures of different people, or a lot of pictures of the same person?
Do these people need to include the people you want to recognize? If not, then how would any type of comparison be made?
Is an eigenface determined for every image you provide, or do multiple pictures go towards creating one eigenface?
This is all about the generation or learning phase of the eigenfaces. Thanks for any help or pointing me in the right direction!
I actually find the description for Eigenfaces on Wikipedia quite useful. To answer your questions:
Yes, you should take pictures from many different people.
No, the eigenfaces basically give you a way to describe other faces. You can think of the eigenfaces as a basis in a vector space. You have to make sure that you can describe the face that you want to recognise with the eigenfaces that you have. If you only use Caucasian faces to determine the eigenfaces, you might have problems describing a variety of Asian faces with them and vice versa.
The eigenfaces are computed from a set of images, i.e. multiple images lead to multiple eigenfaces.
Edit: Answering the question, that Kevin added in the comment to the question:
The idea behind using eigenfaces is that you can express an image of a face by mixing eigenfaces together. Let's suppose you have three eigenfaces ef_1, ef_2, ef_3 and you have an image of a face f_1 = a_1 * ef_1 + a_2 * ef_2 + a_3 * ef_3. The eigenfaces do not change, regardless of which face you want to express with them; however, the coefficients a = (a_1, a_2, a_3) are characteristic of the face. This is what you would use to compare two faces.
But in order to get to the stage where you can use eigenfaces, you first have to align (register) an observed face with the eigenfaces, which is not trivial and a completely different topic (see pxu's answer).
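The coefficient comparison described above can be sketched like this. It is a minimal illustration that assumes the eigenfaces are already computed, flattened to plain lists and orthonormal, so that each coefficient is just a dot product. The mean-face subtraction is standard PCA practice rather than something stated in this answer, and both function names are made up for the example.

```python
import math


def face_coefficients(face, mean_face, eigenfaces):
    """Coefficients (a_1, a_2, ...) of a face in an orthonormal eigenface basis."""
    centered = [f - m for f, m in zip(face, mean_face)]
    return [sum(c * e for c, e in zip(centered, ef)) for ef in eigenfaces]


def coefficient_distance(a, b):
    """Euclidean distance between two coefficient vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Two images are then compared through the distance between their coefficient vectors, not pixel by pixel; a small distance suggests the same person, with the threshold chosen from your own data.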
P.S.: I recommend, that you keep an eye on Area 51: Computer Vision, which is a Stack Overflow sister site about computer vision in the making.
Pictures of many different people are necessary to get enough support to cover all possible faces.
No need for that, although you need to represent all dimensions. A good analogy is to barycentric coordinates for describing the location of a point in a triangle. You are getting a weighted average to the vertices. If you don't have sufficient vector support (for example, only having two points), then you can't describe points that lie outside the line no matter how you play with the weighted average. This is essentially bjoernz's point for Caucasian vs. Asian faces. Note that this analogy is a gross simplification. The weights in eigenfaces are actually more like PCA or Fourier coefficients.
Each image gets turned into an eigenface which is a vector of principal components.
Nota bene: you need very good registration of the faces. Eigenfaces is notoriously bad about translation/rotation invariance. Your results are likely to be terrible unless you register well. The original Turk and Pentland paper was groundbreaking not just because of the technique but for the scale and quality of data set they gathered which enabled said technique.
I apologize for the length of this question and give a pre-emptive thanks for anyone who reads through this!
So I've spent the last few days going over the GJK algorithm. I understand the general concepts behind it, and understand most of the nitty-gritty of its implementation in 2D thanks to the wonderful article by William Bittle at http://www.codezealot.org/archives/88 .
I've implemented his pseudocode (found at the end of the article) in my own C++ project; however, I want to make a 3D implementation. My weakness is in using the dot products to test the Voronoi regions and the tripleProducts to get perpendicular lines. But I'm trying to read up more on that.
My problem comes down to the containsOrigin function. I'm having trouble visualizing and accounting for the new Voronoi regions that the z axis adds. I just can't seem to wrap my head around how to determine which region contains the origin. I assume there are 4 I have to account for, each extending from the triangular planes that comprise the 4 faces of the tetrahedron simplex. If the origin is not within any of those regions, then it is contained, and we have a collision.
How do I go about testing if it is contained in a particular Voronoi region / which triangular face is pointing in the direction of the origin?
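The face test described above can be sketched in isolation as follows. This is only the yes/no containment check for a complete tetrahedron, written under assumptions: the function name is made up, the vertices are plain 3-tuples, and the tetrahedron is not flat. A full 3D containsOrigin would also shrink the simplex to the offending face and pick a new search direction, which this sketch does not do.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def tetrahedron_contains_origin(a, b, c, d):
    """True if the origin lies inside (or on) the tetrahedron a, b, c, d."""
    # Each entry: three vertices of a face, then the vertex opposite that face.
    faces = [(a, b, c, d), (a, c, d, b), (a, d, b, c), (b, d, c, a)]
    for p, q, r, opposite in faces:
        normal = _cross(_sub(q, p), _sub(r, p))
        # Flip the normal so it points away from the opposite vertex (outward).
        if _dot(normal, _sub(opposite, p)) > 0:
            normal = (-normal[0], -normal[1], -normal[2])
        # The vector from p to the origin is -p. If it points along the outward
        # normal, the origin is in the region beyond this face.
        if _dot(normal, (-p[0], -p[1], -p[2])) > 0:
            return False
    return True
```

The face for which the test fails is the one "pointing in the direction of the origin", and its outward normal is a natural next search direction.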
The current 2D algorithm checks if a triangle is made; if not, then the simplex is a line and it finds the 3rd point. I assume the 3D algorithm will check if a tetrahedron is made; if not, then it will check for a triangle, and if there is one, it will find a 4th point to make a tetrahedron (how would I get this? using a normal in the direction of the origin?). If a triangle isn't made, it will find a 3rd point to make a triangle (do I still use the triple product for this like in 2D?).
Any suggestions, outlines, resources, code augmentations, comments are much appreciated.
Depending on what result you expect from the GJK algorithm you might want to look at this nice tutorial from Molly Rocket: https://mollyrocket.com/849
Be aware though that his implementation only outputs intersection? yes/no. But it might be a nice start.