Creating a 3D model of an object with non-overlapping stereo cameras? - computer-vision

Say I wanted to create a 3D model of an object, and I have a pair of cameras that will take a photo of the object at a given distance (the position of the cameras is fixed relative to the object itself). However, the field of view of the cameras is such that neither camera sees the whole object; only part of the object is visible to both cameras. See the figure below for what I mean.
Is it possible to create a 3D model of the object using the images from the two cameras, even though neither camera sees the entire object? If so, how can this be done?

Related

Fittest polygon bounding objects in an image

Is there any method to create a polygon (not a rectangle) around an object in an image for object recognition?
Please refer to the following images: the result I am looking for, and the original image.
I am not looking for bounding rectangles like this. I know the concepts of transfer learning, using pre-trained models for object recognition, and other object detection concepts.
The main aim is object detection, but giving results not as a bounding box but as a tighter-fitting polygon instead. Links to some resources or papers would be helpful.
Here is a very simple (and a bit hacky) idea, but it might help: take a per-pixel scene labeling algorithm, e.g. SegNet, and then turn the resulting segmented image into a binary image, where the white pixels are the class of interest (in your example, white for cars and black for the rest). Now compute edges. You can add those edges to the original image to obtain a result similar to what you want.
What you want is called image segmentation, which is different from object detection. The best performing methods for common object classes (e.g. cars, bikes, people, dogs, ...) do this using trained CNNs, and are usually called semantic segmentation networks. This will, in theory, give you regions in your image corresponding to the object you want. After that you can fit an enclosing polygon using what is called the convex hull.
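For the last step, here is a minimal sketch in Python with OpenCV, assuming you already have a binary mask (e.g. the thresholded segmentation output from the other answer); the file names are made up, and the findContours return signature is the OpenCV 4.x one:

import cv2

# Binary mask: white (255) pixels belong to the class of interest (e.g. cars)
mask = cv2.imread("car_mask.png", cv2.IMREAD_GRAYSCALE)         # hypothetical file name
_, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

# Outer contours of the white regions (OpenCV 4.x returns contours, hierarchy)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

polygons = []
for c in contours:
    hull = cv2.convexHull(c)                  # enclosing convex polygon
    poly = cv2.approxPolyDP(hull, 2.0, True)  # optional: simplify to fewer vertices (2 px tolerance)
    polygons.append(poly)

# Draw the polygons on the original image to visualise the result
image = cv2.imread("original.png")            # hypothetical file name
cv2.polylines(image, polygons, True, (0, 255, 0), 2)
cv2.imwrite("result.png", image)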

How to place 3D objects in a scene?

I'm developing a simple rendering engine as a pet project.
So far I'm able to load geometry data from Wavefront .obj files and render them onscreen separately. I know that vertex coordinates stored in these files are defined in Model space and to place them correctly in the scene I need to apply Model-to-world transform matrix to each vertex position (am I even correct here?).
But how do I define those matrices for each object? Do I need to develop a separate tool for scene composition, in which I would move objects around and the tool would calculate the appropriate Model-to-world matrices based on translations, rotations and so on?
I would look into the "Scene Graph" data structure. It's essentially a tree, where nodes (may) define their transformations relative to their parent. Think of it this way. Each of your fingers moves relative to your hand. Moving your hand, rotating or scaling it also involves doing the same transformation on your fingers.
It is therefore beneficial to express all these transformations relative to one another, and to combine them to determine the overall transformation of each individual part of your model. As such you don't just define the direct model-to-view transformation, but rather a transformation from each part to its parent.
This saves having to define a whole bunch of transformations yourself, which in the vast majority of cases are related in the way I described anyway. As such you save yourself a lot of work by representing your models/scene in this manner.
Each of these relative transformations is usually a 4x4 affine transformation matrix. Combining these is just a matter of multiplying them together to obtain the combination of all of them.
A description of Scene Graphs
In order to animate objects within a scene graph, you need to specify transformations relative to their parent in the tree. For instance, spinning wheels of a car need to rotate relative to the car's chassis. These transformations largely depend on what kind of animations you'd like to show.
So I guess the answer to your question is "mostly yes". You do need to define transformations for every single object in your scene if things are going to look good. However, organising the scene into a tree structure makes this process a lot easier to handle.
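A minimal sketch of that idea in Python with numpy; the class, the helper and the column-vector convention are my own assumptions, not any particular engine's API:

import numpy as np

def translation(x, y, z):
    # helper building a 4x4 translation matrix (column-vector convention)
    m = np.eye(4)
    m[:3, 3] = [x, y, z]
    return m

class SceneNode:
    def __init__(self, local=None):
        self.local = np.eye(4) if local is None else local   # transform relative to the parent
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def draw(self, parent_world=np.eye(4)):
        world = parent_world @ self.local    # combine the parent's transform with our relative one
        # ... here you would upload 'world' as the model-to-world matrix and issue the draw call
        for child in self.children:
            child.draw(world)

# Moving the hand automatically moves the finger, because the finger's
# transform is expressed relative to the hand.
hand = SceneNode(translation(0.0, 1.0, 0.0))
finger = hand.add(SceneNode(translation(0.1, 0.0, 0.0)))
hand.draw()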
Regarding the creation of those matrices, what you have to do is export a scene from an authoring package.
That software can be the same one you used to model the objects in the first place: Maya, Lightwave...
Right now you have your objects independent of each other.
So, using the package of your choice, either find a file format that lets you export a scene you would have made by positioning each of your meshes where you want them, like FBX or glTF, or make your own.
Either way, there is a scene structure containing models, transforms, lights, cameras, everything you want in your engine.
After that you have to parse that structure.
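To make "parse that structure" a bit more concrete, here is a rough sketch of walking the node hierarchy of an exported glTF scene with Python's json module and computing each mesh's Model-to-world matrix; only the "matrix" and "translation" forms of a node transform are handled, and the file name is made up:

import json
import numpy as np

def node_local_matrix(node):
    # glTF nodes carry either a full column-major "matrix" or separate translation/rotation/scale;
    # only "matrix" and "translation" are handled here to keep the sketch short
    if "matrix" in node:
        return np.array(node["matrix"], dtype=float).reshape(4, 4, order="F")
    m = np.eye(4)
    if "translation" in node:
        m[:3, 3] = node["translation"]
    return m

def walk(nodes, index, parent_world):
    node = nodes[index]
    world = parent_world @ node_local_matrix(node)
    if "mesh" in node:
        print(node.get("name", index), "-> Model-to-world:\n", world)
    for child in node.get("children", []):
        walk(nodes, child, world)

with open("scene.gltf") as f:                        # hypothetical exported scene
    gltf = json.load(f)

scene = gltf["scenes"][gltf.get("scene", 0)]
for root in scene["nodes"]:
    walk(gltf["nodes"], root, np.eye(4))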
You'll find here some explanations regarding how you could architect that:
https://nlguillemot.wordpress.com/2016/11/18/opengl-renderer-design/
Good luck,

general scheme of 3d geometry storage and usage in OpenGL or DirectX

I know OpenGL only slightly, and all these docs and tutorials are quite hard to read, so they do not help much. I do have some vision of how it could work, though, and I would only like some clarification or validation of that vision.
I assume the 3D world is built from 3D meshes, and each mesh may be held in one array or a few arrays (storing the geometry for that mesh). I also assume that some meshes may be cloned, so to speak, and used more than once in the scene. So in my vision I have, say, 50 meshes, but some of them are used more than once... Let's call those clones instances of a mesh (each mesh may have 0, 1 or more instances).
Is this vision okay? Is there more that should be added?
I understand that each instance should have its own position and orientation, so do we have some array of instances, each element containing one position-orientation matrix? Or do those matrices exist only in the code paths (you know what I mean: I set such a matrix, then send a mesh, then modify the matrix, then send the mesh again, until all instances are sent)?
Does this exhaust the geometry (non-shader) part of things?
(Then the shader part comes, which I also do not quite understand; there is a tremendous amount of hype around shaders, whereas this geometry part seems more important to me. Well, whatever.)
Can someone validate the vision I have described here?
So you have a model which will contain one or more meshes, a mesh that will contain one or more groups, and a group that will contain vertex data.
There is only a small difference between a model and a mesh: a model will contain other data, such as textures, which will be used by a mesh (or meshes).
A mesh will also contain data on how to draw the groups such as a matrix.
A group is a part of the mesh which is generally used to move a part of the model using sub matrices. Take a look at "skeletal animation".
So, as traditional fixed pipelines suggest, you will usually have a stack of matrices which can be pushed and popped to define, so to speak, "sub-positions". Imagine having a model representing a dragon. The model would most likely consist of a single mesh, a texture and maybe some other drawing data. At runtime this model would have some matrix defining the model's basic position, rotation and even scale. Then, when the dragon needs to fly, you would move its wings. Since the wings may be identical, there may be only one group, but the mesh would contain data to draw it twice with a different matrix. So the model has its matrix, which is then multiplied with the wing group's matrix to draw the wing itself:
push model matrix
multiply with the dragon matrix
push model matrix
multiply with the wing matrix
draw wing
pop matrix
push matrix
multiply with the second wing matrix
draw second wing
pop matrix
... draw other parts of the dragon
pop matrix
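The same push/multiply/draw/pop logic, sketched in Python with numpy; the matrices and the draw call are placeholders, and a real renderer would upload the top of the stack as the model matrix:

import numpy as np

class MatrixStack:
    def __init__(self):
        self.stack = [np.eye(4)]

    def push(self):
        self.stack.append(self.stack[-1].copy())    # duplicate the current top

    def multiply(self, m):
        self.stack[-1] = self.stack[-1] @ m         # combine with the transform on top

    def pop(self):
        self.stack.pop()

    def top(self):
        return self.stack[-1]

# Placeholders standing in for the real matrices and draw call
dragon_matrix = np.eye(4)
left_wing_matrix = np.eye(4)
right_wing_matrix = np.eye(4)
def draw_wing(model_matrix):
    pass

stack = MatrixStack()
stack.push(); stack.multiply(dragon_matrix)         # dragon's placement in the world
stack.push(); stack.multiply(left_wing_matrix)      # wing relative to the dragon
draw_wing(stack.top())
stack.pop()
stack.push(); stack.multiply(right_wing_matrix)
draw_wing(stack.top())
stack.pop()
# ... draw other parts of the dragon with the dragon matrix still on top of the stack
stack.pop()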
You can probably imagine the wing is then divided into multiple parts each again containing an internal relative matrix achieving a deeper level of matrix usage and drawing.
The same procedures would then be used on other parts of the model/mesh.
So the idea is to put as little data as possible on the GPU and reuse it. When a model is loaded, all its textures and vertex data should be sent to the GPU and be ready to use. The CPU must be aware of those buffers and how they are used. A whole model may have a single vertex buffer where each of the draw calls reuses a different part of the buffer, but for now just imagine there is a buffer for every major part of the model, such as a wing, a head, a body, a leg...
In the end we usually come up with something like a shared object containing all the data needed to draw a dragon, which would be textures and vertex buffers. Then we have another dragon object which points to that model and contains all the data needed to draw a specific dragon in the scene. That would include the matrix for its position in the scene, the matrices for the groups to animate the wings and other parts, maybe some size or even some basic color to combine with the original model... Some state is usually stored here as well, such as speed, some AI parameters or maybe even hit points.
So in the end what we want to do is something like foreach(dragon in dragons) dragon.draw(), which will use the dragon's internal data to set up the basic model matrices and any additional data needed. The draw method will then call out to all the groups and meshes in the model to be drawn as well, until the "recursion" is done and the whole model is drawn.
So yes, the structure of the data is quite complicated in the end, but if you start with the smaller parts and work outwards it all fits together quite nicely.
There are other runtime systems that need to be handled as well to have smooth loading. For instance, if you are in a game and there are no dragons in the vicinity, you will not have the dragon model loaded. When a dragon enters the vicinity, the model should be loaded in the background if possible, but drawn only when needed (in visual range). Then, when the dragon is gone, you may not simply unload the model; you must be sure all of the dragons are gone, and maybe even wait a little bit in case one returns. This leads to something much like a garbage collector.
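A rough sketch of that bookkeeping; the names are made up, and a real engine would do the actual loading on a worker thread:

import time

def load_model_from_disk(name):
    # placeholder loader: a real engine would read vertex/texture data, ideally in the background
    return {"name": name}

class ModelCache:
    def __init__(self, unload_delay=30.0):
        self.entries = {}             # name -> [model data, reference count, time the last user left]
        self.unload_delay = unload_delay

    def acquire(self, name):
        if name not in self.entries:
            self.entries[name] = [load_model_from_disk(name), 0, None]
        entry = self.entries[name]
        entry[1] += 1
        return entry[0]

    def release(self, name):
        entry = self.entries[name]
        entry[1] -= 1
        if entry[1] == 0:
            entry[2] = time.time()    # remember when the last dragon using this model disappeared

    def collect(self):
        # call once per frame: unload models nobody has needed for a while
        now = time.time()
        for name in list(self.entries):
            data, refs, released = self.entries[name]
            if refs == 0 and released is not None and now - released > self.unload_delay:
                del self.entries[name]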
I hope this will help you to a better understanding.

What is the standard place to keep the Model Matrix?

I have a "3D engine" which has a single model matrix.
All of my 3D objects use this model matrix (for transformation stuff).
For each object I set the model matrix to identity before using it.
So far so good, as it appears to be working just fine and as expected.
But now I am suddenly wondering if I have a design flaw.
Should each 3D object (the base object) have their own model matrix?
Is there any value in doing it this way (model matrix per 3D object)?
That is literally what the point of the model matrix is. It is an optional transformation for each object you draw from object-space (its local coordinate system) to world-space (a shared coordinate system). You can define your objects such that their positions are already in world-space and then no model matrix is necessary (as you are doing if you use an identity model matrix).
GL uses (or at least historically, it did) a matrix stack for this and it is technically the very reason it uses column-major matrices and post-multiplication. That allows you to push/pop your local coordinate transformations to the top of the stack immediately before/after you draw an object while leaving the other transformations that rarely change such as world->eye (view matrix) or eye->clip (projection matrix) untouched.
Elaborating in a separate answer because a comment is too short, but Andon's answer is the right one.
Think for instance of loading two 3D meshes made by two different artists.¹
The first artist assumed that 1 unit in your model space is 1 meter, while the other artist assumed 1 inch.
So, you load a car mesh done by the first artist, which is maybe 3 units long, and a banana mesh, which is 8 units long.
Also, the first artist decided to put the origin of the points for the car mesh at its center, while the artist who did the banana mesh put the banana lying on the X axis, with the base of the fruit at the X=2000 point.
So how do you show both of these meshes in your 3D world? Are you going to have a banana whose length is almost three times the length of your car? That makes absolutely no sense.
How are you going to place them next to each other? Or how are you going to place them lying on a plane? The fact that the local coordinate systems are totally random makes it impossible to translate your objects in a coherent way.
This is where the model->world matrix comes in.
It allows you to specify a per-model transformation that brings all models into a "shared", unique, coherent space -- the world space.
For instance, you could translate both models so that their origin would lie "in a corner" and all the mesh's vertices in the first octant. Also, if your world space uses metres, you would need to scale the banana mesh by 0.0254 to bring its size into metres as well. And maybe you'd like to rotate the banana and have it lying on the Y axis instead of the X.
At the end of the game, for each "unique model" in your world, you'd have its associated model matrix, and use it when drawing it.
Also: for each instance of a model, you could think of applying an extra local transformation (in world coordinates). For instance, you'd like to translate the car model to a certain point in the world. So instead of having
(Model space) --> Model matrix ---> (World space) ---> World matrix ---> (Final world space)
you could multiply the two matrices together and have only one matrix that brings points from model space straight to final world space.
¹ This point is a bit moot, in that in any proper asset pipeline the first thing you'd do would be to bring all the models into a coherent coordinate system; I'm just giving an example...
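A small numpy sketch of the banana/car example, assuming column vectors and metres as the world unit; the numbers other than the 0.0254 inch-to-metre factor are invented:

import numpy as np

def scale(s):
    m = np.eye(4); m[0, 0] = m[1, 1] = m[2, 2] = s
    return m

def translate(x, y, z):
    m = np.eye(4); m[:3, 3] = [x, y, z]
    return m

def rotate_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    m = np.eye(4)
    m[:2, :2] = [[c, -s], [s, c]]
    return m

# Banana: the artist used inches and put the fruit far out along the X axis.
# Model matrix: move it back to the origin, rotate X onto Y, convert inches to metres.
banana_model = scale(0.0254) @ rotate_z(np.pi / 2) @ translate(-2000, 0, 0)

# Car: already in metres and centred, so its model matrix can stay at identity.
car_model = np.eye(4)

# Extra per-instance transform in world coordinates (e.g. park the car somewhere):
car_world = translate(10, 0, 5)
car_final = car_world @ car_model        # one matrix from model space straight to final world space

p_model = np.array([2000.0, 0.0, 0.0, 1.0])   # base of the banana in its model space
p_world = banana_model @ p_model              # ends up at the world origin
print(p_world)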

how do I re-project points in a camera - projector system (after calibration)

I have seen many blog entries, videos and source code on the internet about how to carry out camera + projector calibration using OpenCV, in order to produce the camera.yml, projector.yml and projectorExtrinsics.yml files.
I have yet to see anyone discuss what to do with these files afterwards. Indeed I have done a calibration myself, but I don't know what the next step is in my own application.
Say I write an application that now uses the camera - projector system that I calibrated to track objects and project something on them. I will use findContours() to grab some points of interest from the moving objects, and now I want to project these points (from the projector!) onto the objects!
What I want to do is (for example) track the centre of mass (COM) of an object and show a point on the camera view of the tracked object (at its COM). Then a point should be projected onto the COM of the object in real time.
It seems that projectPoints() is the OpenCV function I should use after loading the yml files, but I am not sure how I will account for all the intrinsic & extrinsic calibration values of both camera and projector. Namely, projectPoints() requires as parameters:
vector of points to re-project (duh!)
rotation + translation matrices. I think I can use the projectorExtrinsics here. Or I can use the composeRT() function to generate a final rotation & translation matrix from the projectorExtrinsics (which I have in the yml file) and the cameraExtrinsics (which I don't have. Side question: should I not save them in a file too??).
intrinsics matrix. This is tricky now. Should I use the camera or the projector intrinsics matrix here?
distortion coefficients. Again, should I use the projector or the camera coefficients here?
other params...
So if I use either the projector or the camera (which one??) intrinsics + coefficients in projectPoints(), then I will only be 'correcting' for one of the two instruments. Where / how will I use the other instrument's intrinsics??
What else do I need to use apart from load()ing the yml files and projectPoints()? (Perhaps undistortion?)
ANY help on the matter is greatly appreciated.
If there is a tutorial or a book (no, O'Reilly's "Learning OpenCV" does not talk about how to use the calibration yml files either - only about how to do the actual calibration), please point me in that direction. I don't necessarily need an exact answer!
First, you seem to be confused about the general role of a camera/projector model: its role is to map 3D world points to 2D image points. This sounds obvious, but this means that given extrinsics R,t (for orientation and position), distortion function D(.) and intrinsics K, you can infer for this particular camera the 2D projection m of a 3D point M as follows: m = K.D(R.M+t). The projectPoints function does exactly that (i.e. 3D to 2D projection), for each input 3D point, hence you need to give it the input parameters associated to the camera in which you want your 3D points projected (projector K&D if you want projector 2D coordinates, camera K&D if you want camera 2D coordinates).
Second, when you jointly calibrate your camera and projector, you do not estimate a set of extrinsics R,t for the camera and another for the projector, but only one R and one t, which represent the rotation and translation between the camera's and projector's coordinate systems. For instance, this means that your camera is assumed to have rotation = identity and translation = zero, and the projector has rotation = R and translation = t (or the other way around, depending on how you did the calibration).
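Putting those two points together, loading the files and projecting 3D points (expressed in the camera's coordinate frame) into projector pixels might look roughly like this in Python; the YAML node names are guesses, since they depend on how your calibration code wrote the files:

import cv2
import numpy as np

def read_mat(path, node):
    fs = cv2.FileStorage(path, cv2.FILE_STORAGE_READ)
    m = fs.getNode(node).mat()
    fs.release()
    return m

K_proj = read_mat("projector.yml", "camera_matrix")            # projector intrinsics (guessed node name)
D_proj = read_mat("projector.yml", "distortion_coefficients")  # projector distortion (guessed node name)
R = read_mat("projectorExtrinsics.yml", "R")                   # camera -> projector rotation
t = read_mat("projectorExtrinsics.yml", "T")                   # camera -> projector translation

rvec, _ = cv2.Rodrigues(R)                                     # projectPoints wants a rotation vector

# 3D point(s) in the camera's coordinate frame (the camera is the "world" here),
# e.g. the tracked object's COM at 1.5 m in front of the camera
points_3d = np.array([[0.0, 0.0, 1.5]], dtype=np.float32)

proj_pix, _ = cv2.projectPoints(points_3d, rvec, t, K_proj, D_proj)
print(proj_pix)    # 2D projector pixel(s) where you should draw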
Now, concerning the application you mentioned, the real problem is: how do you estimate the 3D coordinates of a given point?
Using two cameras and one projector, this would be easy: you could track the objects of interest in the two camera images, triangulate their 3D positions from the two 2D projections using the function triangulatePoints, and finally project this 3D point into the projector's 2D coordinates using projectPoints in order to know where to display things with your projector.
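Roughly, in Python, assuming the two cameras' intrinsics K1, K2, the stereo extrinsics R12, t12 between them, and the projector parameters K_proj, D_proj, R_proj, t_proj are already loaded as numpy arrays, and that the tracked pixel coordinates have been undistorted:

import cv2
import numpy as np

# Camera 1 is the origin of the world; camera 2 is placed by the stereo extrinsics R12, t12
P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K2 @ np.hstack([R12, t12.reshape(3, 1)])

# One tracked point seen in both camera images (2xN arrays of pixel coordinates, here N=1)
pts1 = np.array([[320.0], [240.0]])
pts2 = np.array([[300.0], [244.0]])

X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)     # 4x1 homogeneous coordinates
X = (X_h[:3] / X_h[3]).T                            # 1x3 Euclidean 3D point

# Project the 3D point into the projector image to know where to draw
rvec, _ = cv2.Rodrigues(R_proj)
proj_pix, _ = cv2.projectPoints(X.astype(np.float32), rvec, t_proj, K_proj, D_proj)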
With only one camera and one projector, this is still possible but more difficult because you cannot triangulate the tracked points from only one observation. The basic idea is to approach the problem like a sparse stereo disparity estimation problem. A possible method is as follows:
project a non-ambiguous image (e.g. black and white noise) using the projector, in order to texture the scene observed by the camera.
as before, track the objects of interest in the camera image
for each object of interest, correlate a small window around its location in the camera image with the projector image, in order to find where it projects in the projector 2D coordinates
Another approach, which unlike the one above would use the calibration parameters, could be to do a dense 3D reconstruction using stereoRectify and StereoBM::operator() (or gpu::StereoBM_GPU::operator() for the GPU implementation), map the tracked 2D positions to 3D using the estimated scene depth, and finally project into the projector using projectPoints.
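A rough outline of that dense route with the modern Python API names (the old StereoBM::operator() maps to StereoBM_create/compute); the camera and projector parameters, the camera image and the projected pattern are assumed to be already loaded, with both images 8-bit grayscale:

import cv2
import numpy as np

# K_cam, D_cam, K_proj, D_proj, R, t come from the yml files; image_size is (width, height);
# cam_image is the camera frame and proj_pattern is the image you projected.
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K_cam, D_cam, K_proj, D_proj, image_size, R, t)
map1x, map1y = cv2.initUndistortRectifyMap(K_cam, D_cam, R1, P1, image_size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K_proj, D_proj, R2, P2, image_size, cv2.CV_32FC1)
cam_rect = cv2.remap(cam_image, map1x, map1y, cv2.INTER_LINEAR)
proj_rect = cv2.remap(proj_pattern, map2x, map2y, cv2.INTER_LINEAR)

# Block matching gives a fixed-point disparity map (scaled by 16); reprojectImageTo3D turns it into depth
bm = cv2.StereoBM_create(64, 15)                    # numDisparities=64, blockSize=15
disparity = bm.compute(cam_rect, proj_rect)
points_3d = cv2.reprojectImageTo3D(disparity.astype(np.float32) / 16.0, Q)

# points_3d[y, x] is the 3D position of rectified camera pixel (x, y); look up your tracked
# 2D positions there, then project into the projector with projectPoints as above.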
Anyhow, this is easier, and more accurate, using two cameras.
Hope this helps.