Perspective Projection and Z-buffering of a 3D head to form a face image - c++

Input: 53490 3D points and for each point (xyz) and color (rgb) of a head
Output: 2D face image as viewed from a particular position / direction
Platform: Matlab C/C++
After study, I found the steps to be implemented
Perspective Projection http://en.wikipedia.org/wiki/3D_projection
Z-buffering http://en.wikipedia.org/wiki/Z-buffering
Phong reflection model http://en.wikipedia.org/wiki/Phong_reflection_model
I implemented the above 3 steps in Matlab. But it takes 8 min for the execution. The 2D rendering is part of my project; I will be calling the 2D rendering part 5000 times later. I want the execution time under 1sec.
The bulk (99.9%) of time is taking for z-buffering. The implementation is done following the wiki link.
Can anyone help me to reduce the time in Matlab or suggest other platform?
Any tutorials/demo references to understand the above steps will be helpful.
Thanks in advance

I don't recommend you doing this in matlab, because you may need to visualize a big volume.
Try vtk, and you may need some programming.
Here is a simple one (3D project) from ImageJ: http://imagejdocu.tudor.lu/doku.php?id=gui:image:stacks

Related

Perspective projection based on 4 points in 2D

I'm writing to ask about homography and perspective projection.
I'm trying to write a piece of code, that will "warp" my image so that its corners align with 4 reference points that are in the 3D space - however, the game engine that I'm running it in, already allows me to get the screen position of them, so I already have their screen-space coordinates of both xi,yi and ui,vi, normalized to values between 0 and 1.
I have to mention that I don't have a degree in mathematics, which seems to be a requirement in the posts I've seen on this topic so far, but I'm hoping there is actually a solution to this problem that one can comprehend. I never had a chance to take classes in Computer Vision.
The reason I came here is that in all the posts I've seen online, the simple explanation that I came across is that each point must be put into a 1x3 matrix and multiplied by a 3x3 homography, which consists of 9 components h1,h2,h3...h9, and this transformation matrix will transform each point to the correct perspective. And that's where I'm hitting a brick wall - how do I calculate the transformation matrix? It feels like it should be a relatively simple algebraic task, but apparently it's not.
At this point I spent days reading on the topic, and the solutions I've come across are either based on matlab (which have a ton of mathematical functions built into them), or include elaborations and discussions that don't really explain much; sometimes they suggest tons of different parameters and simplifications, but rarely explain why and what's their purpose, or they are referencing books and studies that have been since removed from the web, and I found myself more confused than I was in the beginning. Most of the resources I managed to find online are also made in a different context - image stitching and 3d engine development.
I also want to mention that I need to run this code each frame on the CPU, and I'm fairly concerned about the effect of having to run too many matrix transformations and solving a ton of linear algebra equations.
I apologize for not asking about any specific code, but my general question is - can anyone point me in the right direction with this issue?
Limit the problem you deal with.
For example, if you always warp the entire rectangular image, you can treat that the coordinates of the image corners are {(0,0), (1,0), (0,1), (1,1)}.
This can simplify the equation, and you'll be able to solve the equation by yourself.
So you'll be able to implement the answer.
Note : Homograpy is scale invariant. So you can decrease the freedom to 8. (e.g. you can solve the equation under h9=1).
Best advice I can give: read a good book on the subject. For example, "Multiple View Geometry" by Hartley and Zisserman

GLSL truncated signed distance representation (TSDF) implementation

I am looking forward to implement a model reconstruction of RGB-D images. Preferred on mobile phones. For that I read, it is all done with an TSDF-representation. I read a lot of papers now over hierarchical structures and other ideas to speed this up, but my problem is, that I still hat no clue how to actually implement this representation.
If I have a volume grid of size n, so n x n x n and I want to store in each voxel the signed distance, weight and color information. My only guess is, that I have to build a discrete set of points, for each voxel position. And with GLSL "paint" all these points and calculate the nearest distance. But that don't seem quite good or efficient to calculate this n^3 times.
How can I imagine to implement such a TSDF-representation?
The problem is, my only idea is to render the voxel grid to store in the data of signed distances. But for each depth map I have to render again all voxels and calculate all distances. Is there any way to render it the other way around?
So can't I render the points of the depth map and store informations in the voxel grid?
How is the actual state of art to render such a signed distance representation in an efficient way?
You are on the right track, it's an ambitious project but very cool if you can do it.
First, it's worth getting a good idea how these things work. The original paper identifying a TSDF was by Curless and Levoy and is reasonably approachable - a copy is here . There are many later variations but this is the starting point.
Second, you will need to create nxnxn storage as you have said. This very quickly gets big. For example, if you want 400x400x400 voxels with RGB data and floating point values for distance and weight then that will be 768MB of GPU memory - you may want to check how much GPU memory you have available on a mobile device. Yup, I said GPU because...
While you can implement toy solutions on CPU, you really need to get serious about GPU programming if you want to have any kind of performance. I built an early version on an Intel i7 CPU laptop. Admittedly I spent no time optimising it but it was taking tens of seconds to integrate a single depth image. If you want to get real time (30Hz) then you'll need some GPU programming.
Now you have your TSFD data representation, each frame you need to do this:
1. Work out where the camera is with respect to the TSDF in world coordinates.
Usually you assume that you are the origin at time t=0 and then measure your relative translation and rotation with regard to the previous frame. The most common way to do this is with an algorithm called iterative closest point (ICP) You could implement this yourself or use a library like PCL though I'm not sure if they have a mobile version. I'd suggest you get started without this by just keeping your camera and scene stationary and build up to movement later.
2. Integrate the depth image you have into the TSDF This means updating the TSDF with the next depth image. You don't throw away the information you have to date, you merge the new information with the old.
You do this by iterating over every voxel in your TSDF and :
a) Work out the distance from the voxel centre to your camera
b) Project the point into your depth camera's image plane to get a pixel coordinate (using the extrinsic camera position obtained above and the camera intrinsic parameters which are easily searchable for Kinect)
c) Lookup up the depth in your depth map at that pixel coordinate
d) Project this point back into space using the pixel x and y coords plus depth and your camera properties to get a 3D point corresponding to that depth
e) Update the value for the current voxel's distance with the value distance_from_step_d - distance_from_step_a (update is usually a weighted average of the existing value plus the new value).
You can use a similar approach for the voxel colour.
Once you have integrated all of your depth maps into the TSDF, you can visualise the result by either raytracing it or extracting the iso surface (3D mesh) and viewing it in another package.
A really helpful paper that will get you there is here. This is a lab report by some students who actually implemented Kinect Fusion for themselves on a PC. It's pretty much a step by step guide though you'll still have to learn CUDA or similar to implement it
You can also check out my source code on GitHub for ideas though all normal disclaimers about fitness for purpose apply.
Good luck!
After I posted my other answer I thought of another approach which seems to match the second part of your question but it definitely is not reconstruction and doesn't involve using a TSDF. It's actually a visualisation but it is a lot simpler :)
Each frame you get an RGB and a Depth image. Assuming that these images are registered, that is the pixel at (x,y) in the RGB image represents the same pixel as that at (x,y) in the depth image then you can create a dense point cloud coloured using the RGB data. To do this you would:
For every pixel in the depth map
a) Use the camera's intrinsic matrix (K), the pixel coordinates and the depth value in the map at that point to project the point into a 3D point in camera coordinates
b) Associate the RGB value at the same pixel with that point in space
So now you have an array of (probably 640x480) structures like {x,y,z,r,g,b}
You can render these using on GLES just by creating a set of vertices and rendering points. There's a discussion on how to do this here
With this approach you throw away the data every frame and redo from scratch. Importantly, you don't get a reconstructed surface, and you don't use a TSDF. You can get pretty results but it's not reconstruction.

2D-Visibility/Light - Efficient Polygon-Ray intersection

Im trying to write a game in 2D with Sfml. For that game i need a Lightengine and some code that can give me the area of the world that is visible to the player. AS both problems fit very well together (are pratically the same) i would like to solve both problems at once.
My world will be loaded from files in which the hitboxes of objects will be represented as Polygons.
I now wrote some code that takes a list of Polygons and the Direction of a Ray that follows the mouse and finds the closest intersection with any of these polygons.
The next step now would be to cast rays from the players or lights Position towards the edges of the polygons, aswell rays offset by +-0.000001 radians to determine the visible area and give it back as a polygon.
The Problem though is that my algorithm (it calculates the inersection between two lines with vector mathematics) is too slow.
In my very good PC i get 100fps with 300 egdes and one Ray.
I now read many articles online but couldnt find one best solution. But as far as i read it should be much faster to calculate intersections with triangles.
My question now: would it be meaningly faster to triangulate the polygons once while loading the map and then use ray-triangle intersection or is there any better way that you know of to solve my problem?
I also heard of bounding Volumen hierachies but i dont know howmuch impact that would have.
Im a bit surprised of how much power my algorithm consumes, as it only has to calculate some 2 dimensional intersections...
For everyone looking for the solution I finally went with:
I discovered the Box2D Physics Engine and I am now using the b2World::RayCast(...) function to determine whether and where a ray hits an object in my scene.
For now everything works fine and smooth (did no exact benchmark yet) :)
http://www.iforce2d.net/b2dtut/world-querying
I got it to work with the help of this site
Have a nice Day! :)

Augmented Reality OpenGL+OpenCV

I am very new to OpenCV with a limited experience on OpenGL. I am willing to overlay a 3D object on a calibrated image of a checkerboard. Any tips or guidance?
The basic idea is that you have 2 cameras: one is the physical one (the one where you are retriving the images with opencv) and one is the opengl one. You have to align those two matrices.
To do that, you need to calibrate the physical camera.
First. You need a distortion parameters (because every lens more or less has some optical distortion), and build with those parameters the so called intrinsic parameters. You do this with printing a chessboard in a paper, using it for get some images and calibrate the camera. It's full of nice tutorial about that on the internet, and from your answer it seems you have them. That's nice.
Then. You have to calibrate the position of the camera. And this is done with the so called extrinsic parameters. Those parameters encoded the position and the rotation the the 3D world of those camera.
The intrinsic parameters are needed by the OpenCV methods cv::solvePnP and cv::Rodrigues and that uses the rodrigues method to get the extrinsic parameters. This method get in input 2 set of corresponding points: some 3D knowon points and their 2D projection. That's why all augmented reality applications need some markers: usually the markers are square, so after detecting it you know the 2D projection of the point P1(0,0,0) P2(0,1,0) P3(1,1,0) P4(1,0,0) that forms a square and you can find the plane lying on them.
Once you have the extrinsic parameters all the game is easily solved: you just have to make a perspective projection in OpenGL with the FoV and the aperture angle of the camera from intrinsic parameter and put the camera in the position given by the extrinsic parameters.
Of course, if you want (and you should) understand and handle each step of this process correctly.. there is a lot of math - matrices, angles, quaternion, matrices again, and.. matrices again. You can find a reference in the famous Multiple View Geometry in Computer Vision from R. Hartley and A. Zisserman.
Moreover, to handle correctly the opengl part you have to deal with the so called "Modern OpenGL" (remember that glLoadMatrix is deprecated) and a little bit of shader for loading the matrices of the camera position (for me this was a problem because I didn't knew anything about it).
I have dealt with this some times ago and I have some code so feel free to ask any kind of problems you have. Here some links I found interested:
http://ksimek.github.io/2012/08/14/decompose/ (really good explanation)
Camera position in world coordinate from cv::solvePnP (a question I asked about that)
http://www.morethantechnical.com/2010/11/10/20-lines-ar-in-opencv-wcode/ (fabulous blog about computer vision)
http://spottrlabs.blogspot.it/2012/07/opencv-and-opengl-not-always-friends.html (nice tricks)
http://strawlab.org/2011/11/05/augmented-reality-with-OpenGL/
http://www.songho.ca/opengl/gl_projectionmatrix.html (very good explanation on opengl camera settings basics)
some other random usefull stuffs:
http://docs.opencv.org/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html (documentation, always look at the docs!!!)
Determine extrinsic camera with opencv to opengl with world space object
Rodrigues into Eulerangles and vice versa
Python Opencv SolvePnP yields wrong translation vector
http://answers.opencv.org/question/23089/opencv-opengl-proper-camera-pose-using-solvepnp/
Please read them before anything else. As usual, once you got the concept it is an easy joke, need to crash your brain a little bit against the wall. Just don't be scared from all those math : )

OpenGL Dynamic Object Motion Blur

I've been following the GPU Gems 3 tutorial on how to blur based on camera movement. However I'm wanting to implement a blur based on object movement too. The solution is presented in the article (see quote below), however I'm curious as to how exactly to implement this.
At the moment I'm multiplying the object's matrix by the view-projection, then separately again for the previous-view-projection and then passing them into the pixel shader to calculate the velocity instead of just the view-projections.
If that is in fact the correct method, then why am I not simply able to pass in the model-view-projection? I would have assumed they would be the same value?
GPU Gems 3 Motion Blur
To generate a velocity texture for rigid dynamic objects, transform the object by using the current frame's view-projection matrix and the last frame's view-projection matrix, and then compute the difference in viewport positions the same way as for the post-processing pass. This velocity should be computed per-pixel by passing both transformed positions into the pixel shader and computing the velocity there.
Check out my research I did a few months ago on this topic: https://slu-files.s3.us-east-1.amazonaws.com/Fragment_shader_dynamic_blur.pdf
(source: stevenlu.net)
(source: stevenlu.net)
Sadly I did not implement textured objects when producing this material, but do use your imagination. I am working on a game engine so when that finally sees the light of day in the form of a game, you can be sure that I'll come and post breadcrumbs here.
It primarily addresses how to implement this effect in 2D, and in cases where objects do not overlap. There is not really a good way to handle using a fragment shader to "sweep" samples in order to generate "accurate" blur. While the effect approaches pixel-perfection as the sample count is cranked up, the geometry that must be generated to cover the sweep area has to be manually assembled using some "ugly" techniques.
In full 3D it's a rather difficult problem to know which pixels a dynamic object will sweep over during the course of a frame. Even with static geometry and a moving camera the solution proposed by the GPU Gems article is incorrect when moving past things quickly because it is unable to address that issue of requiring blending of the area swept out by something moving...
That said, if this approximation which neglects the sweep is sufficient (and it may be) then what you can do to extend to dynamic objects is to take their motion into account. You'll need to work out the details of course but look at lines 2 and 5 in the second code block on the article you linked: They are the current and previous screen space "positions" of the pixel. You simply have to somehow pass in the matrices that will allow you to compute the previous position of each pixel, taking into account the dynamic motion of your object.
It shouldn't be too bad. In the pass where you render your dynamic object you send in an extra matrix that represents its motion over the last frame.
Update:
I found that this paper describes an elegant and performant approach that provides somewhat high quality physically correct blurring for a 3D pipeline. It'll be hard to do much better than this within the constraint of rendering the full scene no more than one time for performance reasons.
I noticed with some of the examples the quality of the velocity buffer could be better. for example a rotating wheel should have some curves in the velocity space. I believe if they can be set properly (may require custom fragment shaders to render the velocity out...) they will look intuitively correct like the spinning cube seen above from my 2D exploration into dynamic motion blurring.