Grasp hand pose estimation with Machine learning - computer-vision

What I'm trying to do is..
Dataset: Pictures of a hand holding a stick, and I know the 3d position of joints, or 3D pose of each picture. Pictures are taken from same position, so hand is the only one that moves.
Input: A picture of hand
Output: 3D Hand Pose Is this possible, and if so, how could this be done? Since I'm a newbie to ML, I want to get ideas for good understanding. Thanks!

It should be possible, but it will be a hard research project.
Because the problem requires such complex output, a machine learning approach to the problem will take a huge number of training examples, more than you could generate by hand. A good approach might be to make a small program that can 3D-render an image of a hand in pose x, with random lighting, random hand size, etc. Then feed millions of those training images to a convolutional neural net with deep learning, where the final output neurons encode the poses.
Using that same program, an alternative approach would be to do gradient descent on the pose, repeatedly rendering poses until you get the best match. That's called a generative model. It doesn't involve a neural net, but it would probably be slow. There are no doubt other approaches too.
If you're interested, Microsoft has been working on this problem to enable new types of Xbox Kinect games: https://www.microsoft.com/en-us/research/project/fully-articulated-hand-tracking/
All in all though, if you're new to computer vision and machine learning, I'd recommend starting with simpler challenges first.

Related

random voxel terrain on the gpu (dc vs. cms)

I want to create a random fractal terrain on the gpu (with a compute shader). I've started with implementing marching cubes: Generating Complex Procedural Terrains Using the GPU, and it works really good: marching cubes on the gpu. However, marching cubes can't extract sharp features or use an adaptive resolution.
So I looked for an advanced isosurface extraction algorithm and found Dual Contouring of Hermite Data. So I implemented dc in Java to test this algorithm and it looks great: dc on the cpu. (There are some holes in the mesh and no sharp features, but I was too lazy to implement/fix this because it's only a prototype.)
But I've noticed some negative aspects:
Intercell-dependent. (I have no idea how to port this to the gpu: the only resource I've found is Dual Contouring with OpenCL.)
I don't know how to create a chunk system, because there are no "clear" borders https://i.stack.imgur.com/62dy6.png.
So I continued my search for a better algorithm and found Cubical Marching Squares: Adaptive Feature Preserving Surface Extraction from Volume Data. This seems to be the perfect algorithm for me: intercell-independent, adaptive, sharp features, primal structure and even manifold. Unfortunately, there are no resources on how to implement this algorithm except for this Cubical Marching Squares Implementation. I think I understand the two parts of the algorithm: create a grid, for each cell:
Subdivide until the maximum depth is reached or there is no need to do so.
Split each cell into 6 faces, extract their surface and stitch them together.
But I don't know how to connect those two parts (especially the part with transitional faces, page 38).
So does anybody know how to implement dc as a shader, how to implement cms or a better algorithm (maybe dual marching cubes, I think it has the same problem as dc, but I haven't tested it yet)?
As you have already mentioned for the GPU you would need no intercell-dependencies... I'm sure people have come up with workarounds for that on DC but CMS should be one of the best for all those things that you need, namely intercell-independent (by definition), preserves sharp features, creates manifold geometry, and supports adaptive resolution.
In terms of resources on CMS I agree they are quite limited.
The original paper: http://graphics.csie.ntu.edu.tw/CMS/
This project by Matt Keeter has a 'C' CMS implementation in it (https://www.mattkeeter.com/projects/kokopelli/)
https://github.com/mkeeter/kokopelli <-- code
I used Matt Keeter's implementation as a reference for my partial 'C++' implementation (the thesis that you linked), here is the code for it in case you didn't find it:
https://bitbucket.org/GRassovsky/cubical-marching-squares
However keep in mind that it is indeed partial, that is to say it has the main algorithm working (adaptive and manifold), but I haven't got around to implement the sharp-feature preservation, 2D and 3D disambiguation, etc. It is also currently a basic CPU implementation... I had the good intentions to implement all those things and make a GPU implementation but for now have had no time.
"But I don't know how to connect those two parts" - that only goes to say how badly I have written my thesis, because I try to explain how I do this :D (Mind you, I am not sure exactly how this was done in the original paper... should we say that some things are not very clear there and you would have to use your imagination :))

Generate an image that can be most easily detected by Computer Vision algorithms

Working on a small side project related to Computer Vision, mostly to try playing around with OpenCV. It lead me to an interesting question:
Using feature detection to find known objects in an image isn't always easy- objects are hard to find, especially if the features of the target object aren't great.
But if I could choose ahead of time what it is I'm looking for, then in theory I could generate for myself an optimal image for detection. Any quality that makes feature detection hard would be absent, and all the qualities that make it easy would exist.
I suspect this sort of thought went into things like QR codes, but with the limitations that they wanted QR codes to be simple, and small.
So my question for you: How would you generate an optimal image for later recognition by a camera? What if you already know that certain problems like skew, or partial obscuring would occur?
Thanks very much
I think you need something like AR markers.
Take a look at ArToolkit, ArToolkitPlus or Aruco libraries, they have marker generators and detectors.
And papeer about marker generation: http://www.uco.es/investiga/grupos/ava/sites/default/files/GarridoJurado2014.pdf
If you plan to use feature detection, than marker should be specific to used feature detector. Common practice for detector design is good response to "corners" or regions with high x,y gradients. Also you should note the scaling of target.
The simplest detection can be performed with BLOBS. It can be faster and more robust than feature points. For example you can detect circular blobs or rectangular.
Depending on the distance you want to see your markers from and viewing conditions/backgrounds you typically use and camera resolution/noise you should choose different images/targets. Under moderate perspective from a longer distance a color target is pretty unique, see this:
https://surf-it.soe.ucsc.edu/sites/default/files/velado_report.pdf
at close distances various bar/QR codes may be a good choice. Other than that any flat textured object will be easy to track using homography as opposed to 3D objects.
http://docs.opencv.org/trunk/doc/py_tutorials/py_feature2d/py_feature_homography/py_feature_homography.html
Even different views of 3d objects can be quickly learned and tracked by such systems as Predator:
https://www.youtube.com/watch?v=1GhNXHCQGsM
then comes the whole field of hardware, structured light, synchronized markers, etc, etc. Kinect, for example, uses a predefined pattern projected on the surface to do stereo. This means it recognizes and matches million of micro patterns per second creating a depth map from the matched correspondences. Note that one camera sees the pattern and while another device - a projector generates it working as a virtual camera, see
http://article.wn.com/view/2013/11/17/Apple_to_buy_PrimeSense_technology_from_the_360s_Kinect/
The quickest way to demonstrate good tracking of a standard checkerboard pattern is to use pNp function of open cv:
http://www.juergenwiki.de/work/wiki/lib/exe/fetch.php?media=public:cameracalibration_detecting_fieldcorners_of_a_chessboard.gif
this literally can be done by calling just two functions
found = findChessboardCorners(src, chessboardSize, corners, camFlags);
drawChessCornersDots(dst, chessboardSize, corners, found);
To sum up, your question is very broad and there are multiple answers and solutions. Formulate your viewing condition, camera specs, backgrounds, distances, amount of motion and perspective you expect to have indoors vs outdoors, etc. There is no such a thing as a general average case in computer vision!

Face recognition using neural networks

I am doing a project on face recognition, for that I have already used different methods like eigenface, fisherface, LBP histograms and surf. But these methods are not giving me an accurate result. Surf gives good matches for exact same images, but I need to match one image with it's own different poses(wearing glasses,side pose,if somebody is covering his face) etc. LBP compares histogram of images, i.e., only color informations. So when there is high variation on lighting condition it is not showing good results. So I heard about neural networks, but I don't know much about that. Is it possible to train the system very accurately by using neural networks. If possible how can we do that?
According to this OpenCV page, there does seem to be some support for machine learning. That being said, the support does seem to be a bit limited.
What you could do, would be to:
User OpenCV to extract the face of the person.
Change the image to grey scale.
Try to manipulate so that the face is always the same size.
All the above should be doable with OpenCV itself (could be wrong, haven't messed with OpenCV in a while) so that should save you some time.
Next, you take the image, as a bitmap maybe, and feed the bitmap as a vector to the neural network. Alternatively, as #MatthiasB recommended, you could feed the features instead of individual pixels. This would simplify the data being passed, thus making the network easier to train.
As for training, you manipulate these images as above, and then feed them to the network. If a person uses glasses occasionally, you could have cases of the same person with and without glasses, etc.

How to make rgbdemo working with non-kinect stereo cameras?

I was trying to get RGBDemo(mostly reconstructor) working with 2 logitech stereo cameras, but I did not figure out how to do it.
I noticed that there is a opencv grabber in nestk library and its header file is included in the reconstructor.cpp. Yet, when I try "rgbd-viewer --camera-id 0", it keeps looking for kinect.
My questions:
1. Is RGBDemo only working with kinect so far?
2. If RGBDemo can work with non-kinect stereo cameras, how do I do that?
3. If I need to write my own implementation for non-kinect stereo cameras, any suggestion on how to start?
Thanks in advance.
if you want to do it with non-kinect cameras. You don't even need stereo. There are algorithms now that are able to determine whether two images' viewpoints are sufficiently different that they can be used as if they were taken by a stereo camera. In fact, they use images from different cameras that are found on the internet and reconstruct 3D models of famous places. I can write you a tutorial on how to get it working. I've been meaning to do so. The software is called Bundler. Along with Bundler, people often also use CMVS and PMVS. CMVS preprocesses the images for PMVS. PMVS generates dense clouds.
BUT! I highly recommend that you don't go this route. It makes a lot of mistakes because there is so much less information in 2D images. It makes it very hard to reconstruct the 3D model. So, it ends up making a lot of mistakes, or not working. Although Bundler and PMVS are awesome compared to previous software, the stuff you can do with kinect is on a whole other level.
To use kinect will only cost you $80 for the kinect off of ebay or $99 off of amazon and another $5 for the power adapter off of amazon. So, I'd highly recommend this route. Kinect provides much more information for the algorithm to work with than 2D images do, making it much more effective, reliable and fast. In fact, it could take hours to process images with Bundler and PMVS. Whereas with kinect, I made a model of my desk in just a few seconds! It truly rocks!

Looking to develop a visual odometer (distance traveled) APP for indoor use

Are there any open source code which will take a video taken indoors (from a smart phone for example of a home or office buildings, hallways) and superimpose that on a 2D picture showing the path traveled? This can be a handr drawn picture or a photo of a floor layout.
First I thought of doing this using the accelerometer and compass sensors but thought that perhaps one can get better accuracy with the visual odometer approach. I only need 0.5 to 1 meter accuracy. The phone will also collect important information indoors (no gps) for superimposing that data on the path traveled (this is the real application of this project and we know how to do this part). The post processing of the video can be done later on a stand alone computer so speed and cpu power is not a issue.
Challenges -
The user will simply hand carry the smart phone so the video taker is moving (walking) and not fixed
limit the video rate to keep the file size small (5 frames/sec? is that ok?). Typically need perhaps a full hour of video
Will using inputs from the phone sensors help the visual approach?
any help or guidance is appreciated Thanks
I have worked in the area for quite some time. There are three points which I'd care to make.
Vision only is hard
Vision based navigation using just a cellphone camera is very difficult. Most of the literature with great results show ~1% distance traveled as state-of-the-art but is usually using stereo cameras. Stereo helps a great deal, particularly in indoor environments for coping with scale drift. I've worked on a system which achieves 0.5% distance traveled for stereo but only roughly 5% distance traveled for monocular. While I can't share code, much of our system was inspired by this Sibley and Mei paper.
Stereo code in our case ran at full 60fps on a desktop. Provided you can push data fast enough, it'll be fine. With your error envelope, you can only navigate for 100m or so. Is that enough?
Multi-sensor is way to go. Though other sensors are worse than vision by themselves.
I've heard some good work with accelerometers mounted on the foot to do ZUPT (zero velocity updates) when the foot is briefly motionless on the ground while taking a step in order to zero out drift. This approach has the clear drawback of needing to mount the device on your foot, making a vision approach largely useless.
Compass is interesting but will be distracted by the ton of metal within an office building. Translating few feet around a large metal cabinet might cause 50+ degrees of directional jump.
Ultimately, a combination of sensors is likely to be the best if you can make that work.
Can you solve a simpler problem?
How much control do you have over your environment? Can you slap down fiducial markers? Can you do wifi triangulation? Does it need to be an initial exploration? If you can go through the environment before hand and produce visual bubbles (akin to Google Street View) to match against, you'll be much more accurate.
I'm not aware of any software that does this directly (though it might exist) but stuff similar to what you want to do has been done. A few pointers:
Google for "Vision based robot localization" the problem you state is very similar to the problem robots with a camera have when they enter a new environment. In this field the approach is usually to have the robot map its environment and then use the model for later reference, but the techniques are similar to what you'll need.
Optical flow will roughly tell you in what direction the camera is moving, but it won't tell you the speed because you have no objective reference. This is because you don't know if the things you see moving in the video feed are 1cm away and very small or 1 mile away and very big.
If you know the camera matrix of the camera recording the images you could try partial 3D scene reconstruction techniques to take a stab at the speed. Note that you can do the 3D scene stuff without the camera matrix (this is the "uncalibrated" part you see in the title of a lot of the google results), the camera matrix will let you add real world object sizes (and hence distances) to your reconstruction.
The amount of images/second you need depends on the speed of the camera. More is better, but my guess is that 5/second should be sufficient at walking speeds.
Using extra sensors will help. Probably the robot localization articles talk about this as well.