Object recognition of a set of objects - computer-vision

In a computer vision project, the image I want to process can be partitioned in "zones" containining multiple products of the same kind.
Provided that I can retrieve image information of all the possible kinds of product, I need to detect which kind is present in each zone, without the need to detect the position of each single product. In summary, I need to recognize "sets of products".
As additional info, the products have not a rigid shape, they are not oriented in the same manner and luminosity changes (so I am basically searching for shape, orientation and luminosity invariant approaches).
The reliable info I can exploit is that the products logos - or parts of them - are often visible and the products are quite colorful.
I would like to know about possible approaches that exploit the fact that I know the zones partition and approaches that do not exploit it.

Related

Quick and Dirty Image Registration Tool

Given two images, e.g. two cats, is there a library that includes a "quick and dirty" way of telling by how much the two images differ regarding translation and rotation? Image registration is a big field and every application I run into seems to be tailored to medical scans and usually has certain domain specific caps on the transformation ranges. The tool I require should take two images as an input and return an angle of rotation and a translation vector, maybe even a confidence metric, it's that simple. (Most algorithms out there are heavy-duty and focus on minute details for alignment, the tool I'm looking for need not be as exact.)
If it needs not be very precise, you can probably tweak the code from PyImageSearch to better suit your application.
If you know that the two images you are going to compare do contain the same object (i.e., if there is no additional object recognition problem that comes before this step), then you can maybe try using the ORB detector to find the good keypoints, and then estimate the homography using ViSP

3D Object Detection & Tracking using PCL

I have a task where i am asked to track parcels(carton boxes) of different dimensions moving on a conveyor
I am using Asus Xtion pro camera and also i have been asked to use Point cloud data to detect & track the object on the conveyor
I have read so many model-based methods where we need to create a model of the object to be detected and then we perform keypoints extraction,feature mapping and so many other concepts between the scene and the model.
Since i am using Boxes of different dimensions, i definitely need a model of each of them to match with the scene.
My question : Can i have one common point cloud model of a box of some dimension and compare it with any box that comes under the view of a camera? I meant can i have one scalable model to compare with box of any dimension? is it possible?
My chief doesn't want the project to be dependent on many models for detection and tracking. One common model which is scalable or parameterized should do the trick it seems.
Thanks in advance

Neural network topology for object recognition on aerial photos (computer vision)

My objective is to recognize the footprints of buildings on aerial photos. Having heard about recent progress in machine vision (ImageNet Large Scale Visual Recognition Challenges) I though I could (at least) try to use neural networks for this task.
Can anybody give me the idea what should be the topology of such a network? I guess it should have as many outputs as inputs (which means all the pixels in picture) since I want to recognize the outlines of buildings with their (at least approximate) placement on the picture.
I guess the input pictures should be of standard size, with each pixel normalized to grey scale or YUV color space (1 value per color) and maybe normalized resolution (each pixel should represent fixed size in reality). I am not sure if the picture could be preprocessed in any other way before inputting into net, maybe by extracting the edges first?
The tricky part is how the outputs should be represented and how to train the net. Using just e.g. output=0 for the pixel within building footprint and 1 for the pixel outside of it, might not be the best idea. Maybe I should teach the network to recognize edges of the building instead so the pixels which represent building edges should have 1's and 0's for the rest of pixels?
Can anybody throw in some suggestions about network topology/inputs/outputs formats?
Or maybe this task is hopelessly difficult and I have 0 chances to solve it?
I think we need a better definition of "buildings". If you want to do building "detection", that is detect the presence of a building of any shape/size, this is difficult for a cascade classifier. You can try the following, though:
Partition a set of known images to fixed-size blocks.
Label each block as "building", "not building", or
"boundary(includes portions
of both)"
Extract basic features like intensity histograms, edges,
hough lines, HOG, etc.
Train SVM classifiers based on these features (you can try others, too, but I recommend SVM by experience).
Now you can partition your images again and use the trained classifier to get the results. The results will have to be combined to identify buildings.
This will still need some testing to get the parameters(size of histograms, parameters of SVM classifier etc.) right.
I have used this approach to detect "food" regions on images. The accuracy was below 70%, but my guess is that it will be better for buildings.

Design of virtual trial room

As a part of my masters project I proposed to build a virtual trial room application intended for retail clothing stores. Currently its meant to be used directly in store though it may be extended for online stores as well.
This application will show customers how a selected apparel would look on them by showing it on their 3D replica on screen.
It involves 3 steps
Sizing up the customer
Building customer replica 3D humanoid model
Apply simulated cloth on the model
My question is about the feasibility of the project and choice of framework.
Can this be achieved in real time using a normal Desktop computer? If yes what would be appropriate framework ( hardware, software, programming language etc ) for this purpose?
On the work I have done till now, I was planning to achieve above steps in following ways
for step 1 : option a) Two cameras for front and side views or
option b) 1 Kinect or 2 Kinect for complete 3D data
for step 2: either use makehuman (http://www.makehuman.org/) code to build a customised 3D model using above data or build everything from scratch, unsure about the framework.
for step 3: Just need few cloth samples, so thought of building simulated clothes in blender.
Currently I have just the vague idea about different pieces but I am not sure of how to develop complete application.
Theoretically this can be achieved in real time. Many usefull algorithms for video tracking, stereo vision and 3d recostruction are available in OpenCV library. But it's very difficult to build robust solution. For example, you'll probably need to track human body which moves frame to frame and perform pose estimation (OpenCV contains POSIT algorithm), however it's not trivial to eliminate noise in resulting objects coordinates. For inspiration see a nice work on video tracking.
You might want to choose another way, simplify some things, avoid complicated stuff do things less dynamicaly and estimate only clothes size and approximate human location. I this case most likely you will create something usefull and interesting.
I've lost link to one online fiting room where hands and body detection implemented. Using Kinnect solves many problems. But If for some reason you won't use it then AR(augmented reality) helps you (yet another fitting room)

Comparing SIFT features stored in a mysql database

I'm currently extending an image library used to categorize images and i want to find duplicate images, transformed images, and images that contain or are contained in other images.
I have tested the SIFT implementation from OpenCV and it works very well but would be rather slow for multiple images. Too speed it up I thought I could extract the features and save them in a database as a lot of other image related meta data is already being held there.
What would be the fastest way to compare the features of a new images to the features in the database?
Usually comparison is done calculating the euclidean distance using kd-trees, FLANN, or with the Pyramid Match Kernel that I found in another thread here on SO, but haven't looked much into yet.
Since I don't know of a way to save and search a kd-tree in a database efficiently, I'm currently only seeing three options:
* Let MySQL calculate the euclidean distance to every feature in the database, although I'm sure that that will take an unreasonable time for more than a few images.
* Load the entire dataset into memory at the beginning and build the kd-tree(s). This would probably be fast, but very memory intensive. Plus all the data would need to be transferred from the database.
* Saving the generated trees into the database and loading all of them, would be the fastest method but also generate high amounts of traffic as with new images the kd-trees would have to be rebuilt and send to the server.
I'm using the SIFT implementation of OpenCV, but I'm not dead set on it. If there is a feature extractor more suitable for this task (and roughly equally robust) I'm glad if someone could suggest one.
So I basically did something very similar to this a few years ago. The algorithm you want to look into was proposed a few years ago by David Nister, the paper is: "Scalable Recognition with a Vocabulary Tree". They pretty much have an exact solution to your problem that can scale to millions of images.
Here is a link to the abstract, you can find a download link by googleing the title.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1641018
The basic idea is to build a tree with a hierarchical k-means algorithm to model the features and then leverage the sparse distribution of features in that tree to quickly find your nearest neighbors... or something like that, it's been a few years since I worked on it. You can find a powerpoint presentation on the authors webpage here: http://www.vis.uky.edu/~dnister/Publications/publications.html
A few other notes:
I wouldn't bother with the pyramid match kernel, it's really more for improving object recognition than duplicate/transformed image detection.
I would not store any of this feature stuff in an SQL database. Depending on your application it is sometimes more effective to compute your features on the fly since their size can exceed the original image size when computed densely. Histograms of features or pointers to nodes in a vocabulary tree are much more efficient.
SQL databases are not designed for doing massive floating point vector calculations. You can store things in your database, but don't use it as a tool for computation. I tried this once with SQLite and it ended very badly.
If you decide to implement this, read the paper in detail and keep a copy handy while implementing it, as there are many minor details that are very important to making the algorithm work efficiently.
The key, I think, is that is this isn't a SIFT question. It is a question about approximate nearest neighbor search. Like image matching this too is an open research problem. You can try googling "approximate nearest neighbor search" and see what type of methods are available. If you need exact results, try: "exact nearest neighbor search".
The performace of all these geometric data structures (such as kd-trees) degrade as the number of dimensions increase, so the key I think is that you may need to represent your SIFT descriptors in a lower number of dimensions (say 10-30 instead of 256-1024) to have really efficient nearest neighbor searches (use PCA for example).
Once you have this I think it will become secondary if the data is stored in MySQL or not.
I think speed is not the main issue here. The main issue is how to use the features to get the results you want.
If you want to categorize the images (e. g. person, car, house, cat), then the Pyramid Match kernel is definitely worth looking at. It is actually a histogram of the local feature descriptors, so there is no need to compare individual features to each other. There is also a class of algorithms known as the "bag of words", which try to cluster the local features to form a "visual vocabulary". Again, in this case once you have your "visual words" you do not need to compute distances between all pairs of SIFT descriptors, but instead determine which cluster each feature belongs to. On the other hand, if you want to get point correspondences between pairs of images, such as to decide whether one image is contained in another, or to compute the transformation between the images, then you do need to find the exact nearest neighbors.
Also, there are local features other than SIFT. For example SURF are features similar to SIFT, but they are faster to extract, and they have been shown to perform better for certain tasks.
If all you want to do is to find duplicates, you can speed up your search considerably by using a global image descriptor, such as a color histogram, to prune out images that are obviously different. Comparing two color histograms is orders of magnitude faster than comparing two sets each containing hundreds of SIFT features. You can create a short list of candidates using color histograms, and then refine your search using SIFT.
I have some tools in python you can play with here . Basically its a package that uses SIFT transformed vectors, and then computes a nearest lattice hashing of each 128d sift vector. The hashing is the important part, as it is locality sensitive, simply meaning that vectors near in R^n space result in equivalent hash collision probabilities. The work I provide is an extension of Andoni that provides a query adaptive heuristic for pruning the LSH exact search lists, as well as an optimized CUDA implementation of the hashing function. I also have a small app that does image database search with nice visual feedback, all under bsd (exception is SIFT which has some additional restrictions).