Object Detection: Training Required or No Training Required? - c++

This question is related to object detection, basically detecting any "known" object. For example, imagine I have the objects below.
Table
Bottle
Camera
Car
I will take 4 photos of each of these individual objects: one from the left, another from the right, and the other 2 from above and below. I originally thought it would be possible to recognize these objects with these 4 photos each, because you have photos from all 4 angles; no matter how you see the object, you can detect it.
But I got confused by someone's idea about training the engine with thousands of positive and negative images of each object. I really don't think this is required.
So simply speaking, my question is: in order to identify an object, do I need those thousands of positive and negative images? Or are 4 photos from 4 angles enough?
I am expecting to use OpenCV for this.
Update
Actually the main thing is something like this: imagine that I have 2 laptops, one Dell and the other HP. Both are laptops, but they have clearly visible differences, including the logo. Can we do this using feature description? If not, how "hard" is the "training" process? How many pictures are needed?
Update 2
I need to detect "specific" objects, not all cars, all bottles, etc. For example, the "Maruti Car Model 123" and the "Ferrari Car Model 234" are both cars but different. Imagine I have pictures of the Maruti and Ferrari of the above-mentioned models; then I need to detect them. I don't have to worry about other cars or vehicles, or even other models of Maruti and Ferrari. But the above-mentioned "Maruti Car Model 123" should be identified as "Maruti Car Model 123" and the above-mentioned "Ferrari Car Model 234" should be identified as "Ferrari Car Model 234". How many pictures do I need for this?

Answers:
If you want to detect a specific object and you don't need to account for viewpoint changes, you can use 2D features:
http://docs.opencv.org/doc/tutorials/features2d/feature_homography/feature_homography.html
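A minimal C++ sketch in the spirit of that tutorial, using ORB instead of the non-free SURF/SIFT detectors; the file names are placeholders:

```cpp
// Sketch: detect a specific object in a scene via local features + homography.
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main()
{
    cv::Mat object = cv::imread("object.jpg", cv::IMREAD_GRAYSCALE);  // reference photo
    cv::Mat scene  = cv::imread("scene.jpg",  cv::IMREAD_GRAYSCALE);  // test photo

    // 1. Detect keypoints and compute descriptors in both images.
    cv::Ptr<cv::ORB> orb = cv::ORB::create(2000);
    std::vector<cv::KeyPoint> kpObj, kpScene;
    cv::Mat descObj, descScene;
    orb->detectAndCompute(object, cv::noArray(), kpObj, descObj);
    orb->detectAndCompute(scene,  cv::noArray(), kpScene, descScene);

    // 2. Match descriptors (Hamming distance for binary ORB descriptors).
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(descObj, descScene, matches);
    if (matches.size() < 4) {
        std::cout << "Not enough matches to estimate a homography" << std::endl;
        return 1;
    }

    // 3. Estimate a homography with RANSAC to reject bad matches.
    std::vector<cv::Point2f> ptsObj, ptsScene;
    for (const cv::DMatch& m : matches) {
        ptsObj.push_back(kpObj[m.queryIdx].pt);
        ptsScene.push_back(kpScene[m.trainIdx].pt);
    }
    cv::Mat mask;
    cv::Mat H = cv::findHomography(ptsObj, ptsScene, cv::RANSAC, 3.0, mask);

    // A reasonable number of RANSAC inliers suggests the object is present.
    std::cout << "Inliers: " << cv::countNonZero(mask) << std::endl;
    return 0;
}
```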
To distinguish between 2 logos, you'll probably need to build a detector for each logo which will be trained on a set of images. For example, you can train a Haar cascade classifier.
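As a hedged illustration of the detection step only (training the cascade itself is a separate offline process with opencv_traincascade), assuming you already have a trained cascade file, here hypothetically named logo_cascade.xml:

```cpp
// Sketch: run an already-trained Haar cascade over an image.
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main()
{
    cv::CascadeClassifier logoCascade;
    if (!logoCascade.load("logo_cascade.xml")) {  // hypothetical trained cascade
        std::cerr << "Could not load cascade" << std::endl;
        return 1;
    }

    cv::Mat img = cv::imread("laptop.jpg", cv::IMREAD_GRAYSCALE);  // placeholder image
    std::vector<cv::Rect> detections;
    logoCascade.detectMultiScale(img, detections, /*scaleFactor=*/1.1, /*minNeighbors=*/3);

    std::cout << "Found " << detections.size() << " candidate logo regions" << std::endl;
    return 0;
}
```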
To distinguish between different models of cars, you'll probably need to train a classifier using training images of each car. However, I encountered an application which does that using a nearest neighbor approach: it just extracts features from the given test image and compares them to a known set of images of different car models.
Also, I can recommend some approaches and packages if you explain more about the application.

To answer the question you asked in the title, if you want to be able to determine what the object in the picture is you need a supervised algorithm (a.k.a. trained). Otherwise you would be able to determine, in some cases, the edges or the presence of an object, but not what kind of an object it is. In order to tell what the object is you need a labelled training set.
Regarding the contents of the question, the number of possible angles in a picture of an object is infinite. If you just have four pictures in your training set, the test example could be taken in an angle that falls halfway between training example A and training example B, making it hard to recognize for your algorithm. The larger the training set the higher the probability of recognizing the object. Be careful: you never reach the absolute certainty that your algorithm will recognize the object. It just becomes more likely.

Related

How to extract the size of an object from an image

I'm attempting to assemble a pipeline that can predict the size of an object given an image, i.e. predict the size in unit X given a photo of an apple.
From what I have looked at, there are no training sets that include a calibration object trained alongside. My actual problem is to extract the nutritional information (roughly speaking) from a food group: once I know the food's w*h or bounding box area, I have everything I need to accurately count calories.
As lightly mentioned above, the only way of doing this that I am aware of is training a classifier with a calibration object; images could be taken and grouped into a class for the object you're calibrating against, such as a dime, penny, credit card, etc.: something with static dimensions.
What is throwing me off a little is whether there is a better way of doing this; if there is, please share. Additionally, are there any pre-trained models that have been calibrated using a static object and could be used to estimate the size of an object given any image?
Based on the definition of your problem, I don't think you should train your classifier to predict the size of the object. You might want to train a NN to identify some common object (dime, penny, credit card, etc.) and train another network to identify foods. Once you put your "calibration" object near the plate, you can easily normalize the image and calculate the food's size.
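A minimal sketch of that normalization step, assuming the calibration object (say, a coin of known diameter) and the food item have already been detected; the bounding boxes and the coin size below are made-up values:

```cpp
// Sketch: convert a food bounding box from pixels to centimetres using a
// calibration object of known physical size detected in the same image.
#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    // Hypothetical detections; in practice these come from your detectors.
    cv::Rect coinBox(120, 340, 90, 90);    // coin bounding box in pixels
    cv::Rect foodBox(300, 200, 410, 380);  // food bounding box in pixels
    const double coinDiameterCm = 1.8;     // known physical size of the coin

    // Pixels-per-centimetre derived from the calibration object.
    double pxPerCm = coinBox.width / coinDiameterCm;

    // Convert the food's bounding box from pixels to centimetres.
    double foodWidthCm  = foodBox.width  / pxPerCm;
    double foodHeightCm = foodBox.height / pxPerCm;

    std::cout << "Estimated food size: " << foodWidthCm << " x "
              << foodHeightCm << " cm" << std::endl;
    return 0;
}
```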

2D object detection with only a single training image

The vision system is given a single training image (e.g. a piece of 2D artwork) and is asked whether the piece of artwork is present in newly captured photos. The newly captured photos can contain many other objects, and when the artwork is present it faces up but may be occluded.
The pose space is x,y,rotation and scale. The artwork may be highly symmetric or not.
What is the latest state of the art handling this kind of problem?
I have tried/considered the following options but there are some problems in all of them. If my argument is invalid please correct me.
Deep learning (R-CNN/YOLO): a lot of labeled data is needed, which means a lot of human labor for each new piece of artwork.
Traditional machine learning (SVM, random forest): same as above.
SIFT/SURF/ORB + RANSAC or voting: when the artwork is symmetric, the matched features are mostly incorrect, and a lot of time is needed in the RANSAC/voting stage.
Generalized Hough transform: the state space is too large for the voting table. A pyramid can be applied, but it is difficult to choose universal thresholds for different kinds of artwork to proceed down the pyramid.
Chamfer matching: the state space is too large, and too much time is needed to search across it.
Object detection requires a lot of labeled data of the same class to generalize well, and in your setting it would be impossible to train a network with only a single instance.
I assume that in your case online object trackers could work; at least give them a try. There are some convolutional object trackers that work well, such as Siamese CNNs. The code is open source on GitHub, and you can watch this video to see its performance.
Online object tracking: Given the initialized state (e.g., position and size) of a target object in a frame of a video, the goal of tracking is to estimate the states of the target in the subsequent frames. (source)
You can try a traditional feature-based image processing approach, which might give true positive template matches with decent accuracy (a rough sketch of the template-side feature extraction follows the steps below).
Given the template image as in the question:
First dilate the image to join all very closely spaced connected components.
Find the convex hull of the connected object obtained above; this gives you a polygon.
Use the polygon's edge length information, e.g. the (max-length / min-length) ratio, as a feature of the template.
Also compute the pixel density inside the polygon as a second feature.
We now have 2 features.
Scene image feature vector:
Similarly, in the scene image, use dilation followed by connected-component identification, define a convex hull (polygon) around each connected object, and compute a feature vector for each object (edge info, pixel density).
Now search for the template feature vector among the scene image feature vectors by minimum feature distance (also use an upper distance threshold to avoid false positive matches).
This should give the true positive matches if the template is present in the scene image.
Exception: this method will not work for occluded objects.
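A rough, non-authoritative sketch of the template-side feature extraction described in the steps above, assuming a binary template image; the helper name hullFeatures and the file path are illustrative, not from any library:

```cpp
// Sketch: dilate, take the convex hull of the largest connected region, and
// compute an edge-length ratio and a pixel density as a 2-element feature vector.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Returns {maxEdge/minEdge, pixelDensity} for the largest blob in a binary image.
static cv::Vec2d hullFeatures(const cv::Mat& binary)
{
    // Join closely spaced components, then find the largest contour.
    cv::Mat dilated;
    cv::dilate(binary, dilated, cv::Mat(), cv::Point(-1, -1), 3);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(dilated, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return cv::Vec2d(0, 0);
    size_t largest = 0;
    for (size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[largest]))
            largest = i;

    // Convex hull (polygon) around the blob.
    std::vector<cv::Point> hull;
    cv::convexHull(contours[largest], hull);

    // Feature 1: ratio of longest to shortest polygon edge.
    double maxEdge = 0.0, minEdge = 1e9;
    for (size_t i = 0; i < hull.size(); ++i) {
        cv::Point d = hull[i] - hull[(i + 1) % hull.size()];
        double len = std::hypot((double)d.x, (double)d.y);
        maxEdge = std::max(maxEdge, len);
        minEdge = std::min(minEdge, len);
    }

    // Feature 2: fraction of foreground pixels inside the hull.
    cv::Mat hullMask = cv::Mat::zeros(binary.size(), CV_8U);
    cv::fillConvexPoly(hullMask, hull, cv::Scalar(255));
    cv::Mat inside;
    cv::bitwise_and(binary, hullMask, inside);
    double density = (double)cv::countNonZero(inside) /
                     std::max(1.0, (double)cv::countNonZero(hullMask));

    return cv::Vec2d(maxEdge / std::max(minEdge, 1e-6), density);
}

int main()
{
    cv::Mat tmpl = cv::imread("template.png", cv::IMREAD_GRAYSCALE);  // placeholder path
    cv::Mat binary;
    cv::threshold(tmpl, binary, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    cv::Vec2d f = hullFeatures(binary);
    std::cout << "edge ratio = " << f[0] << ", pixel density = " << f[1] << std::endl;
    return 0;
}
```

The same hullFeatures helper would be applied to each connected object in the scene image, and the template is matched against the scene object with the smallest feature distance.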

OpenCV - Detection of moving object C++

I am working on a Traffic Surveillance System, an OpenCV project. I need to detect moving cars and people. I am using a background subtraction method to detect moving objects and then drawing contours.
I have a problem: when two cars are moving closely together on the road, my system detects them as one car. I have tried approaches like Canny edge detection, transformations, etc. Can anyone suggest a particular methodology to solve this type of problem?
Plenty of solutions are possible.
A geometric approach would detect that the one moving blob is too big to be a single passenger car. Still, this may indicate a car with a caravan. That leads us to another question: if you have two blobs moving close together, how do you know it's two cars and not one car towing a caravan? You may need to add some elementary shape detection.
Another trivial approach is to observe that cars do not suddenly multiply. If you have 5 video frames, and in 4 of them you spot two cars, then it's very very likely that the 5th frame also has two cars.
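A minimal sketch of the geometric idea, assuming the MOG2 background subtractor the question already uses; the video file name and the single-car area threshold are placeholders that depend on the camera setup:

```cpp
// Sketch: flag foreground blobs whose area is too large for a single car.
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main()
{
    cv::VideoCapture cap("traffic.mp4");  // hypothetical input video
    cv::Ptr<cv::BackgroundSubtractorMOG2> bg = cv::createBackgroundSubtractorMOG2();

    const double singleCarMaxArea = 15000.0;  // pixels; tune per scene
    cv::Mat frame, fgMask;
    while (cap.read(frame)) {
        bg->apply(frame, fgMask);
        cv::threshold(fgMask, fgMask, 200, 255, cv::THRESH_BINARY);       // drop shadow pixels
        cv::morphologyEx(fgMask, fgMask, cv::MORPH_OPEN, cv::Mat());      // remove small noise

        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(fgMask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
        for (const auto& c : contours) {
            double area = cv::contourArea(c);
            if (area > singleCarMaxArea)
                std::cout << "Blob too big for one car, area = " << area << std::endl;
        }
    }
    return 0;
}
```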
The CV system tracks objects as moving blobs ("clouds" of moving pixels), identifies them, and distinguishes one from another in case of occlusions. When two (or more) blobs intersect, the system merges them into one combined object and marks it with the IDs of all the source objects currently included in the combination. When one of the objects separates from the combination, the CV system recognizes which one has left and reassigns the IDs appropriately.

Identifying blobs in an image as belonging to a vehicle

Any idea how I can get the smaller blobs belonging to the same vehicle count as 1 vehicle? Due to background subtraction, in the foreground mask, some of the blobs belonging to a vehicle are quite small, and hence filtering the blobs based on their size won't work.
Try filtering things based on colorDistance(), comparing the mean color of the blobs in the image containing the vehicle against a control image of the background without the car in it. The SimpleCV docs have a tutorial specifically on this topic. That said, it may not always work as expected. Another possibility (it just occurred to me) might be summing up the areas of the blobs of interest and seeing if that sum is over a given threshold, rather than looking at any one blob by itself.

What are `query` and `train` in OpenCV features2D

Everywhere in the features2D classes I see the terms query and train. For example, matches have trainIdx and queryIdx, and Matchers have a train() method.
I know the definitions of the words train and query in English, but I can't understand the meaning of these properties and methods.
P.S. I understand that it's a very silly question, but maybe it's because English is not my native language.
To complete sansuiso's answer, I suppose the reason for choosing these names is that in some applications we have a set of images (training images) beforehand, for example 10 images taken inside your office. The features can be extracted and the feature descriptors computed for these images, and at run-time an image is given to the system to query the trained database; hence the query image refers to this image. I really don't like the way they have named these parameters. When you have a pair of stereo images and you want to match the features, these names don't make sense, but you have to choose a convention, say, always call the left image the query image and the right image the training image. I did my PhD in computer vision, and some naming conventions in OpenCV seem really confusing/silly to me. So if you find these confusing or silly, you're not alone.
train: this function builds the classifier's internal state in order to make it operational. For example, think of training an SVM, or building a kd-tree from the reference data. Maybe you are confused because this step is often referred to as learning in the literature.
query is the action of finding the nearest neighbors of a set of points, and by extension it also refers to the whole set of points for which you want a nearest neighbor. Recall that you can ask for the neighbors of 1 point, or of a whole lot of points in the same function call (by stacking the feature points in a matrix).
trainIdx and queryIdx refer to the index of a point in the reference/query set respectively, i.e. you ask the matcher for the nearest point (stored at the trainIdx position) to some other point (stored at the queryIdx position). Of course, trainIdx is known only after the function call. If your points are stored in a matrix, the index will be the row of the considered feature.
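A small sketch showing where these names show up in code, assuming two placeholder images; m.queryIdx indexes the query set and m.trainIdx the train (reference) set:

```cpp
// Sketch: descQuery is the "query" set, descTrain the "train"/reference set,
// and each DMatch links one row of each by index.
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main()
{
    cv::Mat imgTrain = cv::imread("reference.jpg", cv::IMREAD_GRAYSCALE);  // known/reference image
    cv::Mat imgQuery = cv::imread("new.jpg",       cv::IMREAD_GRAYSCALE);  // image we ask about

    cv::Ptr<cv::ORB> orb = cv::ORB::create();
    std::vector<cv::KeyPoint> kpTrain, kpQuery;
    cv::Mat descTrain, descQuery;
    orb->detectAndCompute(imgTrain, cv::noArray(), kpTrain, descTrain);
    orb->detectAndCompute(imgQuery, cv::noArray(), kpQuery, descQuery);

    cv::BFMatcher matcher(cv::NORM_HAMMING);
    std::vector<cv::DMatch> matches;
    matcher.match(descQuery, descTrain, matches);  // match(queryDescriptors, trainDescriptors)

    for (const cv::DMatch& m : matches) {
        // m.queryIdx indexes kpQuery/descQuery, m.trainIdx indexes kpTrain/descTrain.
        cv::Point2f q = kpQuery[m.queryIdx].pt;
        cv::Point2f t = kpTrain[m.trainIdx].pt;
        std::cout << "query point " << q << " matched train point " << t << std::endl;
    }
    return 0;
}
```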
I understand "query" and "train" in a very naive but useful way:
"train": a data or image is preprocessed to get a database
"query": an input data or image that will be queried in the database which we trained before.
Hope it helps u as well.