I'm developing an object tracking system using Google's Vertex AI AutoML Video Tracking. We currently have an accurate model that identifies objects per frame (as a still image), and I'm exploring models that may gain further insight and accuracy by using a sequence of frames (video) for classification and tracking. I want to learn more about the architecture used in AutoML Object Tracking, but all I can find is articles hyping up the dynamic nature of the architecture. Mainly, I'm trying to answer the following 3 questions:
What methods does AutoML Object Tracking use to classify objects and track them? Are the classifications done frame by frame, with a Euclidean distance tracker linking objects across frames? Or are the objects identified and classified across multiple frames by a recurrent network operating in space (image) and time (frame to frame), something like an LSTM?
What performance can object tracking in AutoML achieve that is better than their image object identification models?
Where can I go to learn more about the model architectures on Vertex AI? It's hard to know which Google publications are associated with their current platform.
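To make the first option concrete, here is a minimal sketch of what I mean by a "Euclidean distance tracker" on top of per-frame detections. This only illustrates the idea; it is not a claim about what AutoML does internally. The function name `associate`, the `max_dist` parameter, and its default are all hypothetical.

```python
import math

def associate(prev_tracks, detections, max_dist=50.0):
    """Greedy nearest-centroid matching between two consecutive frames.

    prev_tracks: dict of track_id -> (x, y) centroid from the previous frame
    detections:  list of (x, y) centroids detected in the current frame
    Returns a dict of track_id -> (x, y) for the current frame.
    """
    # All candidate (distance, track_id, detection_index) pairs, closest first.
    pairs = sorted(
        (math.dist(c, d), tid, i)
        for tid, c in prev_tracks.items()
        for i, d in enumerate(detections)
    )
    tracks, used = {}, set()
    for dist, tid, i in pairs:
        if dist > max_dist:
            break  # every remaining pair is farther apart than the gate
        if tid in tracks or i in used:
            continue  # this track or this detection is already matched
        tracks[tid] = detections[i]
        used.add(i)
    # Unmatched detections start new tracks. A real tracker would keep a
    # global id counter so ids of dropped tracks are never reused.
    next_id = max(prev_tracks, default=-1) + 1
    for i, d in enumerate(detections):
        if i not in used:
            tracks[next_id] = d
            next_id += 1
    return tracks
```

Tracks from the previous frame that find no detection within `max_dist` are simply dropped here; a recurrent or otherwise temporal model would presumably be able to do better through occlusions, which is part of what I'm asking about.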
Any feedback is greatly appreciated!!!
Related
While researching best practices and experimenting with multiple options for an ongoing project (e.g., a Unity3D iOS project in Vuforia with native integration, or extracting frames with AVFoundation and then passing the image through cloud-based image recognition), I have come to the conclusion that I would like to use ARKit, the Vision framework, and CoreML; let me explain.
I am wondering how I can capture ARFrames and use the Vision framework to detect and track a given object using a CoreML model.
Additionally, it would be nice to have a bounding box once the object is recognized, with the ability to add an AR object upon a touch gesture, but this is something that could be implemented after getting the core project working.
This is undoubtedly possible, but I am unsure of how to pass the ARFrames to CoreML via Vision for processing.
Any ideas?
Update: Apple now has a sample code project that does some of these steps. Read on for those you still need to figure out yourself...
Just about all of the pieces are there for what you want to do... you mostly just need to put them together.
You obtain ARFrames either by periodically polling the ARSession for its currentFrame or by having them pushed to your session delegate. (If you're building your own renderer, that's ARSessionDelegate; if you're working with ARSCNView or ARSKView, their delegate callbacks refer to the view, so you can work back from there to the session to get the currentFrame that led to the callback.)
ARFrame provides the current capturedImage in the form of a CVPixelBuffer.
You pass images to Vision for processing using either the VNImageRequestHandler or VNSequenceRequestHandler class, both of which have methods that take a CVPixelBuffer as an input image to process.
You use the image request handler if you want to perform a request that uses a single image — like finding rectangles or QR codes or faces, or using a Core ML model to identify the image.
You use the sequence request handler to perform requests that involve analyzing changes between multiple images, like tracking an object's movement after you've identified it.
You can find general code for passing images to Vision + Core ML attached to the WWDC17 session on Vision, and if you watch that session the live demos also include passing CVPixelBuffers to Vision. (They get pixel buffers from AVCapture in that demo, but if you're getting buffers from ARKit the Vision part is the same.)
One sticking point you're likely to have is identifying/locating objects. Most "object recognition" models people use with Core ML + Vision (including those that Apple provides pre-converted versions of on their ML developer page) are scene classifiers. That is, they look at an image and say, "this is a picture of a (thing)," not something like "there is a (thing) in this picture, located at (bounding box)".
Vision provides easy API for dealing with classifiers — your request's results array is filled in with VNClassificationObservation objects that tell you what the scene is (or "probably is", with a confidence rating).
If you find or train a model that both identifies and locates objects — and for that part, I must stress, the ball is in your court — using Vision with it will result in VNCoreMLFeatureValueObservation objects. Those are sort of like arbitrary key-value pairs, so exactly how you identify an object from those depends on how you structure and label the outputs from your model.
If you're dealing with something that Vision already knows how to recognize, instead of using your own model — stuff like faces and QR codes — you can get the locations of those in the image frame with Vision's API.
If after locating an object in the 2D image, you want to display 3D content associated with it in AR (or display 2D content, but with said content positioned in 3D with ARKit), you'll need to hit test those 2D image points against the 3D world.
Once you get to this step, placing AR content with a hit test is something that's already pretty well covered elsewhere, both by Apple and the community.
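One small detail when going from a Vision result to a hit test point is the coordinate convention. The sketch below is plain arithmetic written in Python rather than Swift, and it rests on two assumptions you should check against Apple's documentation: that Vision reports bounding boxes in normalized coordinates with the origin at the lower left, and that the frame-level hit test expects a normalized image point with the origin at the top left. The function name is hypothetical, and image orientation is ignored.

```python
def bbox_center_for_hit_test(bbox):
    """Convert a normalized bounding box to a normalized hit test point.

    bbox: (x, y, width, height) in [0, 1], origin at the LOWER left
          (assumed to be the convention Vision uses).
    Returns (x, y) in [0, 1] with the origin at the TOP left
          (assumed to be the convention the hit test expects).
    """
    x, y, w, h = bbox
    center_x = x + w / 2.0
    center_y = y + h / 2.0
    # Flip the vertical axis to move the origin from bottom to top.
    return (center_x, 1.0 - center_y)
```

If you hit test through the view instead of the frame, you would additionally need to scale the point to view coordinates and account for how the camera image is cropped and rotated to fill the view.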
I have a task where I am asked to track parcels (carton boxes) of different dimensions moving on a conveyor.
I am using an Asus Xtion Pro camera, and I have also been asked to use point cloud data to detect and track the objects on the conveyor.
I have read about many model-based methods where we need to create a model of the object to be detected and then perform keypoint extraction, feature matching, and many other steps between the scene and the model.
Since I am using boxes of different dimensions, it seems I would need a model of each of them to match against the scene.
My question: can I have one common point cloud model of a box of some dimension and compare it with any box that comes into the camera's view? In other words, can I have one scalable model to compare with a box of any dimension? Is that possible?
My supervisor doesn't want the project to depend on many models for detection and tracking. It seems one common model that is scalable or parameterized should do the trick.
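To illustrate what I mean by a parameterized model: instead of matching against a stored cloud, the box could be described only by (length, width, height) estimated from the points themselves. The sketch below is a rough idea under strong assumptions, not a tested pipeline: the points of one box are already segmented from the belt, the z axis is the conveyor plane normal, and the cluster contains points down to belt level. The function name `box_dimensions` is hypothetical.

```python
import math

def box_dimensions(points):
    """Estimate (length, width, height) of a box from its 3D points.

    points: list of (x, y, z) tuples for ONE segmented object, with z
    measured along the normal of the conveyor plane (an assumption).
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # 2x2 covariance of the footprint in the conveyor plane.
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the footprint's principal axis (closed-form 2D PCA).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Rotate the footprint into the box's own frame and measure extents.
    u = [(p[0] - mx) * c + (p[1] - my) * s for p in points]
    v = [-(p[0] - mx) * s + (p[1] - my) * c for p in points]
    z = [p[2] for p in points]
    a, b = max(u) - min(u), max(v) - min(v)
    return max(a, b), min(a, b), max(z) - min(z)
```

With a single depth camera only the top and perhaps one or two sides are visible, so in practice the height would more likely come from the distance between the top face and a known belt plane, and the PCA orientation becomes ambiguous for nearly square footprints.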
Thanks in advance
I have written an object classification program using BoW clustering and SVM classification. The program runs successfully. Now that I can classify the objects, I want to track them in real time by drawing a bounding rectangle/circle around them. I have done some research and came up with the following ideas.
1) Use homography with the training images from the train data directory. The problem with this approach is that the training image must be exactly the same object as in the test image. Since I'm not detecting specific objects, the test images are closely related to the training images but not necessarily an exact match. With homography we find a known object in a test scene. Please correct me if I am wrong about homography.
2) Use feature tracking. I'm planning to extract the SIFT features in the test images that are similar to those in the training images and then track them by drawing a bounding rectangle/circle. The issue here is: how do I know which features are from the object and which are from the environment? Is there any member function in the SVM class that can return the keypoints or region of interest used to classify the object?
Thank you
I am trying to develop an automatic (or semi-automatic) image annotator for my final year project with OpenCV. I have been studying many OpenCV resources and have come across cascade classification for training and detection. I understood that part, and I also tried the face detection tutorial provided with OpenCV. So now I know how to train a detector and detect objects.
However, I still cannot understand how I can annotate the objects present in the image.
For example, the system will show that this is an object, but I want the system to show that it is a ball. How can I accomplish that?
Thanks in advance.
A single binary classifier (detector) separates objects into two classes:
positive - the object type the classifier was trained for,
and negative - everything else.
If you need to detect several distinct classes, you should use one detector per class, or you can train a multiclass classifier (a "one vs. all" scheme, for example), but that usually runs slower and is less accurate (because a detector is better at finding similar objects). You can also take a look at convolutional networks (by Yann LeCun).
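The "one detector per class" idea can be sketched as follows. This is only an outline: each entry in `detectors` is a hypothetical callable standing in for one trained binary detector that returns a confidence in [0, 1], and the `threshold` value is an arbitrary assumption.

```python
def label_region(region, detectors, threshold=0.5):
    """Return the best-scoring class label for a region, or None.

    detectors: dict of label -> callable(region) returning a confidence
    in [0, 1]. Each callable stands in for one trained binary detector.
    """
    if not detectors:
        return None
    # Run every per-class detector on the same region.
    scores = {label: detect(region) for label, detect in detectors.items()}
    best = max(scores, key=scores.get)
    # Reject the region if even the best detector is not confident enough.
    return best if scores[best] >= threshold else None
```

The annotation is then just the key of the winning detector, for example `"ball"`, which you can draw next to the detected rectangle.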
This is a very hard task. I suggest simplifying it by using the latent SVM detector and limiting yourself to the models it supplies:
http://docs.opencv.org/modules/objdetect/doc/latent_svm.html
What is the best approach for face detection/tracking in the following scenario:
when a person enters the scene/frame, they should be detected and recognized in every subsequent frame until they leave the scene.
it should also be able to do this for multiple users at once.
I have experience with Viola-Jones detection and Fisherfaces recognition. But I've only used Fisherfaces recognition with a previously prepared training set, and now I need something that works for any user who enters the scene.
I am also interested in different solutions.
I used OpenCV face detection for multiple faces together with the Rekognition API (http://rekognition.com): I pushed the detected faces to it and retrained the dataset frequently. It was lightweight on our side, but I am sure there are more robust solutions for this.
Have you tried VideoSurveillance? Also known as OpenCV blob tracker.
It's a motion-based tracker with data association across frames (1). If you want to replace motion with face detection, you must adjust the code by replacing the foreground mask with detection responses. This approach is called tracking-by-detection in the literature.
(1) "Appearance Models for Occlusion Handling", Andrew Senior et al.
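The association step with detection responses can be sketched roughly like this. It is a simplified greedy overlap matcher, not the blob tracker's actual code or the method from the cited paper; the function names and the `min_iou` threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_faces(tracks, detections, min_iou=0.3):
    """Greedily assign this frame's face detections to existing tracks.

    tracks:     dict of track_id -> last known box (x1, y1, x2, y2)
    detections: list of boxes from the face detector for the current frame
    Returns (matches, new): matches maps track_id -> detection index,
    new lists the indices of detections that should start new tracks.
    """
    # Candidate pairs sorted by overlap, best first.
    candidates = sorted(
        ((iou(box, det), tid, i)
         for tid, box in tracks.items()
         for i, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, i in candidates:
        if score < min_iou:
            break  # the remaining pairs overlap even less
        if tid in matches or i in used:
            continue
        matches[tid] = i
        used.add(i)
    new = [i for i in range(len(detections)) if i not in used]
    return matches, new
```

Detections listed in `new` are people entering the scene, so that is the point where you would run recognition or enroll a new identity; tracks with no match for several frames can be treated as people who left.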