Vision Framework with ARKit and CoreML

While researching best practices and experimenting with multiple options for an ongoing project (e.g. a Unity3D iOS project using Vuforia with native integration, or extracting frames with AVFoundation and passing each image through cloud-based image recognition), I have come to the conclusion that I would like to use ARKit, the Vision framework, and Core ML; let me explain.
I am wondering how I can capture ARFrames and use the Vision framework to detect and track a given object with a Core ML model.
Additionally, it would be nice to get a bounding box once the object is recognized, along with the ability to add an AR object on a touch gesture, but that can be implemented after the core project is solid.
This is undoubtedly possible, but I am unsure of how to pass the ARFrames to CoreML via Vision for processing.
Any ideas?

Update: Apple now has a sample code project that does some of these steps. Read on for the parts you still need to figure out yourself...
Just about all of the pieces are there for what you want to do... you mostly just need to put them together.
You obtain ARFrames either by periodically polling the ARSession for its currentFrame or by having them pushed to your session delegate. (If you're building your own renderer, that's ARSessionDelegate; if you're working with ARSCNView or ARSKView, their delegate callbacks refer to the view, so you can work back from there to the session to get the currentFrame that led to the callback.)
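For instance, a minimal sketch of both routes (the class name here is just a placeholder; you'd set an instance as the session's delegate and keep a strong reference to it):

```swift
import ARKit

// Minimal sketch: receiving ARFrames by delegation or by polling.
// "FrameReceiver" is a placeholder name; assign it to session.delegate.
class FrameReceiver: NSObject, ARSessionDelegate {

    // Pushed route: called for every new frame the session produces.
    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        process(frame)
    }

    // Polling route: call this from your own render/display-link loop.
    func pollCurrentFrame(of session: ARSession) {
        guard let frame = session.currentFrame else { return }
        process(frame)
    }

    private func process(_ frame: ARFrame) {
        // frame.capturedImage is the CVPixelBuffer used in the next step.
    }
}
```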
ARFrame provides the current capturedImage in the form of a CVPixelBuffer.
You pass images to Vision for processing using either the VNImageRequestHandler or VNSequenceRequestHandler class, both of which have methods that take a CVPixelBuffer as an input image to process.
You use the image request handler if you want to perform a request that uses a single image — like finding rectangles or QR codes or faces, or using a Core ML model to identify the image.
You use the sequence request handler to perform requests that involve analyzing changes between multiple images, like tracking an object's movement after you've identified it.
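Putting those pieces together for the single-image case might look roughly like this — `YourModel` is a placeholder for whatever Core ML model class Xcode generated for you, and the orientation value is an assumption you'd adjust for your UI:

```swift
import ARKit
import Vision

// Sketch of a single-image Core ML request on one ARFrame.
func classify(frame: ARFrame) throws {
    let model = try VNCoreMLModel(for: YourModel().model)   // placeholder model class

    let request = VNCoreMLRequest(model: model) { request, error in
        // Observations arrive here; see the result types discussed below.
        guard let results = request.results else { return }
        print(results)
    }
    request.imageCropAndScaleOption = .centerCrop

    // The image request handler takes the frame's pixel buffer directly.
    let handler = VNImageRequestHandler(cvPixelBuffer: frame.capturedImage,
                                        orientation: .right,  // portrait UI assumption
                                        options: [:])
    try handler.perform([request])
}
```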
You can find general code for passing images to Vision + Core ML attached to the WWDC17 session on Vision, and if you watch that session the live demos also include passing CVPixelBuffers to Vision. (They get pixel buffers from AVCapture in that demo, but if you're getting buffers from ARKit the Vision part is the same.)
One sticking point you're likely to have is identifying/locating objects. Most "object recognition" models people use with Core ML + Vision (including those that Apple provides pre-converted versions of on their ML developer page) are scene classifiers. That is, they look at an image and say, "this is a picture of a (thing)," not something like "there is a (thing) in this picture, located at (bounding box)".
Vision provides easy API for dealing with classifiers — your request's results array is filled in with VNClassificationObservation objects that tell you what the scene is (or "probably is", with a confidence rating).
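Reading those out could look like this (the confidence threshold is arbitrary):

```swift
import Vision

// Sketch: pulling the top classification out of a completed request.
func handleClassification(request: VNRequest) {
    guard let observations = request.results as? [VNClassificationObservation],
          let best = observations.first,        // results arrive sorted by confidence
          best.confidence > 0.3 else { return }
    print("Scene is probably a \(best.identifier) (confidence \(best.confidence))")
}
```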
If you find or train a model that both identifies and locates objects — and for that part, I must stress, the ball is in your court — using Vision with it will result in VNCoreMLFeatureValueObservation objects. Those are sort of like arbitrary key-value pairs, so exactly how you identify an object from those depends on how you structure and label the outputs from your model.
If you're dealing with something that Vision already knows how to recognize, instead of using your own model — stuff like faces and QR codes — you can get the locations of those in the image frame with Vision's API.
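For example, face detection hands you normalized bounding boxes directly (a quick sketch):

```swift
import Vision
import CoreGraphics

// Sketch: built-in face detection, returning normalized bounding boxes
// (origin at the bottom-left of the image, values in 0...1).
func detectFaceRects(in pixelBuffer: CVPixelBuffer) throws -> [CGRect] {
    let request = VNDetectFaceRectanglesRequest()
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try handler.perform([request])
    let faces = request.results as? [VNFaceObservation] ?? []
    return faces.map { $0.boundingBox }
}
```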
If after locating an object in the 2D image, you want to display 3D content associated with it in AR (or display 2D content, but with said content positioned in 3D with ARKit), you'll need to hit test those 2D image points against the 3D world.
Once you get to this step, placing AR content with a hit test is something that's already pretty well covered elsewhere, both by Apple and the community.
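Still, as a rough sketch of that last step with ARSCNView (the hit-test types and the placeholder node are my own choices; newer SDKs prefer raycast queries):

```swift
import ARKit
import SceneKit

// Sketch: turn a 2D screen point into a 3D position and drop a marker there.
func placeMarker(at screenPoint: CGPoint, in sceneView: ARSCNView) {
    let results = sceneView.hitTest(screenPoint,
                                    types: [.existingPlaneUsingExtent, .featurePoint])
    guard let nearest = results.first else { return }

    let marker = SCNNode(geometry: SCNSphere(radius: 0.01))   // 1 cm placeholder sphere
    marker.simdTransform = nearest.worldTransform
    sceneView.scene.rootNode.addChildNode(marker)
}
```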

Related

Stitching a full spherical mosaic using only a smartphone and sensor data?

I'm really interested in the Google Street View mobile application, which integrates a method to create a fully functional spherical panorama using only your smartphone camera. (Here's the procedure for anyone interested: https://www.youtube.com/watch?v=NPs3eIiWRaw)
What strikes me the most is that it always manages to create the full sphere, even when stitching a featureless, near-monochrome blue sky or ceiling; which gets me thinking that they're not using feature-based matching.
Is it possible to get a decent quality full spherical mosaic without using feature based matching and only using sensor data? Are smartphone sensors precise enough? What library would be usable to do this? OpenCV? Something else?
Thanks!
The features are needed for registration. In the app, the clever UI makes sure they already know where each photo sits relative to the sphere, so in the extreme case all they have to do is reproject/warp and blend. No additional geometry processing is needed.
I would assume that they do try to make some small corrections to improve the registration, but even if those fail, you can fall back on the sensor-based estimates acquired at capture time.
This is a case where a clever UI makes the vision problem significantly easier.
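As a sketch of what "sensor-based registration" could mean on the capture side — this is an assumption about the kind of pipeline involved, not how Street View actually does it — you could tag each photo with the device attitude from Core Motion and map yaw/pitch to a position on the panorama sphere:

```swift
import CoreMotion

// Sketch: tag each captured photo with the device attitude, so you already know
// roughly where it sits on the sphere before any optional feature-based refinement.
let motionManager = CMMotionManager()

func startAttitudeTagging() {
    guard motionManager.isDeviceMotionAvailable else { return }
    motionManager.deviceMotionUpdateInterval = 1.0 / 30.0
    motionManager.startDeviceMotionUpdates(using: .xMagneticNorthZVertical,
                                           to: .main) { motion, _ in
        guard let attitude = motion?.attitude else { return }
        // Treat yaw as pan (longitude) and pitch as tilt (latitude) on the sphere.
        let pan = attitude.yaw
        let tilt = attitude.pitch
        // Store (pan, tilt, roll) alongside the next captured frame; warping and
        // blending then only need to refine these initial placements.
        _ = (pan, tilt, attitude.roll)
    }
}
```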

Using OpenCV to touch and select object

I'm using the OpenCV framework in an iOS Xcode project (Objective-C). Is there a way to process the image feed from the video camera, let the user touch an object on the screen, and then use some OpenCV functionality to highlight it?
Here is, graphically, what I mean. The first image shows an example of what the user might see in the video feed:
Then, when they tap on the screen on the iPad, I want to use OpenCV feature/object detection to process the area they've tapped and highlight it. It would look something like this if they tapped the iPad:
Any ideas on how this could be achieved with OpenCV in Objective-C?
I can see quite easily how we could achieve this using trained templates of the iPad and matching them with OpenCV algorithms, but I want to make it dynamic, so users can just touch anything on the screen and we'll take it from there.
Explanation: why we should use a segmentation approach
As I understand it, the task you are trying to solve is segmentation of objects, regardless of their identity.
The object recognition approach is one way to do it, but it has two major downsides:
It requires you to train an object classifier and to collect a dataset with a respectable number of examples of the objects you would like to recognize. If you choose a classifier that is already trained, it won't necessarily work on every type of object you would like to detect.
Most object recognition solutions find a bounding box around the recognized object, but they don't perform a complete segmentation of it. The segmentation part requires extra effort.
Therefore, I believe the best way for your case is to use an image segmentation algorithm. More precisely, we'll use the GrabCut segmentation algorithm.
The GrabCut algorithm
This is an iterative algorithm with two stages:
Initial stage: the user specifies a bounding box around the object.
Given this bounding box, the algorithm estimates the color distributions of the foreground (the object) and the background using GMMs, followed by a graph-cut optimization that finds the optimal boundary between foreground and background.
In the next stage, the user may correct the segmentation if needed by supplying scribbles on the foreground and the background. The algorithm updates the model accordingly and performs a new segmentation based on the updated information.
Using this approach has pros and cons.
The pros:
The segmentation algorithm is easy to implement with OpenCV.
It enables the user to fix segmentation errors if needed.
It doesn't rely on collecting a dataset and training a classifier.
The main con is that you will need an extra source of information from the user besides a single tap on the screen: a bounding box around the object, and in some cases additional scribbles to correct the segmentation.
Code
Luckily, there is an implementation of this algorithm in OpenCV. The OpenCV repository (under the Itseez organization) includes a simple, easy-to-use sample of the GrabCut algorithm, which can be found here: https://github.com/Itseez/opencv/blob/master/samples/cpp/grabcut.cpp
Application usage:
The application receives a path to an image file as a command-line argument. It renders the image on screen, and the user is required to supply an initial bounding rectangle.
The user can press 'n' to perform the segmentation for the current iteration, or press 'r' to revert the operation.
After choosing a rectangle, the segmentation is calculated. If the user wants to correct it, they can add foreground or background scribbles with Shift+left-click and Ctrl+left-click respectively.
Examples
Segmenting the iPod:
Segmenting the pen:
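The desktop sample gets its rectangle from mouse drags; on iOS the same information could come from a tap that seeds a default-sized box around the touch point. A hypothetical Swift sketch of just that UI step (the segmentation itself would still be OpenCV's GrabCut, reached through an Objective-C++ bridge, so `segmenter` below is only a placeholder):

```swift
import UIKit

// Hypothetical view controller: a tap seeds a default-sized GrabCut rectangle.
class SelectionViewController: UIViewController {
    @IBOutlet var imageView: UIImageView!          // shows the camera frame / photo
    private let defaultBoxSize = CGSize(width: 200, height: 200)

    override func viewDidLoad() {
        super.viewDidLoad()
        imageView.isUserInteractionEnabled = true
        let tap = UITapGestureRecognizer(target: self, action: #selector(handleTap(_:)))
        imageView.addGestureRecognizer(tap)
    }

    @objc private func handleTap(_ gesture: UITapGestureRecognizer) {
        let point = gesture.location(in: imageView)
        // Center a default box on the tap and clamp it to the visible image.
        let rect = CGRect(x: point.x - defaultBoxSize.width / 2,
                          y: point.y - defaultBoxSize.height / 2,
                          width: defaultBoxSize.width,
                          height: defaultBoxSize.height)
            .intersection(imageView.bounds)

        // Convert `rect` to image coordinates and hand it to the GrabCut wrapper:
        // segmenter.segment(imageView.image, initialRect: rect)   // placeholder call
        print("Initial GrabCut rect: \(rect)")
    }
}
```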
You can do it by training a classifier on iPad images using OpenCV's Haar cascade classifiers and then detecting the iPad in a given frame.
Then, based on the coordinates of the touch, check whether that area overlaps the detected iPad region. If it does, draw a bounding box on the detected object; from there you can proceed with processing the detected iPad image.
Repeat the above procedure for each type of object you want to detect.
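The overlap check itself is straightforward once you have the detections as rectangles in view coordinates; a hedged Swift sketch (it assumes the rects come from the Haar cascade step above):

```swift
import UIKit

// Sketch: given detection rectangles already converted to view coordinates,
// highlight the one the user touched.
func highlightDetection(at touchPoint: CGPoint,
                        detections: [CGRect],
                        in view: UIView) {
    guard let hit = detections.first(where: { $0.contains(touchPoint) }) else { return }

    let highlight = CAShapeLayer()
    highlight.path = UIBezierPath(rect: hit).cgPath
    highlight.strokeColor = UIColor.red.cgColor
    highlight.fillColor = UIColor.clear.cgColor
    highlight.lineWidth = 3
    view.layer.addSublayer(highlight)
}
```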
The task you are trying to solve is known as "object proposal" generation. It doesn't work very accurately yet, and these results are very new.
These two articles give you a good overview of methods for this:
https://pdollar.wordpress.com/2013/12/10/a-seismic-shift-in-object-detection/
https://pdollar.wordpress.com/2013/12/22/generating-object-proposals/
For state-of-the-art results, look for the latest CVPR papers on object proposals. Quite often they have code available to test.

OpenCV to Identify objects from training video set and then test them against another video

I have been tasked to use OpenCV and C++ to:
Read a set of videos to create a set of images for learning.
Classify objects seen in the videos.
Label the images.
Test against a series of test videos to check that objects were identified as expected; draw a rectangle around them and label them.
I am new to OpenCV but happy to program in C++ as soon as an approach is formed. I am also planning to write my own functions at a later stage.
I need your help in forming the right solution approach, as I have to identify household objects (cup, soft toy, phone, camera, keyboard) from a stream of video and then test against another stream of video. The original video has depth information as well, but I am not sure how to use it to my benefit.
Read about support vector machines (SVM), feature extraction (e.g. SIFT/SURF), SVM training and SVM testing. And for drawing the rectangle, read about findContours() and drawContours() in OpenCV.
Approach:
Detect objects (e.g. car/plane etc.) and store the points of their contours.
Extract some features of that object using SIFT/SURF.
Based upon the extracted features, classify the object using the SVM (the input to the SVM is the extracted features).
If the SVM says "yes, it is a car", draw a rectangle around it using the contour points you stored in the first step.

How to make motion history image for presentation into one single image?

I am working on a project on gesture recognition. Now I want to prepare a presentation in which I can only show images. I have a series of images defining a gesture, and I want to show them in a single image, just like motion history images are shown in the literature.
My question is simple: which functions in OpenCV can I use to make a motion history image from, let's say, 10 or more images defining the motion of a hand?
As an example I have the following image, and I want to show the hand's location, with opacity directly dependent on the time reference.
I tried using GIMP to merge layers with different opacity to do the same thing; however, the output is not good.
You could use cv::updateMotionHistory.
OpenCV also demonstrates its usage in samples/c/motempl.c.
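If all you need is the single presentation image rather than a live MHI, a plain time-weighted alpha composite of the frames gives a similar visual effect. This is an alternative to the OpenCV route, sketched here in Swift/UIKit (it assumes all frames have equal size):

```swift
import UIKit

// Alternative sketch: blend a sequence of equally sized frames into one image,
// drawing later frames more opaquely so opacity encodes time, similar to the
// motion history images shown in the literature.
func motionHistoryComposite(frames: [UIImage]) -> UIImage? {
    guard let first = frames.first else { return nil }
    let renderer = UIGraphicsImageRenderer(size: first.size)
    return renderer.image { _ in
        for (index, frame) in frames.enumerated() {
            // Oldest frame is faintest, newest is fully opaque.
            let alpha = CGFloat(index + 1) / CGFloat(frames.count)
            frame.draw(in: CGRect(origin: .zero, size: first.size),
                       blendMode: .normal,
                       alpha: alpha)
        }
    }
}
```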

Design of virtual trial room

As part of my master's project I proposed to build a virtual trial room application intended for retail clothing stores. Currently it's meant to be used directly in store, though it may be extended to online stores as well.
This application will show customers how a selected piece of apparel would look on them by showing it on their 3D replica on screen.
It involves three steps:
Sizing up the customer
Building a 3D humanoid replica of the customer
Applying simulated clothing to the model
My question is about the feasibility of the project and choice of framework.
Can this be achieved in real time using a normal desktop computer? If yes, what would be an appropriate framework (hardware, software, programming language, etc.) for this purpose?
Based on the work I have done so far, I was planning to achieve the above steps in the following ways:
For step 1: option a) two cameras for front and side views, or
option b) one or two Kinects for complete 3D data.
For step 2: either use MakeHuman (http://www.makehuman.org/) code to build a customised 3D model from the above data, or build everything from scratch; I am unsure about the framework.
For step 3: I just need a few cloth samples, so I thought of building the simulated clothes in Blender.
Currently I have just a vague idea about the different pieces, but I am not sure how to develop the complete application.
Theoretically this can be achieved in real time. Many useful algorithms for video tracking, stereo vision and 3D reconstruction are available in the OpenCV library. But it's very difficult to build a robust solution. For example, you'll probably need to track a human body that moves from frame to frame and perform pose estimation (OpenCV contains the POSIT algorithm), and it's not trivial to eliminate noise in the resulting object coordinates. For inspiration, see some of the nice work on video tracking.
You might want to choose another way: simplify some things, avoid complicated stuff, do things less dynamically, and estimate only the clothing size and approximate human location. In that case you will most likely create something useful and interesting.
I've lost the link to one online fitting room where hand and body detection were implemented. Using a Kinect solves many problems. But if for some reason you won't use it, then AR (augmented reality) can help you (yet another fitting room).