How can I implement object recognition in my app? - computer-vision

Our aim is to develop an automotive app which automates and standardizes photo shooting of newly arrived cars at a dealership. Basically, our Mavic 2 Pro takes off, orbits the vehicle, shoots a photo every 90 degrees and then lands at its original position. The orbiting radius is approximately 4.5 m.
Since the shooting scene is small (and GPS might not be available when shooting indoors), we would like to rely more on the built-in object recognition as implemented in ActiveTrack missions. We currently have an app based on a waypoint mission, but it is not accurate. So my questions are:
1. Can anyone point us in the direction of how to implement object recognition in our app?
2. If recognition is not available, how can we ensure consistency of the output? While testing, the EXIF info of the photos sometimes shows a compass deviation of up to 4 degrees, which results in the object being out of view.
Thanks for advice,
Mirek

What you need is object recognition and tracking like YOLO.
The resources are here.
https://pjreddie.com/darknet/yolo/
Then you need to train the model to recognize cars, so you need a bunch of labeled photos where the car's bounding box has already been annotated correctly.
Downstream, you then want to have the camera connect to a computer and stream the video, have the tracking algorithm recognize the bounding box of the car and, depending on its center and size, maneuver the drone accordingly.
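If the app ends up doing the detection on the phone itself (Swift is used for the sketches on this page, since the later answers are iOS-centric), Vision can run a converted detector directly on each video frame. This is only a minimal sketch, assuming iOS 12+ and that you have trained YOLO on your labeled car photos and converted the weights to a Core ML object-detection model; CarDetector below is a hypothetical name for the Xcode-generated model class:

```swift
import Vision
import CoreML

/// Minimal per-frame car detector. Assumption: `CarDetector` is the class Xcode
/// generates for your converted YOLO .mlmodel; substitute your real model class.
final class CarBoxFinder {
    private let visionModel: VNCoreMLModel

    init() throws {
        let coreMLModel = try CarDetector(configuration: MLModelConfiguration()).model
        visionModel = try VNCoreMLModel(for: coreMLModel)
    }

    /// Returns the highest-confidence bounding box for one video frame,
    /// in normalized [0, 1] image coordinates, or nil if nothing was found.
    func detectCar(in pixelBuffer: CVPixelBuffer) -> CGRect? {
        var bestBox: CGRect?
        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            // Object-detection models report their boxes as VNRecognizedObjectObservation.
            let best = (request.results as? [VNRecognizedObjectObservation])?
                .max { $0.confidence < $1.confidence }
            bestBox = best?.boundingBox
        }
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        try? handler.perform([request])   // synchronous; the completion runs before this returns
        return bestBox
    }
}
```

From the returned box's center and size you can then derive the yaw/altitude corrections that keep the car framed, which is the "maneuver the drone accordingly" part.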

Related

How do ARCore or ARKit produce real-time augmentations of live video?

So a while back about a year ago I was interested in building my own barebones augmented reality (AR) library. My goal was to be able to take a video of something (anything really) and then be able to place augmentations (3D objects that weren't really there) in the video. So for example I might take a video of my living room and then, through this AR library/tool, I'd be able to add in maybe a 3D avatar of a monster sitting on top of my coffee table. So, knowing absolutely nothing about the subject or computer vision in general, I had settled for the following strategy:
Use 3D reconstruction tools/techniques (Structure from Motion, or SfM) to build up a 3D model of everything in the video (e.g. a 3D model of my living room)
Analyze that 3D model (really a 3D pointcloud to be exact) for flat surfaces
Add my own logic to determine what objects (3D models such as Blender files, etc.) to place in what area of the video's 3D model (e.g. monster standing on top of the coffee table)
The hardest part: inferring the camera orientation in each frame of the video, and then figuring out how to orient the augmentation (e.g. monster) correctly based on what the camera is pointed at, and then "merging" the augmentation's 3D model into the main video 3D model. This means that as the camera moves around my living room, the monster appears to remain standing in the same place on my coffee table. I never figured out a good solution for this but figured if I could get to this 4th step that I'd find some solution.
After several difficult weeks (computer vision is hard!) I got the following pipeline of tools to work with mixed success:
I was able to feed sample frames of a video (e.g. a video taken while walking around my living room) into OpenMVG and produce a sparse pointcloud PLY file/model of it
Then I was able to feed that PLY file into MVE and produce a dense pointcloud (again PLY file) of it
Then I fed the dense pointcloud and the original frames into mvs-texturing to produce a textured 3D model of my video
About 30% of the time, this pipeline worked amazingly well! Here's the model of the front of my house. You can see my 3D front yard, my son's 3D playhouse and even kinda/sorta make out windows and doors!
About 70% of the time the pipeline failed with indecipherable errors, or produced something that looked like an abstract painting. Additionally, even with automated scripting involved, it took the tooling about 30 mins to produce the final 3D textured model... so pretty slow.
Well, it looks like Google ARCore and Apple ARKit beat me to the punch! These frameworks can take live video feeds from your smartphone and accomplish exactly what I had been trying to accomplish about a year ago: real-time 3D AR. Very similar to (but much more advanced and interactive than) Pokémon Go. Take a video of your living room, and voilà, an animated monster is sitting on your coffee table, and you can interact with it. Very, very, very cool stuff.
My question
I'm jealous! Of course, Google and Apple can hire some best-in-show CV/3D recon folks, but I'm still jealous!!! I'm curious if there are any hardcore AR/CV/3D recon gurus out there who either have insider knowledge or just know the AR landscape so well that they can speak to what kind of tooling/pipeline/technology is going on behind the scenes here with ARCore or ARKit. Because I practically broke my brain trying to figure this out on my own, and I failed spectacularly.
Was my strategy (explained above) ballpark-accurate, or way off base? (Again: 3D recon of video -> surface analysis -> frame-by-frame camera analysis, model merge)?
What kind of tooling/libraries/techniques are at play here?
How do they accomplish this in real time, whereas my 3D recon, even when it worked, took 30+ mins to be processed and generated?
Thanks in advance!
I understand your jealousy; as a computer vision engineer I have experienced it many times before :-).
The key to AR on mobile devices is the fusion of computer vision and inertial tracking (the phone's IMU, i.e. its gyroscope and accelerometer).
Quote from Apple's ARKit documentation:
"ARKit uses a technique called visual-inertial odometry. This process combines information from the iOS device's motion sensing hardware with computer vision analysis of the scene visible to the device's camera."
Quote from Google's ARCore documentation:
"The visual information is combined with inertial measurements from the device's IMU to estimate the pose (position and orientation) of the camera relative to the world over time."
The problem with this approach is that you have to know every single detail about your camera and IMU sensor. They have to be calibrated and synced together. No wonder it is easier for Apple than for the common developer. And this is also the reason why Google only supports a couple of phones for the ARCore preview.
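The good news is that you don't have to implement that fusion yourself to benefit from it: ARKit hands you the fused pose on every frame. A minimal sketch (assuming iOS 11+ and an ARKit-capable device):

```swift
import ARKit

// ARKit's world tracking *is* the camera + IMU fusion described above,
// and the visual-inertially estimated pose arrives with every frame.
final class PoseLogger: NSObject, ARSessionDelegate {
    let session = ARSession()

    func start() {
        session.delegate = self
        session.run(ARWorldTrackingConfiguration())
    }

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        // 4x4 camera-to-world transform estimated by visual-inertial odometry.
        let position = frame.camera.transform.columns.3
        print("camera at", position.x, position.y, position.z)
    }
}
```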

Stitching a full spherical mosaic using only a smartphone and sensor data?

I'm really interested in the Google Street View mobile application, which integrates a method to create a fully functional spherical panorama using only your smartphone camera. (Here's the procedure for anyone interested: https://www.youtube.com/watch?v=NPs3eIiWRaw)
What strikes me the most is that it always manages to create the full sphere, even when stitching a featureless, near-monochrome blue sky or ceiling, which got me thinking that they're not using feature-based matching.
Is it possible to get a decent-quality full spherical mosaic without feature-based matching, using only sensor data? Are smartphone sensors precise enough? What library would be usable to do this? OpenCV? Something else?
Thanks!
The features are needed for registration. In the app, the clever UI ensures they already know where each photo sits relative to the sphere, so in the extreme case all they have to do is reproject/warp and blend. No additional geometry processing is needed.
I would assume they do try some small corrections to improve the registration, but even if these fail, you can fall back on the sensor-based estimates acquired at capture time.
This is a case where a clever UI makes the vision problem significantly easier.
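If you want to experiment with the sensor-only route yourself on iOS (Swift, to match the other sketches on this page), Core Motion gives you the attitude to tag each shot with; this sketch only covers the capture-time bookkeeping, and the reproject/warp/blend step is left to you:

```swift
import CoreMotion

// Record the device attitude at each shutter press, so every photo's placement
// on the sphere is known from sensors alone (no feature matching required).
final class AttitudeTagger {
    private let motion = CMMotionManager()

    func start() {
        // A gravity-corrected reference frame keeps "up" consistent across shots.
        motion.startDeviceMotionUpdates(using: .xArbitraryCorrectedZVertical)
    }

    /// Call when a photo is taken; the 3x3 rotation matrix is what you later use
    /// to reproject/warp that photo onto the spherical canvas before blending.
    func attitudeForCurrentShot() -> CMRotationMatrix? {
        return motion.deviceMotion?.attitude.rotationMatrix
    }
}
```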

Vision Framework with ARKit and CoreML

While researching best practices and experimenting with multiple options for an ongoing project (i.e. a Unity3D iOS project in Vuforia with native integration, extracting frames with AVFoundation and then passing the images through cloud-based image recognition), I have come to the conclusion that I would like to use ARKit, the Vision framework, and Core ML; let me explain.
I am wondering how I would be able to capture ARFrames and use the Vision framework to detect and track a given object using a Core ML model.
Additionally, it would be nice to have a bounding box once the object is recognized, with the ability to add an AR object upon a touch gesture, but that is something that could be implemented after getting the core project working.
This is undoubtedly possible, but I am unsure of how to pass the ARFrames to CoreML via Vision for processing.
Any ideas?
Update: Apple now has a sample code project that does some of these steps. Read on for those you still need to figure out yourself...
Just about all of the pieces are there for what you want to do... you mostly just need to put them together.
You obtain ARFrames either by periodically polling the ARSession for its currentFrame or by having them pushed to your session delegate. (If you're building your own renderer, that's ARSessionDelegate; if you're working with ARSCNView or ARSKView, their delegate callbacks refer to the view, so you can work back from there to the session to get the currentFrame that led to the callback.)
ARFrame provides the current capturedImage in the form of a CVPixelBuffer.
You pass images to Vision for processing using either the VNImageRequestHandler or VNSequenceRequestHandler class, both of which have methods that take a CVPixelBuffer as an input image to process.
You use the image request handler if you want to perform a request that uses a single image — like finding rectangles or QR codes or faces, or using a Core ML model to identify the image.
You use the sequence request handler to perform requests that involve analyzing changes between multiple images, like tracking an object's movement after you've identified it.
You can find general code for passing images to Vision + Core ML attached to the WWDC17 session on Vision, and if you watch that session the live demos also include passing CVPixelBuffers to Vision. (They get pixel buffers from AVCapture in that demo, but if you're getting buffers from ARKit the Vision part is the same.)
One sticking point you're likely to have is identifying/locating objects. Most "object recognition" models people use with Core ML + Vision (including those that Apple provides pre-converted versions of on their ML developer page) are scene classifiers. That is, they look at an image and say, "this is a picture of a (thing)," not something like "there is a (thing) in this picture, located at (bounding box)".
Vision provides easy API for dealing with classifiers — your request's results array is filled in with VNClassificationObservation objects that tell you what the scene is (or "probably is", with a confidence rating).
If you find or train a model that both identifies and locates objects — and for that part, I must stress, the ball is in your court — using Vision with it will result in VNCoreMLFeatureValueObservation objects. Those are sort of like arbitrary key-value pairs, so exactly how you identify an object from those depends on how you structure and label the outputs from your model.
If you're dealing with something that Vision already knows how to recognize, instead of using your own model — stuff like faces and QR codes — you can get the locations of those in the image frame with Vision's API.
If after locating an object in the 2D image, you want to display 3D content associated with it in AR (or display 2D content, but with said content positioned in 3D with ARKit), you'll need to hit test those 2D image points against the 3D world.
Once you get to this step, placing AR content with a hit test is something that's already pretty well covered elsewhere, both by Apple and the community.
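To make the flow concrete, here's a minimal sketch tying those steps together for the classifier case; it assumes an ARSCNView and a scene-classifier Core ML model already wrapped in a VNCoreMLModel, and it omits the throttling, threading, orientation handling and error handling a real app would need:

```swift
import ARKit
import Vision

final class SceneRecognizer: NSObject, ARSessionDelegate {
    private let sceneView: ARSCNView
    private let visionModel: VNCoreMLModel   // e.g. try VNCoreMLModel(for: yourClassifier.model)

    init(sceneView: ARSCNView, visionModel: VNCoreMLModel) {
        self.sceneView = sceneView
        self.visionModel = visionModel
        super.init()
        sceneView.session.delegate = self   // ARFrames get pushed to us (the delegate route described above)
    }

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        let pixelBuffer = frame.capturedImage   // the CVPixelBuffer Vision wants

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            // Classifier models fill the results with VNClassificationObservation.
            guard let top = (request.results as? [VNClassificationObservation])?.first else { return }
            print("scene is probably a \(top.identifier), confidence \(top.confidence)")
        }
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        try? handler.perform([request])   // in a real app, throttle this and run it off the main thread
    }

    /// Once you have a 2D point for the recognized object, hit-test into the 3D
    /// world and drop an anchor there for your AR content.
    func placeContent(at point: CGPoint) {
        guard let result = sceneView.hitTest(point, types: [.featurePoint]).first else { return }
        sceneView.session.add(anchor: ARAnchor(transform: result.worldTransform))
    }
}
```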

face tracking or "dynamic recognition"

What is the best approach for face detection/tracking considering the following scenario:
When a person enters the scene/frame, they should be detected and recognized in every subsequent frame until they leave the scene.
It should also be able to do this for multiple users at once.
I have experience with Viola-Jones detection and Fisherfaces recognition, but I've only used Fisherfaces recognition with a previously prepared training set, and now I need something that works for any user who enters the scene.
I am also interested in different solutions.
I used OpenCV face detection for multiple faces together with the Rekognition API (http://rekognition.com), pushed the detected faces to it and retrained the dataset frequently. Lightweight on our side, but I am sure there are more robust solutions for this.
Have you tried VideoSurveillance? Also known as OpenCV blob tracker.
It's a motion-based tracker with across-frame data association (1), and if you want to replace motion with face detection, you have to adjust the code by replacing the foreground mask with detection responses. This approach is called track-by-detect in the literature.
(1) "Appearance Models for Occlusion Handling", Andrew Senior et al.

Looking to develop a visual odometer (distance traveled) app for indoor use

Is there any open source code which will take a video shot indoors (from a smartphone, for example, of a home or office building's hallways) and superimpose the path traveled on a 2D picture? This can be a hand-drawn picture or a photo of a floor layout.
At first I thought of doing this using the accelerometer and compass sensors, but I suspect one can get better accuracy with the visual odometry approach. I only need 0.5 to 1 meter accuracy. The phone will also collect important information indoors (no GPS) to superimpose on the path traveled (this is the real application of the project, and we know how to do that part). The post-processing of the video can be done later on a standalone computer, so speed and CPU power are not an issue.
Challenges -
The user will simply hand-carry the smartphone, so the camera is moving (walking) and not fixed.
I'd like to limit the frame rate to keep the file size small (5 frames/sec? Is that OK?). Typically I'll need perhaps a full hour of video.
Will using inputs from the phone sensors help the visual approach?
Any help or guidance is appreciated. Thanks!
I have worked in the area for quite some time. There are three points which I'd care to make.
Vision only is hard
Vision-based navigation using just a cellphone camera is very difficult. Most of the literature with great results reports ~1% of distance traveled as state-of-the-art error, but usually uses stereo cameras. Stereo helps a great deal, particularly in indoor environments, for coping with scale drift. I've worked on a system which achieves 0.5% of distance traveled with stereo but only roughly 5% of distance traveled with monocular. While I can't share code, much of our system was inspired by this Sibley and Mei paper.
Stereo code in our case ran at a full 60 fps on a desktop. Provided you can push data fast enough, it'll be fine. With your error envelope (0.5 to 1 m), you can only navigate for 100 m or so before drift exceeds it. Is that enough?
Multi-sensor is the way to go, though the other sensors are worse than vision by themselves.
I've heard of some good work with accelerometers mounted on the foot to do ZUPTs (zero-velocity updates) while the foot is briefly motionless on the ground during a step, in order to zero out drift. This approach has the clear drawback of needing to mount the device on your foot, which makes a vision approach largely useless.
The compass is interesting but will be thrown off by the tons of metal inside an office building. Moving a few feet around a large metal cabinet might cause a 50+ degree jump in heading.
Ultimately, a combination of sensors is likely to be the best if you can make that work.
Can you solve a simpler problem?
How much control do you have over your environment? Can you slap down fiducial markers? Can you do WiFi triangulation? Does it need to be an initial exploration? If you can go through the environment beforehand and produce visual bubbles (akin to Google Street View) to match against, you'll be much more accurate.
I'm not aware of any software that does this directly (though it might exist) but stuff similar to what you want to do has been done. A few pointers:
Google for "Vision based robot localization" the problem you state is very similar to the problem robots with a camera have when they enter a new environment. In this field the approach is usually to have the robot map its environment and then use the model for later reference, but the techniques are similar to what you'll need.
Optical flow will roughly tell you in what direction the camera is moving, but it won't tell you the speed because you have no objective reference. This is because you don't know if the things you see moving in the video feed are 1cm away and very small or 1 mile away and very big.
If you know the camera matrix of the camera recording the images you could try partial 3D scene reconstruction techniques to take a stab at the speed. Note that you can do the 3D scene stuff without the camera matrix (this is the "uncalibrated" part you see in the title of a lot of the google results), the camera matrix will let you add real world object sizes (and hence distances) to your reconstruction.
The number of images per second you need depends on the speed of the camera. More is better, but my guess is that 5 per second should be sufficient at walking speed.
Using extra sensors will help. Probably the robot localization articles talk about this as well.
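As an aside, if the phone can be a recent iPhone, ARKit's visual-inertial odometry already fuses the camera with the IMU and reports poses in meters, which sidesteps the monocular scale ambiguity mentioned above. A sketch of using it as a crude odometer (note this swaps the offline video-processing route for an on-device one, and assumes an ARKit-capable device):

```swift
import ARKit
import simd

// Integrate the distance between successive ARKit camera positions while walking.
final class DistanceMeter {
    private let session = ARSession()
    private var lastPosition: simd_float3?
    private(set) var metersTraveled: Float = 0

    func start() {
        session.run(ARWorldTrackingConfiguration())
    }

    /// Poll a few times per second (a timer is fine) while the user walks.
    func sample() {
        guard let frame = session.currentFrame else { return }
        let c = frame.camera.transform.columns.3
        let position = simd_float3(c.x, c.y, c.z)
        if let last = lastPosition {
            metersTraveled += simd_distance(position, last)
        }
        lastPosition = position
    }
}
```

The accumulated positions could then be projected onto your 2D floor plan to draw the path traveled.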