How to get the position of a custom object in an image using a vision recognition API - computer-vision

I know there are a lot of vision recognition APIs, such as Clarifai, Watson, Google Cloud Vision, and Microsoft Cognitive Services, which provide recognition of image content. The response of these services is a simple JSON document that contains different tags, for example
{
man: 0.9969295263290405,
portrait: 0.9949591159820557,
face: 0.9261120557785034
}
The problem is that I need to know not only what is in the image but also the position of that object. Some of those APIs have such a feature, but only for face detection.
So does anyone know if there is such an API, or do I need to train my own Haar cascades in OpenCV for every object?
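(For reference, the OpenCV route I mean would look roughly like the sketch below: a trained cascade returns a bounding box for every detection. The cascade file name is a placeholder; what I want to avoid is training a cascade like this for every object.)

# Rough sketch of detecting objects with a trained Haar cascade in OpenCV.
# "my_object_cascade.xml" stands in for a cascade you would have to train
# (or download) for each object class.
import cv2

cascade = cv2.CascadeClassifier("my_object_cascade.xml")
image = cv2.imread("photo.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale returns one (x, y, w, h) bounding box per detection
boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in boxes:
    print("object at x=%d y=%d w=%d h=%d" % (x, y, w, h))
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.jpg", image)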
I will be very grateful if you can share some info.

You could take a look at Wolfram Cloud/Mathematica.
It has the ability to detect object locations in a picture.
Some examples: detecting road signs, finding Waldo, object tracking in video.

Related

Detecting liveness with AWS Rekognition

I am looking to use AWS Rekognition in one of my projects and trying to find out whether or not it's possible to differentiate between a still image (photograph) and a real person, in other words liveness detection. I don't want my system to be fooled by a still photograph for authentication.
I see that it has many features such as pose and emotion detection, etc. If it's not an official feature, is there a workaround or any tricks that some of you have used to achieve what I want?
I am also wondering if it's possible to detect gaze and how best to approach that. I want to see where the user is looking: at the screen, to the side, etc.
Alternatively, if AWS does not have a good solution for this, what are some of your alternative recommendations?
Regards
You could make use of blink detection, which isn't part of AWS Rekognition, to check that the input isn't a still photograph. You just need OpenCV.
Here is an example.
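The idea can be sketched with nothing but the Haar cascades that ship with OpenCV: detect the face and the eyes on every frame, and treat a short run of frames in which the eyes vanish as a blink. A minimal sketch (the cascade paths rely on the opencv-python package, and the frame-count threshold is an arbitrary assumption):

# Very rough blink-detection sketch using only OpenCV's bundled Haar cascades.
# A "blink" is counted when a face is visible but its eyes disappear for a
# few consecutive frames. Thresholds are guesses and need tuning.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

cap = cv2.VideoCapture(0)        # default webcam
frames_without_eyes = 0
blinks = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    for (x, y, w, h) in faces:
        roi = gray[y:y + h, x:x + w]
        eyes = eye_cascade.detectMultiScale(roi, 1.1, 5)
        if len(eyes) == 0:
            frames_without_eyes += 1
        else:
            if 1 <= frames_without_eyes <= 5:   # short gap -> likely a blink
                blinks += 1
                print("blink detected, total:", blinks)
            frames_without_eyes = 0
    cv2.imshow("camera", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

If at least one blink shows up within a few seconds, the input is very unlikely to be a printed photograph, though a video replayed on a screen would still pass, which is why the hardware-based approaches below are stronger.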
Face recognition alone is notoriously insecure when it comes to authentication, as has been evidenced by the many examples of the Android Face Unlock functionality being fooled by photographs.
Apple makes use of depth-sensing cameras in its Face ID technology to create a 3D map of the face, which can't be fooled by a photograph. Windows Hello face authentication uses a camera specially configured for near-infrared (IR) imaging to authenticate.
As an alternative to gaze detection, you can have a look at a liveness example using AWS Rekognition based on face and nose position:
https://aws.amazon.com/pt/blogs/industries/improving-fraud-prevention-in-financial-institutions-by-building-a-liveness-detection-solution/
https://aws.amazon.com/blogs/industries/liveness-detection-to-improve-fraud-prevention-in-financial-institutions-with-amazon-rekognition/
https://github.com/aws-samples/liveness-detection

Vision Framework with ARKit and CoreML

While researching best practices and experimenting with multiple options for an ongoing project (i.e. a Unity3D iOS project in Vuforia with native integration, extracting frames with AVFoundation and then passing the image through cloud-based image recognition), I have come to the conclusion that I would like to use ARKit, the Vision framework, and Core ML; let me explain.
I am wondering how I would be able to capture ARFrames and use the Vision framework to detect and track a given object using a Core ML model.
Additionally, it would be nice to have a bounding box once the object is recognized, with the ability to add an AR object upon a touch gesture, but this is something that could be implemented after getting the core of the project working.
This is undoubtedly possible, but I am unsure of how to pass the ARFrames to CoreML via Vision for processing.
Any ideas?
Update: Apple now has a sample code project that does some of these steps. Read on for those you still need to figure out yourself...
Just about all of the pieces are there for what you want to do... you mostly just need to put them together.
You obtain ARFrames either by periodically polling the ARSession for its currentFrame or by having them pushed to your session delegate. (If you're building your own renderer, that's ARSessionDelegate; if you're working with ARSCNView or ARSKView, their delegate callbacks refer to the view, so you can work back from there to the session to get the currentFrame that led to the callback.)
ARFrame provides the current capturedImage in the form of a CVPixelBuffer.
You pass images to Vision for processing using either the VNImageRequestHandler or VNSequenceRequestHandler class, both of which have methods that take a CVPixelBuffer as an input image to process.
You use the image request handler if you want to perform a request that uses a single image — like finding rectangles or QR codes or faces, or using a Core ML model to identify the image.
You use the sequence request handler to perform requests that involve analyzing changes between multiple images, like tracking an object's movement after you've identified it.
You can find general code for passing images to Vision + Core ML attached to the WWDC17 session on Vision, and if you watch that session the live demos also include passing CVPixelBuffers to Vision. (They get pixel buffers from AVCapture in that demo, but if you're getting buffers from ARKit the Vision part is the same.)
One sticking point you're likely to have is identifying/locating objects. Most "object recognition" models people use with Core ML + Vision (including those that Apple provides pre-converted versions of on their ML developer page) are scene classifiers. That is, they look at an image and say, "this is a picture of a (thing)," not something like "there is a (thing) in this picture, located at (bounding box)".
Vision provides easy API for dealing with classifiers — your request's results array is filled in with VNClassificationObservation objects that tell you what the scene is (or "probably is", with a confidence rating).
If you find or train a model that both identifies and locates objects — and for that part, I must stress, the ball is in your court — using Vision with it will result in VNCoreMLFeatureValueObservation objects. Those are sort of like arbitrary key-value pairs, so exactly how you identify an object from those depends on how you structure and label the outputs from your model.
If you're dealing with something that Vision already knows how to recognize, instead of using your own model — stuff like faces and QR codes — you can get the locations of those in the image frame with Vision's API.
If after locating an object in the 2D image, you want to display 3D content associated with it in AR (or display 2D content, but with said content positioned in 3D with ARKit), you'll need to hit test those 2D image points against the 3D world.
Once you get to this step, placing AR content with a hit test is something that's already pretty well covered elsewhere, both by Apple and the community.

Does Google Cloud Vision API support face recognition or face identification?

I am looking for a Google Cloud API that can do both face recognition and identification. My understanding is that the Google Cloud Vision API supports only face detection, not recognition.
Is there any Google Cloud API that can do face recognition?
According to the Google Cloud Vision API documentation, it doesn't support face recognition, only face detection and other kinds of detection such as label, landmark, logo, and web detection.
Face Detection : Detect multiple faces within an image, along with the associated key facial attributes like emotional state or wearing headwear. Facial recognition is not supported.
Label Detection : Detect broad sets of categories within an image, ranging from modes of transportation to animals.
Explicit Content Detection : Detect explicit content like adult content or violent content within an image.
Logo Detection : Detect popular product logos within an image.
Landmark Detection : Detect popular natural and man-made structures within an image.
Image Attributes : Detect general attributes of the image, such as dominant colors and appropriate crop hints.
Web Detection : Search the Internet for similar images.
Optical Character Recognition : Detect and extract text within an image, with support for a broad range of languages, along with support for automatic language identification.
Check out more details: https://cloud.google.com/vision/
Hope it helps and gives you some more ideas.
Cloud Vision currently supports face detection, but not face recognition.
That is, it can tell you whether or not there are faces in an image (and where they all are), but it cannot tell you which faces are in the image.
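As a small illustration of that (a sketch using the google-cloud-vision Python client, v2+ style; the file name is a placeholder), face detection returns a bounding polygon and attribute likelihoods for every face, but never an identity:

# Sketch: face *detection* with the Cloud Vision Python client.
# You get a position and attributes for each face, never an identity.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("group_photo.jpg", "rb") as f:       # placeholder file name
    image = vision.Image(content=f.read())

response = client.face_detection(image=image)
for face in response.face_annotations:
    box = [(v.x, v.y) for v in face.bounding_poly.vertices]
    print("face at", box,
          "joy:", face.joy_likelihood,
          "headwear:", face.headwear_likelihood)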
The Google Cloud Vision API doesn't offer face recognition, only face detection, with four emotion likelihoods for each detected face and three image properties (blurred, underexposed, and wearing headwear).
You can use the OpenCV library (not a Google product) to create your own application with a pretrained machine learning model.

Emotion Recognition using Google Cloud Vision API?

I wish to use the Google Cloud Vision API to generate features from images, which I will then use to train an SVM for an emotion recognition problem. Please provide a detailed procedure for how to write a Python script that uses the Google Cloud Vision API to generate features I can feed directly into an SVM.
I would go with the following steps:
Training
Create a dataset (training + testing) for whichever emotions you want (such as anger, happiness, etc.). This dataset must be diverse but balanced in terms of gender and age.
Extract the features of each face.
Normalize the whole dataset. Get the bounding box around each face and crop it from the image. Also, normalize the sizes of the faces.
Align the faces using the roll angle and eye coordinates, which can be acquired from the Google API.
Train an SVM (validate it, etc.).
Testing
Acquire an image.
Extract the features.
Normalize and align the face.
Use SVM.
Libraries that I suggest:
scikit-learn - SVM
OpenCV - image manipulation
A rough Python sketch of the whole pipeline follows.
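This sketch strings the steps together with the Cloud Vision face detector as the feature extractor and scikit-learn as the SVM. The directory layout (one folder per emotion label) and the choice of features (pose angles plus landmark positions normalized by the face box) are assumptions for illustration, not the only reasonable ones:

# Rough sketch of the pipeline above. Assumes a layout like
#   dataset/<emotion_label>/<image files>
# and that the API returns the same landmark set for every face
# (faces with an unexpected landmark count are skipped).
import os
from google.cloud import vision
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

client = vision.ImageAnnotatorClient()

def face_features(path):
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    faces = client.face_detection(image=image).face_annotations
    if not faces:
        return None
    face = faces[0]
    xs = [v.x for v in face.bounding_poly.vertices]
    ys = [v.y for v in face.bounding_poly.vertices]
    w = max(max(xs) - min(xs), 1)
    h = max(max(ys) - min(ys), 1)
    feats = [face.roll_angle, face.pan_angle, face.tilt_angle]
    # Landmark positions normalized by the face bounding box.
    for lm in face.landmarks:
        feats.append((lm.position.x - min(xs)) / w)
        feats.append((lm.position.y - min(ys)) / h)
    return feats

X, y = [], []
for label in os.listdir("dataset"):
    folder = os.path.join("dataset", label)
    for name in os.listdir(folder):
        feats = face_features(os.path.join(folder, name))
        if feats is None:
            continue
        if X and len(feats) != len(X[0]):
            continue                      # unexpected landmark set; skip
        X.append(feats)
        y.append(label)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))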

Classification of Lightning type in Images

I need to write an application that uses image processing functionality to identify the type of lightning in an image. The lightning types it has to identify are cloud-to-ground and intracloud lightning, which are shown in the pictures below. Cloud-to-ground lightning has these features: it hits the ground and has flashes branching downwards. The feature of intracloud lightning is that it has no contact with the ground. Are there any image processing algorithms you know of that I can use to identify these features in the image, so that the application will be able to identify the lightning type? I want to implement this in C++ using the CImg library.
Thanking you in advance
(Since I can't upload photos because I am a new user, I posted links to the images.)
http://wvlightning.com/types.shtml
Wow, this seems like a fun problem. If you had a large set of images for each type you might be able to use Haar training (http://note.sonots.com/SciSoftware/haartraining.html), but I'm not sure that would work because of the shape of lightning. Maybe Haar training in combination with your own algorithm. For instance, it should be very straightforward to tell whether the lightning reaches the ground. You could use some OpenCV image analysis to do that - http://www.cs.iit.edu/~agam/cs512/lect-notes/opencv-intro/
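As a rough sketch of that ground-contact check (in Python with OpenCV for brevity; the same thresholding logic ports directly to C++/CImg, and both the brightness threshold and the assumption that the lightning is the brightest structure in the frame are mine):

# Rough sketch: decide "cloud-to-ground" vs "intracloud" by checking whether
# the brightest pixels (assumed to be the lightning channel) reach the bottom
# band of the image. Threshold values are guesses.
import cv2

image = cv2.imread("lightning.jpg", cv2.IMREAD_GRAYSCALE)

# Keep only very bright pixels (the lightning channel).
_, bright = cv2.threshold(image, 220, 255, cv2.THRESH_BINARY)

# Does the channel extend into the lowest ~10% of the frame?
h = bright.shape[0]
bottom_band = bright[int(0.9 * h):, :]
touches_ground = cv2.countNonZero(bottom_band) > 0

print("cloud-to-ground" if touches_ground else "intracloud")

In practice you would want to measure against the horizon line rather than the bottom edge of the frame, and clean the thresholded mask (e.g. with a morphological opening) before checking for ground contact.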