I have a video of 3 people speaking, and I would like to annotate the location of their eyes throughout it. I know that the Google Video Intelligence API has functionality for object tracking, but is it possible to handle such an eye-tracking process using the API?
The Google Video Intelligence API provides a Face Detection feature, which lets you perform face detection within video frames and also detect specific face attributes.
In general, you need to configure FaceDetectionConfig as part of the videos.annotate method, supplying the includeBoundingBoxes and includeAttributes arguments in the JSON request body:
{
  "inputUri": "string",
  "inputContent": "string",
  "features": [
    "FACE_DETECTION"
  ],
  "videoContext": {
    "segments": [
      "object (VideoSegment)"
    ],
    "faceDetectionConfig": {
      "model": "string",
      "includeBoundingBoxes": true,
      "includeAttributes": true
    }
  },
  "outputUri": "string",
  "locationId": "string"
}
There is a detailed (Python) example from Google on how to track objects and print out detected objects afterward. You could combine this with the AIStreamer live object tracking feature, to which you can upload a live video stream to get results back.
Some ideas/steps you could follow:
Recognize the eyes in the first frame of the video.
Set/highlight a box around the eyes you are tracking.
Track the eyes as an object in the next frames.
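If it helps, here is a minimal Python sketch of assembling the request body shown earlier. The gs:// URIs are placeholders, and actually sending the payload still requires an authenticated POST to the videos:annotate endpoint:

```python
import json

# Sketch: build the videos.annotate request body for face detection.
# The gs:// URIs below are placeholders; substitute your own bucket paths.
request_body = {
    "inputUri": "gs://my-bucket/interview.mp4",
    "features": ["FACE_DETECTION"],
    "videoContext": {
        "faceDetectionConfig": {
            "includeBoundingBoxes": True,
            "includeAttributes": True,
        }
    },
    "outputUri": "gs://my-bucket/annotations.json",
}

# Serialize for an authenticated POST to
# https://videointelligence.googleapis.com/v1/videos:annotate
payload = json.dumps(request_body)
```

The response is a long-running operation; you would poll it (or read outputUri) to get the per-frame bounding boxes back.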
Related
I'm trying to recognize Munchkin cards from the card game. I've been trying a variety of image recognition APIs (Google Vision API, vize.ai, Azure's Computer Vision API, and more), but none of them seem to work well.
They're able to recognize one of the cards when only one appears in the demo image, but when it appears alongside another card, they fail to identify one or the other.
I've trained the APIs with a set of about 40 different images per card, with different angles, backgrounds, and lighting.
I've also tried using OCR (via the Google Vision API), which works only for some cards, probably due to small lettering and little detail on some cards.
Does anyone know of a way I can teach one of these APIs (or another) to read these cards better? Or perhaps recognize cards in a different way?
The desired outcome is a user capturing an image while playing the game and having the application understand which cards are in front of them and return the results.
Thank you.
What a coincidence! I've recently done something very similar – link to video – with great success! Specifically, I was trying to recognise and track Chinese-language Munchkin cards to replace them with English ones. I used iOS's ARKit 2 (requires an iPhone 6S or higher; or a relatively new iPad; and isn't supported on desktop).
I basically just followed the Augmented Reality Photo Frame demo 41 minutes into WWDC 2018's What's New in ARKit 2 presentation. My code below is a minor adaptation to theirs (merely replacing the target with a static image rather than a video). The tedious part was scanning all the cards in both languages, cropping them out, and adding them as AR resources...
Here's my source code, ViewController.swift:
import UIKit
import SceneKit
import ARKit
import Foundation

class ViewController: UIViewController, ARSCNViewDelegate {

    @IBOutlet var sceneView: ARSCNView!

    override func viewDidLoad() {
        super.viewDidLoad()

        // Set the view's delegate
        sceneView.delegate = self

        // Show statistics such as fps and timing information
        sceneView.showsStatistics = true

        sceneView.scene = SCNScene()
    }

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)

        // Create a configuration
        let configuration = ARImageTrackingConfiguration()

        guard let trackingImages = ARReferenceImage.referenceImages(inGroupNamed: "card_scans", bundle: Bundle.main) else {
            print("Could not load images")
            return
        }

        // Set up the configuration
        configuration.trackingImages = trackingImages
        configuration.maximumNumberOfTrackedImages = 16

        // Run the view's session
        sceneView.session.run(configuration)
    }

    override func viewWillDisappear(_ animated: Bool) {
        super.viewWillDisappear(animated)

        // Pause the view's session
        sceneView.session.pause()
    }

    // MARK: - ARSCNViewDelegate

    // Override to create and configure nodes for anchors added to the view's session.
    public func renderer(_ renderer: SCNSceneRenderer, nodeFor anchor: ARAnchor) -> SCNNode? {
        let node = SCNNode()

        if let imageAnchor = anchor as? ARImageAnchor {
            // Create a plane matching the detected card's physical size
            let plane = SCNPlane(width: imageAnchor.referenceImage.physicalSize.width,
                                 height: imageAnchor.referenceImage.physicalSize.height)
            print("Asset identified as: \(anchor.name ?? "nil")")

            // Set a UIImage as the plane's texture
            plane.firstMaterial?.diffuse.contents = UIImage(named: "replacementImage.png")

            let planeNode = SCNNode(geometry: plane)

            // Rotate the plane to match the anchor
            planeNode.eulerAngles.x = -.pi / 2
            node.addChildNode(planeNode)
        }

        return node
    }

    func session(_ session: ARSession, didFailWithError error: Error) {
        // Present an error message to the user
    }

    func sessionWasInterrupted(_ session: ARSession) {
        // Inform the user that the session has been interrupted, for example, by presenting an overlay
    }

    func sessionInterruptionEnded(_ session: ARSession) {
        // Reset tracking and/or remove existing anchors if consistent tracking is required
    }
}
Unfortunately, I hit a limitation: card recognition becomes rife with false positives the more cards you add as AR targets to distinguish between (to clarify: not the number of targets simultaneously onscreen, but the library size of potential targets). While a 9-target library performed with a 100% success rate, it didn't scale to a 68-target library (which is all the Munchkin treasure cards). The app tended to flit between 1-3 potential guesses when faced with each target. Seeing the poor performance, I didn't go to the effort of adding all 168 Munchkin cards in the end.
I used Chinese cards as the targets, which are all monochrome; I believe it could have performed better if I'd used the English cards as targets (as they are full-colour, and thus have richer histograms), but on my initial inspection of a 9-card set in each language, I was receiving as many warnings for the AR resources being hard to distinguish for English as I was for Chinese. So I don't think the performance would improve so far as to scale reliably to the full 168-card set.
Unity's Vuforia would be another option to approach this, but again has a hard limit of 50-100 targets. With (an eye-wateringly expensive) commercial licence, you can delegate target recognition to cloud computers, which could be a viable route for this approach.
Thanks for investigating the OCR and ML approaches – they would've been my next ports of call. If you find any other promising approaches, please do leave a message here!
You are going in the wrong direction. As I understand it, you have an image, and inside that image there are several Munchkin cards (2 in your example). It is not just recognition: card detection is needed too. So your task should be divided into a card detection task and a card text recognition task.
For each task you can use the following approach:
1. Card detection task
Simple color segmentation
(If you have enough time and patience, train an SSD to detect the cards)
2. Card text recognition
Use Tesseract with the English dictionary
(You could add a card-rotation step to improve accuracy)
Hope that helps.
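To make step 1 concrete, here is a minimal sketch of simple color segmentation via thresholding plus connected-component labeling (NumPy/SciPy assumed; a real photo would need colour-space tuning, and step 2 would pass each cropped box to Tesseract, e.g. via pytesseract):

```python
import numpy as np
from scipy import ndimage

def detect_card_regions(image, threshold=128, min_area=100):
    """Segment bright card-like regions from a darker background.

    Returns a list of (top, bottom, left, right) bounding boxes,
    one per connected region covering at least min_area pixels.
    """
    mask = image > threshold              # simple intensity segmentation
    labels, count = ndimage.label(mask)   # connected-component labeling
    boxes = []
    for region in range(1, count + 1):
        ys, xs = np.nonzero(labels == region)
        if ys.size >= min_area:
            boxes.append((ys.min(), ys.max(), xs.min(), xs.max()))
    return boxes

# Synthetic demo image: two white "cards" on a black table.
img = np.zeros((100, 100), dtype=np.uint8)
img[10:40, 10:40] = 255
img[60:90, 50:95] = 255
boxes = detect_card_regions(img)
```

Each returned box can then be cropped out of the original image and fed to the OCR step individually, which tends to work much better than OCR on the whole scene.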
You can try this: https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts/csharp#OCR. It will detect text, and then you can apply your own custom logic (based on the detected text) to handle actions.
I'm using Beam to build a streaming ML pipeline that's very similar to the Zeitgeist system mentioned in the original MillWheel paper. However, I'm having difficulty using trained models to do online predictions (the blue vertical arrow "Models" in Figure 1 of the paper).
It seems that Zeitgeist models are updated incrementally (?) each time a new windowed counter comes in. However, the specific model I'm using doesn't support incremental/online training, so I need to use some trigger/windowing to train the model in batches.
During prediction, I don't know how to align windows of features and batch-trained models.
Using a side input (see below) makes the pipeline run, but it gets stuck at the prediction step, waiting for the model to be materialized.
CoGroupByKey does not work because the windows of features and ensembledModels are not the same.
I also want to do model ensembling, which makes things even more complicated.
Here's a rough sketch of what I have
// Corresponds to "Window Counter" in the MillWheel paper
final PCollection<Feature> features = pipeline
    .apply(PubsubIO.read(...))
    .apply(...some windowing...)
    .apply(ParDo.of(new FeatureExtractor()));

// Corresponds to "Model Calculator" in the MillWheel paper
final PCollection<Iterable<Model>> ensembledModels = features
    .apply(...some windowing and data triggers...)
    .apply(new ModelTrainer());

// Corresponds to "Spike/Dip Detector" in the MillWheel paper
// The pipeline gets stuck here when run
final PCollection<Score> score = features
    .apply(ParDo.of(new Predictor()).withSideInput(ensembledModels));
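To make the alignment concrete, here is a plain-Python illustration of the join logic I'm after, i.e. "score each feature against the most recent model whose training window has closed" (the model names and timestamps are stand-ins, not Beam types):

```python
import bisect

def latest_model_for(feature_ts, model_windows):
    """Pick the most recent model whose training window closed
    at or before the feature's timestamp.

    model_windows: list of (window_end_ts, model) pairs, sorted by
    window_end_ts. Returns None if no model is ready yet, which is
    exactly the situation where the side input has nothing to offer.
    """
    ends = [end for end, _ in model_windows]
    i = bisect.bisect_right(ends, feature_ts)
    return model_windows[i - 1][1] if i > 0 else None

# Models trained on windows ending at t=10 and t=20.
models = [(10, "model_v1"), (20, "model_v2")]
```

In Beam terms, this is why the pipeline stalls: until the first model window fires, there is nothing to look up. A commonly suggested pattern (a sketch of an approach, not the only one) is to re-window the model PCollection into the global window with a repeated trigger and consume it as a side input, handling the "no model yet" case explicitly.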
I have the Google Mirror API's Quick Start for PHP up and running on a Microsoft Azure Website and can communicate with Google Glass.
I had a closer look at the options, like the "request/response" example:
case 'insertItemWithAction':
    $new_timeline_item = new Google_TimelineItem();
    $new_timeline_item->setText("What did you have for lunch?");

    $notification = new Google_NotificationConfig();
    $notification->setLevel("DEFAULT");
    $new_timeline_item->setNotification($notification);

    $menu_items = array();

    // A couple of built in menu items
    $menu_item = new Google_MenuItem();
    $menu_item->setAction("REPLY");
    array_push($menu_items, $menu_item);

    $menu_item = new Google_MenuItem();
    $menu_item->setAction("READ_ALOUD");
    array_push($menu_items, $menu_item);
    $new_timeline_item->setSpeakableText("What did you eat? Bacon?");

    $menu_item = new Google_MenuItem();
    $menu_item->setAction("SHARE");
    array_push($menu_items, $menu_item);
(from https://github.com/googleglass/mirror-quickstart-php/blob/master/index.php)
I am now wondering if it is possible to use the Google Glass Mirror API to scan a QR code.
The idea is to replace the user having to speak a control digit, convert the control digit to a QR code and have the user scan the QR code without having to speak.
Is this possible?
You cannot present a QR Code scanning screen to your user by only using the Mirror API. Nor can you add a MenuItem allowing the user to send back a picture.
But, you can register as a contact, and have your users share with you pictures containing QR Codes.
More info about registering as a contact
More info about receiving shares
This is not a very fluid user experience, but it's the only way you could "scan" QR Codes while only using the Mirror API.
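For reference, registering as a contact is a single REST insert against the Mirror API's contacts collection (POST to https://www.googleapis.com/mirror/v1/contacts). A rough sketch of the request body, where the id, displayName, and imageUrls values are placeholders:

```json
{
  "id": "qr-scanner",
  "displayName": "QR Scanner",
  "imageUrls": ["https://example.com/icon.png"],
  "acceptTypes": ["image/jpeg"]
}
```

Pictures the user shares to this contact then arrive through the share notifications you subscribe to, and your server can run QR decoding on the received image.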
I'm testing my app with a friend list of about 350 people, and I'm not seeing any pagination on /me/friends/.
What's unclear to me (from both testing and the documentation) is the following:
At how many friends do the graph.facebook.com/me/friends or graph.facebook.com/ID/friends Graph API calls start to paginate, if at all?
Look at the Graph API Explorer: https://developers.facebook.com/tools/explorer?method=GET&path=me%2Ffriends and scroll to the very bottom; you will see something like:
"paging": {
"next": "https://graph.facebook.com/me/friends?format=json&limit=5000&offset=5000&__after_id=XXX"
}
which leads me to believe that the default page size is 5000.
You can set that limit explicitly if you want to: https://developers.facebook.com/tools/explorer?method=GET&path=me%2Ffriends%26limit%3D1
Set the limit yourself using limit=5000 for the maximum, i.e. /me/friends?limit=5000
With the JavaScript SDK, two fields are returned: data and paging. Just hit the paging.next URL; when a result comes back with no paging/next value, you've hit the end.
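The same paging contract applies from any client: keep following paging.next until it disappears. A small sketch with an injectable fetch function (fetch_json stands in for your HTTP GET + JSON parse):

```python
def fetch_all_friends(first_url, fetch_json):
    """Follow Graph API paging: accumulate `data` from each page
    until the response no longer carries a paging.next URL.

    fetch_json(url) -> dict is a stand-in for an HTTP request that
    parses the JSON response body.
    """
    friends, url = [], first_url
    while url:
        page = fetch_json(url)
        friends.extend(page.get("data", []))
        url = page.get("paging", {}).get("next")
    return friends

# Fake two-page response for demonstration.
pages = {
    "page1": {"data": [{"name": "A"}], "paging": {"next": "page2"}},
    "page2": {"data": [{"name": "B"}]},
}
result = fetch_all_friends("page1", pages.get)
```

With a real client you would pass a function that performs the authenticated GET; the loop itself is unchanged.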
I am currently working on a Django project that will hopefully apply some transformations to video files through the web. To transform the videos I am using OpenCV's Python API, and I am also using Dajax to perform Ajax requests.
In the ajax requests file i have the following code:
@dajaxice_register
def transform_and_show(request, filename, folder, frame_count, img_number):
    detector = Detector(filename)  # Object which uses the OpenCV API
    dajax = Dajax()
    generated_file = detector.detect_people_by_frame(folder, str(img_number))
    dajax.assign('#video', 'src', '/media/generated' + folder + generated_file)
    return dajax.json()
The idea is to transform videos frame by frame and display each transformed frame in the browser in an img tag, giving the user the sensation of watching the transformed video, so this method is called in a JavaScript loop.
The problem is that, in this approach, the "detector" object is reinitialized on every iteration, so it only ever generates the image corresponding to the first frame of the video. My idea was to work around this issue by making "detector" persistent between requests, so that the pointer to the next frame of the video wouldn't be reset to 0 on every request.
The problem is that the Detector object is not picklable, meaning that it cannot be cached or saved to a session object.
Is there anything I can do to make it persistent between requests?
NOTE: I have considered using HTTP push approaches like APE or Orbit but since this is just an investigation project there is no real concern about performance.
Have you tried a module-level variable to store the object?
Make "detector" a global at the file level:
detector = None

def transform():
    global detector
    # The module-level detector survives between requests within one
    # server process, so its frame pointer is not reset on every call.
    if detector is None:
        detector = Detector(filename)
    file = detector.detect(....)
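If a single global becomes a problem once several videos (or users) are in flight, one option is a module-level cache keyed by filename. A sketch, with the detector construction injected as a factory so the snippet stays self-contained (factory stands in for Detector):

```python
# Module-level cache: one detector per video file, per server process.
_detectors = {}

def get_detector(filename, factory):
    """Return the cached detector for filename, creating it on first use.

    factory(filename) stands in for Detector(filename).
    """
    if filename not in _detectors:
        _detectors[filename] = factory(filename)
    return _detectors[filename]

# Stub factory for demonstration; records how many detectors get built.
built = []
def fake_detector(name):
    built.append(name)
    return {"file": name, "next_frame": 0}

first = get_detector("video.avi", fake_detector)
second = get_detector("video.avi", fake_detector)
```

Note the caveat that applies to any module-level state under Django: each worker process keeps its own cache, so this only works reliably if successive requests for the same video land on the same process (fine for an investigation project, as noted).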