I am trying to build a program that takes just the hand input from the Kinect.
I need to acquire three things:
- the Kinect depth video stream, with OpenGL output rendered on top of it
- recognition of just two simple hand gestures, open hand and closed fist; I will build a function to reduce each hand's state to a Boolean
- left and right hand positions; if it is possible to track more than two hands, that would be great
Basically I want to do a click-and-drag mouse operation with open and closed hand motions on the Kinect. Let's start with just one hand; if tracking more than two hands is possible, I will learn that later.
From what I have read so far, the Kinect can do this easily without any extra libraries, so I should be able to build my application with just the Kinect library and OpenGL.
I have heard there are tons of examples for this online, but all I have found so far are in C#, not C++. The other components of my program are C++ only, and I want to stay with C++ if possible.
There are essentially two layers:
Interaction Stream (C++ or managed)
Interaction Controls (managed only, WPF-specific)
The WPF controls are implemented in terms of the interaction stream.
If you are using a UI framework other than WPF, you will need to do the following:
1. Implement the "interaction client" interface. This interface has a single method, GetInteractionInfoAtLocation. This method will be called repeatedly by the interaction stream as it tracks the user's hand movements. Each time it is called, it is your responsibility to return the "interaction info" (InteractionInfo in managed code, NUI_INTERACTION_INFO in C++) for the given user, hand, and position. Essentially, this is how the interaction stream performs hit-testing on the controls within your user interface.
2. Create an instance of the interaction stream, supplying it a reference to your interaction client implementation.
3. Start the Kinect sensor's depth and skeleton streams.
4. For each depth and skeleton frame produced by the sensor streams, pass the frame's data to the appropriate method (ProcessDepth or ProcessSkeleton) of the interaction stream. As the interaction stream processes the input frames from the sensor, it will produce interaction frames for your code to consume. In C++, call the interaction stream's GetNextFrame method to retrieve each such frame. In managed code, you can either call OpenNextFrame or subscribe to the InteractionFrameReady event.
5. Read the data from each interaction frame to find out what the user is doing. Each frame has a timestamp and a collection of user info structures, each of which has a user tracking ID and a collection of hand info structures, which provide information about each hand's position, state, and grip/ungrip events (see the sketch below).
You can find a complete sample here.
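For the C++ route, here is a rough, untested sketch of those steps against the Kinect for Windows SDK 1.7 interaction API. The header name, exact method signatures, and struct fields should be checked against your SDK install; the "everything is a grip target" hit-testing policy and the arguments feeding the stream are assumptions standing in for your own capture code.

// Rough sketch only; verify names and signatures against the SDK 1.7 headers.
#include <windows.h>
#include <NuiApi.h>
#include <KinectInteraction.h>   // assumed header name; may differ per install

// Step 1: the interaction client. The stream calls this to hit-test your UI;
// for a simple "click and drag anywhere" app, everything is a grip target.
class InteractionClient : public INuiInteractionClient
{
public:
    // Quick-and-dirty IUnknown for a sketch; a real app should check riid.
    STDMETHODIMP QueryInterface(REFIID, void **ppv) { *ppv = this; return S_OK; }
    STDMETHODIMP_(ULONG) AddRef()  { return 1; }
    STDMETHODIMP_(ULONG) Release() { return 1; }

    STDMETHODIMP GetInteractionInfoAtLocation(DWORD skeletonTrackingId,
                                              NUI_HAND_TYPE handType,
                                              FLOAT x, FLOAT y,
                                              NUI_INTERACTION_INFO *pInfo)
    {
        pInfo->IsGripTarget  = TRUE;    // let any screen position be "grabbed"
        pInfo->IsPressTarget = FALSE;
        return S_OK;
    }
};

// Step 2: create the stream once the sensor is running.
INuiInteractionStream *CreateStream(INuiSensor *pSensor, InteractionClient *pClient)
{
    INuiInteractionStream *pStream = nullptr;
    NuiCreateInteractionStream(pSensor, pClient, &pStream);
    return pStream;
}

// Steps 4-5: feed each depth/skeleton frame in, then read interaction frames.
void PumpInteraction(INuiInteractionStream *pStream,
                     UINT depthBufferSize, BYTE *pDepthBuffer, LARGE_INTEGER depthTimestamp,
                     NUI_SKELETON_FRAME &skeletonFrame, Vector4 &accelerometer)
{
    pStream->ProcessDepth(depthBufferSize, pDepthBuffer, depthTimestamp);
    pStream->ProcessSkeleton(NUI_SKELETON_COUNT, skeletonFrame.SkeletonData,
                             &accelerometer, skeletonFrame.liTimeStamp);

    NUI_INTERACTION_FRAME frame = {0};
    if (FAILED(pStream->GetNextFrame(0, &frame)))
        return;

    for (int u = 0; u < NUI_SKELETON_COUNT; ++u)
        for (int h = 0; h < NUI_USER_HANDPOINTER_COUNT; ++h)
        {
            const NUI_HANDPOINTER_INFO &hand = frame.UserInfos[u].HandPointerInfos[h];
            if (hand.HandEventType == NUI_HAND_EVENT_TYPE_GRIP)
            {
                // closed fist: begin your "drag"
            }
            else if (hand.HandEventType == NUI_HAND_EVENT_TYPE_GRIPRELEASE)
            {
                // open hand: end your "drag"
            }
        }
}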
I am writing an MFC project and using IXAudio2 to play .wav files.
My code is like this:
pSourceVoice->SubmitSourceBuffer( &buffer );
hr = pSourceVoice->Start( 0 );
But this way I can only play one sound at a time: I have to wait until this .wav file finishes playing before I can play the second file. How can I play whichever file I need without waiting for the first one to finish, like mixing sounds?
How can I achieve this?
You create one IXAudio2SourceVoice for each 'playing, active' sound you want at a time. Typically games will create a few hundred source voices to manage; beyond that the sound mix is usually too muddy to hear anyhow.
Game audio engines have a mix of 'one-shot' voices that just play on demand until they complete (at which point that source voice is freed for use by another one-shot sound), 'ambient' or 'music' voices that are manipulated as they play, and 3D-positioned sounds which have to be updated every rendering frame to track the 3D rendered object they are emitted from.
If you submit the same audio data to two distinct IXAudio2SourceVoice instances, then you'll hear it being played twice, mixed together in real-time. Typically you won't start them at precisely the same instant because the result is just a louder version of the one sound. Normally they are started at different times so you get overlapped playback.
See XAudio2 Programming Guide, GitHub, and DirectX Tool Kit for Audio.
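Here is a minimal sketch of the "one source voice per active sound" idea (error handling, WAV parsing, and COM setup for older XAudio2 versions omitted). LoadWavFile is a hypothetical helper that fills a WAVEFORMATEX and an XAUDIO2_BUFFER from a .wav file, and the file names are placeholders.

#include <xaudio2.h>
#pragma comment(lib, "xaudio2.lib")

// Hypothetical helper, not shown: reads a .wav file into wfx/buf.
HRESULT LoadWavFile(const wchar_t *path, WAVEFORMATEX *wfx, XAUDIO2_BUFFER *buf);

void PlayTwoSoundsAtOnce()
{
    IXAudio2               *pXAudio2 = nullptr;
    IXAudio2MasteringVoice *pMaster  = nullptr;
    XAudio2Create(&pXAudio2, 0, XAUDIO2_DEFAULT_PROCESSOR);
    pXAudio2->CreateMasteringVoice(&pMaster);

    WAVEFORMATEX   wfx1 = {}, wfx2 = {};
    XAUDIO2_BUFFER buf1 = {}, buf2 = {};
    LoadWavFile(L"first.wav",  &wfx1, &buf1);    // placeholder file names
    LoadWavFile(L"second.wav", &wfx2, &buf2);

    // One source voice per sound that must be audible at the same time.
    IXAudio2SourceVoice *pVoice1 = nullptr;
    IXAudio2SourceVoice *pVoice2 = nullptr;
    pXAudio2->CreateSourceVoice(&pVoice1, &wfx1);
    pXAudio2->CreateSourceVoice(&pVoice2, &wfx2);

    // Start the first sound...
    pVoice1->SubmitSourceBuffer(&buf1);
    pVoice1->Start(0);

    // ...and the second whenever you like; both are mixed together in real time.
    pVoice2->SubmitSourceBuffer(&buf2);
    pVoice2->Start(0);
}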
While I have been researching best practices and experimenting with multiple options for an ongoing project (i.e. a Unity3D iOS project in Vuforia with native integration, extracting frames with AVFoundation, then passing the image through cloud-based image recognition), I have come to the conclusion that I would like to use ARKit, the Vision framework, and CoreML; let me explain.
I am wondering how I would be able to capture ARFrames and use the Vision framework to detect and track a given object using a CoreML model.
Additionally, it would be nice to have a bounding box once the object is recognized, with the ability to add an AR object upon a touch gesture, but that is something that could be implemented after getting the core project working.
This is undoubtedly possible, but I am unsure of how to pass the ARFrames to CoreML via Vision for processing.
Any ideas?
Update: Apple now has a sample code project that does some of these steps. Read on for those you still need to figure out yourself...
Just about all of the pieces are there for what you want to do... you mostly just need to put them together.
You obtain ARFrames either by periodically polling the ARSession for its currentFrame or by having them pushed to your session delegate. (If you're building your own renderer, that's ARSessionDelegate; if you're working with ARSCNView or ARSKView, their delegate callbacks refer to the view, so you can work back from there to the session to get the currentFrame that led to the callback.)
ARFrame provides the current capturedImage in the form of a CVPixelBuffer.
You pass images to Vision for processing using either the VNImageRequestHandler or VNSequenceRequestHandler class, both of which have methods that take a CVPixelBuffer as an input image to process.
You use the image request handler if you want to perform a request that uses a single image — like finding rectangles or QR codes or faces, or using a Core ML model to identify the image.
You use the sequence request handler to perform requests that involve analyzing changes between multiple images, like tracking an object's movement after you've identified it.
You can find general code for passing images to Vision + Core ML attached to the WWDC17 session on Vision, and if you watch that session the live demos also include passing CVPixelBuffers to Vision. (They get pixel buffers from AVCapture in that demo, but if you're getting buffers from ARKit the Vision part is the same.)
One sticking point you're likely to have is identifying/locating objects. Most "object recognition" models people use with Core ML + Vision (including those that Apple provides pre-converted versions of on their ML developer page) are scene classifiers. That is, they look at an image and say, "this is a picture of a (thing)," not something like "there is a (thing) in this picture, located at (bounding box)".
Vision provides easy API for dealing with classifiers — your request's results array is filled in with VNClassificationObservation objects that tell you what the scene is (or "probably is", with a confidence rating).
If you find or train a model that both identifies and locates objects — and for that part, I must stress, the ball is in your court — using Vision with it will result in VNCoreMLFeatureValueObservation objects. Those are sort of like arbitrary key-value pairs, so exactly how you identify an object from those depends on how you structure and label the outputs from your model.
If you're dealing with something that Vision already knows how to recognize, instead of using your own model — stuff like faces and QR codes — you can get the locations of those in the image frame with Vision's API.
If after locating an object in the 2D image, you want to display 3D content associated with it in AR (or display 2D content, but with said content positioned in 3D with ARKit), you'll need to hit test those 2D image points against the 3D world.
Once you get to this step, placing AR content with a hit test is something that's already pretty well covered elsewhere, both by Apple and the community.
I would like to do skeletal tracking simultaneously from two Kinect cameras in SkeletalViewer and obtain the skeleton results. As I understand it, Nui_Init() only processes the threads for the first Kinect (which I suppose is index 0). However, could I have the two skeletal trackers run at the same time, so that I can output their results into two separate text files simultaneously?
(e.g. Kinect 0 outputs to "cam0.txt" while Kinect 1 outputs to "cam1.txt")
Does anyone have experience with such a case, or is anyone able to help?
Regards,
Eva
PS: I read this in the Kinect SDK documentation, which states:
If you are using multiple Kinect sensors, skeleton tracking works only on the first device that you Initialize. To switch the device being used to track, uninitialize the old one and initialize the new one.
So is it possible to acquire the coordinates simultaneously? Or, even if I acquire them one by one, how should I address each sensor separately? (As I understand it, the index of the active Kinect will be 0, so I can't differentiate them.)
I assume that you are using the MS SkeletalViewer example. The problem with SkeletalViewer is that it ties the display and the skeleton tracking closely together, which makes it difficult to change.
Using multiple Kinect sensors should be possible; you just need to initialize all the sensors the same way. The best thing to do would be to define a sensor class to wrap each Kinect sensor. If you don't need the display, you can just write a new program; that's a bit of work, but not that much, and you can probably get a fully working program for multiple sensors in fewer than 100 lines (see the sketch below). If you need the display, you can rewrite the SkeletalViewer example to use your sensor class, but that's more tedious.
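For what it's worth, here is a rough, untested sketch of such a wrapper against the C++ Kinect SDK 1.x API: one object per sensor, each logging the head joint of every tracked skeleton to its own camN.txt. Whether skeleton tracking actually runs on more than one sensor at once depends on your SDK version, per the documentation note you quoted, so treat this as the structure rather than a guarantee.

#include <windows.h>
#include <NuiApi.h>
#include <cstdio>
#include <vector>

struct KinectLogger
{
    INuiSensor *sensor = nullptr;
    FILE       *log    = nullptr;

    bool Open(int index)
    {
        if (FAILED(NuiCreateSensorByIndex(index, &sensor))) return false;
        if (FAILED(sensor->NuiInitialize(NUI_INITIALIZE_FLAG_USES_SKELETON))) return false;
        if (FAILED(sensor->NuiSkeletonTrackingEnable(nullptr, 0))) return false;   // poll, no event
        char name[32];
        sprintf_s(name, "cam%d.txt", index);   // cam0.txt, cam1.txt, ...
        log = fopen(name, "w");
        return log != nullptr;
    }

    void Poll()
    {
        NUI_SKELETON_FRAME frame = {0};
        if (FAILED(sensor->NuiSkeletonGetNextFrame(0, &frame))) return;
        for (int i = 0; i < NUI_SKELETON_COUNT; ++i)
        {
            const NUI_SKELETON_DATA &s = frame.SkeletonData[i];
            if (s.eTrackingState != NUI_SKELETON_TRACKED) continue;
            const Vector4 &head = s.SkeletonPositions[NUI_SKELETON_POSITION_HEAD];
            fprintf(log, "%lld %f %f %f\n",
                    (long long)frame.liTimeStamp.QuadPart, head.x, head.y, head.z);
        }
    }
};

int main()
{
    int count = 0;
    NuiGetSensorCount(&count);
    std::vector<KinectLogger> loggers(count);
    for (int i = 0; i < count; ++i) loggers[i].Open(i);
    for (;;)                              // simple polling loop; add your own exit condition
        for (auto &k : loggers) k.Poll();
}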
I'm a DirectShow newbie. I'm trying to get DirectShow to play back a set of media files, but NOT simultaneously.
I've tried allocating one graph and using RenderFile to add each file into it, but when I invoke IMediaControl::Run, they ALL begin playing at the same time.
I've tried allocating one graph and one IMediaControl per file and then calling Run at different times on each. This works; the streams play independently.
How do I combine the streams to an output window?
Is it possible to have a master surface on which the other streams are rendered into rectangles?
Since the streams are not in the same graph, can it be done?
What do I use for a surface or output?
Thanks
All filters in a graph are expected to change state together, so you do indeed need a separate graph for every file you want to play back independently of the others.
If you are going to play files simply side by side, without effects, overlapping, etc., the easiest option is to use separate video renderers and treat them as controls, positioning them appropriately in your UI.
If you instead want something more sophisticated, there are two ways to choose from: you either take decompressed video/audio out of the DirectShow filter graph using Sample Grabber or a similar filter, and then you are responsible for presenting the data yourself with other APIs; or you implement a custom allocator/presenter (also known as the renderless mode of operation of the video renderer) and finely control the output of the video, which in particular allows you to get a frame into a texture or offscreen surface, leaving the presentation itself to you.
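A minimal sketch of the "one graph per file, side-by-side renderers" option follows. hwndParent and the target rectangles are assumptions about your UI, and error handling and interface cleanup are omitted.

#include <windows.h>
#include <dshow.h>
#pragma comment(lib, "strmiids.lib")

struct Player
{
    IGraphBuilder *graph   = nullptr;
    IMediaControl *control = nullptr;
    IVideoWindow  *window  = nullptr;

    HRESULT Open(const wchar_t *file, HWND hwndParent, const RECT &rc)
    {
        HRESULT hr = CoCreateInstance(CLSID_FilterGraph, nullptr, CLSCTX_INPROC_SERVER,
                                      IID_IGraphBuilder, (void **)&graph);
        if (FAILED(hr)) return hr;
        hr = graph->RenderFile(file, nullptr);   // builds this file's own playback chain
        if (FAILED(hr)) return hr;
        graph->QueryInterface(IID_IMediaControl, (void **)&control);
        graph->QueryInterface(IID_IVideoWindow,  (void **)&window);
        // Host this graph's video renderer inside your window at the given rectangle.
        window->put_Owner((OAHWND)hwndParent);
        window->put_WindowStyle(WS_CHILD | WS_CLIPSIBLINGS);
        window->SetWindowPosition(rc.left, rc.top, rc.right - rc.left, rc.bottom - rc.top);
        return S_OK;
    }

    void Run() { control->Run(); }   // starts only this file's graph
};

Create one Player per file, each with its own rectangle, and call Run() only on the one you want to start; the other graphs stay stopped until you run them.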
I am creating an app in C++ (OpenGL) using the Kinect. Whenever we click in OpenGL, the function invoked is
void myMouseFunction( int button, int state, int mouseX, int mouseY )
{
}
But can we invoke it using the Kinect? Maybe we have to use the depth buffer for it, but how?
First: you don't "click in OpenGL", because OpenGL doesn't deal with user input. OpenGL is purely a rendering API. What you're referring to is probably a callback to be used with GLUT; GLUT is not part of OpenGL, but a freestanding framework that also does some user input event processing.
The Kinect does not generate input events. What the Kinect does is return a depth image of what it "sees". You need to process this depth image somehow. There are frameworks like OpenNI which process this depth image and translate it into gesture data or similar. You can then process such gesture data further and interpret it as user input.
In your tags you referred to "openkinect", the open source driver for the Kinect. However, OpenKinect does not do gesture extraction and interpretation; it only provides the depth image. You can of course perform simple tests on the depth data yourself, for example testing whether there is some object within the bounds of some defined volume and interpreting this as a sort of event.
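To make that last idea concrete, here is a rough sketch using libfreenect (the OpenKinect driver): check whether anything sits inside a small "trigger volume" in front of the sensor and treat that as a click-like event. The image-space window and the raw-depth threshold are arbitrary example values you would tune for your setup.

#include <libfreenect.h>
#include <cstdint>
#include <cstdio>

static void depth_cb(freenect_device *dev, void *v_depth, uint32_t timestamp)
{
    const uint16_t *depth = static_cast<const uint16_t *>(v_depth);  // 640x480, raw 11-bit values
    bool hit = false;
    for (int y = 200; y < 280 && !hit; ++y)       // small window in image space (example)
        for (int x = 280; x < 360 && !hit; ++x)
        {
            uint16_t d = depth[y * 640 + x];
            if (d > 0 && d < 600)                 // "close to the camera"; raw units, not mm
                hit = true;
        }
    if (hit)
        printf("object inside trigger volume at %u -> treat as click\n", timestamp);
}

int main()
{
    freenect_context *ctx = nullptr;
    freenect_device  *dev = nullptr;
    if (freenect_init(&ctx, nullptr) < 0 || freenect_open_device(ctx, &dev, 0) < 0)
        return 1;
    freenect_set_depth_mode(dev, freenect_find_depth_mode(FREENECT_RESOLUTION_MEDIUM,
                                                          FREENECT_DEPTH_11BIT));
    freenect_set_depth_callback(dev, depth_cb);
    freenect_start_depth(dev);
    while (freenect_process_events(ctx) >= 0) { /* pump USB events */ }
    return 0;
}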
I think you are confusing what the Kinect really does. The Kinect feeds depth and video data to your computer, which then has to process it. OpenKinect only does very minimal processing for you -- no skeleton tracking. Skeleton tracking lets you get a 3D representation of where each of the user's joints is.
If you're just doing some random hacking, you could perhaps switch to the KinectSDK -- with the caveat that you will only be able to develop and deploy on Windows.
The Kinect SDK works with OpenGL and C++ too, and you can get the user's "skeleton".
OpenNI -- which is multiplatform and free as in freedom -- also supports skeleton tracking, but I haven't used it so I can't recommend it.
After you have some sort of skeleton tracking up and running, you can focus on the user's hands and process their movements to get your "mouse clicks" working. This will not use GLUT's mouse handler, though (see the sketch below).
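If you do go the Kinect SDK route, a hedged sketch of that last step might look like the following: read the right-hand joint each skeleton frame, map it into depth-image coordinates, and scale that into your OpenGL window. handleVirtualCursor is a hypothetical function standing in for your own click/drag logic, and the exact NuiTransformSkeletonToDepthImage overload should be checked against your SDK headers.

#include <windows.h>
#include <NuiApi.h>

void handleVirtualCursor(int x, int y);   // hypothetical: your own cursor/click logic

void pollHand(INuiSensor *sensor, int windowWidth, int windowHeight)
{
    NUI_SKELETON_FRAME frame = {0};
    if (FAILED(sensor->NuiSkeletonGetNextFrame(0, &frame))) return;

    for (int i = 0; i < NUI_SKELETON_COUNT; ++i)
    {
        const NUI_SKELETON_DATA &s = frame.SkeletonData[i];
        if (s.eTrackingState != NUI_SKELETON_TRACKED) continue;

        const Vector4 &hand = s.SkeletonPositions[NUI_SKELETON_POSITION_HAND_RIGHT];
        FLOAT dx = 0.0f, dy = 0.0f;
        NuiTransformSkeletonToDepthImage(hand, &dx, &dy);   // 320x240 depth-image space

        // Scale depth-image coordinates into your OpenGL window and hand them
        // to your own "virtual cursor" logic instead of GLUT's mouse callback.
        int x = (int)(dx / 320.0f * windowWidth);
        int y = (int)(dy / 240.0f * windowHeight);
        handleVirtualCursor(x, y);
    }
}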