3D Object Detection & Tracking using PCL - computer-vision

I have a task where I am asked to track parcels (carton boxes) of different dimensions moving on a conveyor.
I am using an Asus Xtion Pro camera, and I have been asked to use point cloud data to detect and track the objects on the conveyor.
I have read about many model-based methods where you create a model of the object to be detected and then perform keypoint extraction, feature matching, and several other steps between the scene and the model.
Since I am using boxes of different dimensions, I would seemingly need a model of each of them to match against the scene.
My question: can I have one common point cloud model of a box of some dimensions and compare it with any box that comes into the camera's view? In other words, can I have one scalable model to compare with a box of any dimensions? Is that possible?
My chief doesn't want the project to depend on many models for detection and tracking; one common model that is scalable or parameterized should do the trick, it seems.
Thanks in advance

Related

Train a model to draw bounding boxes on certain objects in an image?

Is it possible to use GCP Machine Learning products to train a model to draw bounding boxes on certain objects in an image? I'd like to be able to feed labeled images and have it predict where that label would belong.
I think you are looking for something like this, where the TensorFlow machine learning library is used:
https://cloud.google.com/solutions/creating-object-detection-application-tensorflow
A note:
When you say that you want to be able to feed labeled images and have it predict where that label would belong, I assume you mean where that object is present in the image, in terms of bounding box coordinates. If so, then the library should take care of that for you; your job is just to train the network with your labeled images.

Vision Framework with ARKit and CoreML

While I have been researching best practices and experimenting with multiple options for an ongoing project (i.e. a Unity3D iOS project in Vuforia with native integration, extracting frames with AVFoundation, then passing the image through cloud-based image recognition), I have come to the conclusion that I would like to use ARKit, the Vision framework, and CoreML; let me explain.
I am wondering how I would capture ARFrames and use the Vision framework to detect and track a given object using a CoreML model.
Additionally, it would be nice to have a bounding box once the object is recognized, with the ability to add an AR object upon a touch gesture, but that is something that could be implemented after getting the core project solid.
This is undoubtedly possible, but I am unsure of how to pass the ARFrames to CoreML via Vision for processing.
Any ideas?
Update: Apple now has a sample code project that does some of these steps. Read on for those you still need to figure out yourself...
Just about all of the pieces are there for what you want to do... you mostly just need to put them together.
You obtain ARFrames either by periodically polling the ARSession for its currentFrame or by having them pushed to your session delegate. (If you're building your own renderer, that's ARSessionDelegate; if you're working with ARSCNView or ARSKView, their delegate callbacks refer to the view, so you can work back from there to the session to get the currentFrame that led to the callback.)
ARFrame provides the current capturedImage in the form of a CVPixelBuffer.
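For example, here's a minimal sketch of the delegate route (the type names are placeholders, and it assumes an ARSCNView called sceneView whose session is already running):

    import ARKit

    class FrameReceiver: NSObject, ARSessionDelegate {
        // ARKit calls this once per captured frame.
        func session(_ session: ARSession, didUpdate frame: ARFrame) {
            // The camera image arrives as a CVPixelBuffer, ready to hand to Vision.
            let pixelBuffer: CVPixelBuffer = frame.capturedImage
            _ = pixelBuffer // placeholder: this is where you'd pass the buffer to Vision (next step)
        }
    }

    // In your view controller's setup, keep a strong reference to the receiver, then:
    // let frameReceiver = FrameReceiver()
    // sceneView.session.delegate = frameReceiver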
You pass images to Vision for processing using either the VNImageRequestHandler or VNSequenceRequestHandler class, both of which have methods that take a CVPixelBuffer as an input image to process.
You use the image request handler if you want to perform a request that uses a single image — like finding rectangles or QR codes or faces, or using a Core ML model to identify the image.
You use the sequence request handler to perform requests that involve analyzing changes between multiple images, like tracking an object's movement after you've identified it.
You can find general code for passing images to Vision + Core ML attached to the WWDC17 session on Vision, and if you watch that session the live demos also include passing CVPixelBuffers to Vision. (They get pixel buffers from AVCapture in that demo, but if you're getting buffers from ARKit the Vision part is the same.)
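As a rough sketch of the single-image path (the function and parameter names are placeholders; pass in whatever MLModel you've bundled in your project):

    import Vision
    import CoreML

    func classify(pixelBuffer: CVPixelBuffer, with model: MLModel) {
        // Wrap the Core ML model for use with Vision.
        guard let visionModel = try? VNCoreMLModel(for: model) else { return }

        let request = VNCoreMLRequest(model: visionModel) { request, _ in
            // For a classifier model, the results are VNClassificationObservation values.
            if let best = request.results?.first as? VNClassificationObservation {
                print("Scene is probably \(best.identifier), confidence \(best.confidence)")
            }
        }

        // The image request handler takes the CVPixelBuffer from ARFrame directly.
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        try? handler.perform([request])
    }

A VNSequenceRequestHandler is used much the same way, except you keep one instance alive and call its perform(_:on:) method with each new pixel buffer so it can carry tracking state from frame to frame.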
One sticking point you're likely to have is identifying/locating objects. Most "object recognition" models people use with Core ML + Vision (including those that Apple provides pre-converted versions of on their ML developer page) are scene classifiers. That is, they look at an image and say, "this is a picture of a (thing)," not something like "there is a (thing) in this picture, located at (bounding box)".
Vision provides easy API for dealing with classifiers — your request's results array is filled in with VNClassificationObservation objects that tell you what the scene is (or "probably is", with a confidence rating).
If you find or train a model that both identifies and locates objects — and for that part, I must stress, the ball is in your court — using Vision with it will result in VNCoreMLFeatureValueObservation objects. Those are sort of like arbitrary key-value pairs, so exactly how you identify an object from those depends on how you structure and label the outputs from your model.
If you're dealing with something that Vision already knows how to recognize, instead of using your own model — stuff like faces and QR codes — you can get the locations of those in the image frame with Vision's API.
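For instance (sketch only), the built-in face detector returns VNFaceObservation values whose boundingBox is normalized to the image:

    import Vision

    func detectFaces(in pixelBuffer: CVPixelBuffer) {
        let request = VNDetectFaceRectanglesRequest { request, _ in
            guard let faces = request.results as? [VNFaceObservation] else { return }
            for face in faces {
                // boundingBox is normalized (0...1), with the origin at the lower left.
                print("Face at \(face.boundingBox)")
            }
        }
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
        try? handler.perform([request])
    }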
If after locating an object in the 2D image, you want to display 3D content associated with it in AR (or display 2D content, but with said content positioned in 3D with ARKit), you'll need to hit test those 2D image points against the 3D world.
Once you get to this step, placing AR content with a hit test is something that's already pretty well covered elsewhere, both by Apple and the community.
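Still, to make the hand-off concrete, here's a minimal sketch of that last step (sceneView is an assumed ARSCNView, and point is assumed to already be in view coordinates; converting Vision's normalized, lower-left-origin coordinates to view coordinates is glossed over here):

    import ARKit
    import SceneKit

    func placeMarker(at point: CGPoint, in sceneView: ARSCNView) {
        // Hit test the 2D screen point against planes and feature points ARKit knows about.
        let results = sceneView.hitTest(point, types: [.existingPlaneUsingExtent, .featurePoint])
        guard let result = results.first else { return }

        // Drop a small sphere at the hit location in world space.
        let node = SCNNode(geometry: SCNSphere(radius: 0.01))
        node.simdTransform = result.worldTransform
        sceneView.scene.rootNode.addChildNode(node)
    }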

Object recognition of a set of objects

In a computer vision project, the image I want to process can be partitioned into "zones" containing multiple products of the same kind.
Provided that I can retrieve image information of all the possible kinds of product, I need to detect which kind is present in each zone, without the need to detect the position of each single product. In summary, I need to recognize "sets of products".
As additional info, the products do not have a rigid shape, they are not all oriented in the same way, and the luminosity changes (so I am basically searching for shape-, orientation- and luminosity-invariant approaches).
The reliable information I can exploit is that the product logos, or parts of them, are often visible and the products are quite colorful.
I would like to know about possible approaches that exploit the fact that I know the zone partition, as well as approaches that do not exploit it.

Design of virtual trial room

As part of my master's project, I proposed to build a virtual trial room application intended for retail clothing stores. Currently it's meant to be used directly in the store, though it may be extended to online stores as well.
This application will show customers how a selected piece of apparel would look on them by showing it on their 3D replica on screen.
It involves 3 steps:
Sizing up the customer
Building a 3D humanoid replica of the customer
Applying simulated clothes to the model
My question is about the feasibility of the project and choice of framework.
Can this be achieved in real time using a normal desktop computer? If yes, what would be an appropriate framework (hardware, software, programming language, etc.) for this purpose?
From the work I have done so far, I was planning to achieve the above steps in the following ways:
for step 1: option a) two cameras for front and side views, or
option b) one or two Kinects for complete 3D data
for step 2: either use MakeHuman (http://www.makehuman.org/) code to build a customised 3D model from the above data, or build everything from scratch; I'm unsure about the framework.
for step 3: I just need a few cloth samples, so I thought of building the simulated clothes in Blender.
Currently I have just a vague idea about the different pieces, but I am not sure how to develop the complete application.
Theoretically this can be achieved in real time. Many useful algorithms for video tracking, stereo vision, and 3D reconstruction are available in the OpenCV library, but it's very difficult to build a robust solution. For example, you'll probably need to track the human body as it moves from frame to frame and perform pose estimation (OpenCV contains the POSIT algorithm); however, it's not trivial to eliminate noise in the resulting object coordinates. For inspiration, have a look at published work on video tracking.
You might want to choose another way: simplify some things, avoid the complicated stuff, do things less dynamically, and estimate only the clothing size and the approximate human location. In that case you will most likely create something useful and interesting.
I've lost the link to one online fitting room where hand and body detection were implemented. Using a Kinect solves many problems, but if for some reason you won't use one, then AR (augmented reality) can help you (yet another fitting room).

Scene graph implementation for Papervision?

I'm trying to use Papervision for Flash for this project of mine, which involves a 3D model of a mechanical frame consisting of several connected parts. Movement of one of the parts results in a corresponding change in the orientation and position of the other parts of the frame.
My understanding is that using a scene graph to handle this kind of linked movement would be the ideal way to go, at least, if I were to implement in one of the more established 3D development options, like OpenGL or DirectX.
My question is, is there an existing scene graph implementation for Papervision? Or, an alternative way to generate the required 3D motion?
Thanks!
I thought Papervision was basically a Flash-based 3D rendering engine, and should therefore contain its own scene graph.
See org.papervision3d.scenes.Scene3D in the API.
And see this article for a lengthier explanation of the various objects in Papervision. One thing you can do is google for articles with the key objects in P3D, such as EngineManager, Viewport3D, BasicRenderEngine, Scene3D and Camera3D.
As for "generating the motion", it depends on what you are trying to achieve exactly. Either you code that up and alter the scene yourself, or use a third-party library like a physics library so as to not have to code all that up yourself.
You can honestly build one in the time it would take you to search for one:
Create a class called Node that holds an array of child nodes and has a virtual method Render(matrix:Matrix).
Create a subclass of Node called TransformNode which takes a reference to a matrix.
Create a subclass of Node called ModelNode which takes a reference to a model.
The Render method of TransformNode multiplies the incoming matrix with its own, then calls the render method of its children with the resulting matrix.
The Render method of ModelNode sends its model off to the renderer at the location specified by the incoming matrix.
That's it. You can enhance things further with a BoundsNode that doesn't call its children if its bounding shape is not visible in the viewing frustum.
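To make that concrete, here is an illustrative sketch of the same structure (written in Swift with simd matrices purely for brevity; in Papervision itself you'd write the equivalent in ActionScript, and the model reference here is just a placeholder string):

    import simd

    class Node {
        var children: [Node] = []
        // Base nodes simply forward the incoming matrix to their children.
        func render(_ matrix: simd_float4x4) {
            for child in children { child.render(matrix) }
        }
    }

    class TransformNode: Node {
        var local: simd_float4x4
        init(local: simd_float4x4) { self.local = local }
        // Combine the parent's matrix with this node's own transform, then recurse.
        override func render(_ matrix: simd_float4x4) {
            super.render(matrix * local)
        }
    }

    class ModelNode: Node {
        var model: String   // placeholder for a real mesh/model reference
        init(model: String) { self.model = model }
        override func render(_ matrix: simd_float4x4) {
            // A real implementation would hand the model and its accumulated
            // world matrix to the renderer; here we just print them.
            print("draw \(model) at \(matrix)")
            super.render(matrix)
        }
    }

    // Usage: changing the TransformNode's matrix moves every part attached below it.
    let root = Node()
    let arm = TransformNode(local: matrix_identity_float4x4)
    arm.children.append(ModelNode(model: "armMesh"))
    root.children.append(arm)
    root.render(matrix_identity_float4x4)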