I have a set of simple rigid 3D objects that I wish to detect and recognize in an image (let's say 5 to 10 classes). The objects are simple in the sense that they are single-color cylinders, rectangles with simple patterns (stripes, for example), or similarly simple shapes. The objects are significantly different from one another (there aren't, for example, two classes where one is a large cylinder and the other is the same cylinder but smaller).
Because the textures are pretty simple (solid colors and/or simple patterns), a bag-of-words approach fails (the objects do not contain a significant number of unique edges).
While one possible approach is to code each classifier manually (manual feature extraction, etc.), is there a simple data-driven approach (a Haar/LBP classifier, for example) that would work? If Haar or LBP features are suitable for this problem, how would one handle the unknown relative viewpoint (and the resulting perspective distortion, rotation, etc.)? Would just providing positive images of an object from all possible viewpoints converge, or is there something else that's usually done? The detection and recognition should run in real time.

Based on your description of your problem, I see several drawbacks of a Haar or LBP-based detector. First, these features do not use color, which seems to be important here. Second, a classifier using Haar or LBP features is sensitive to in-plane and out-of-plane rotation. If your objects can be in any 3D orientation, you would need to discretize the range of 3D rotations and train a separate detector for each one. For example, for face detection you typically use two detectors: one for frontal faces, and one for profile faces. Finally, if there is not enough texture for bag-of-words, there also may not be enough texture for Haar or LBP.
Since your objects are simple 3D shapes, I would start by trying to detect straight lines and circles using the Hough transform, and trying to group them to form the object's outlines.
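To make the Hough idea concrete, here is a minimal pure-Python sketch of the line-voting step. It assumes you already have a list of edge pixel coordinates; the function name hough_lines and the sample points are made up for illustration. In practice you would more likely use a library routine such as OpenCV's HoughLines() or HoughCircles().

```python
import math
from collections import Counter

def hough_lines(points, theta_step_deg=1):
    """Vote in (theta, rho) space for every edge point.

    Each point (x, y) votes for all lines rho = x*cos(theta) + y*sin(theta)
    that pass through it; collinear points pile up in the same cell.
    """
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    return votes

# Hypothetical edge points: a horizontal line y = 5 plus two stray points.
edge_points = [(x, 5) for x in range(0, 100, 5)] + [(13, 40), (70, 22)]
votes = hough_lines(edge_points)
(best_theta, best_rho), count = votes.most_common(1)[0]
print(best_theta, best_rho, count)
```

The strongest cell gives the normal angle and offset of the dominant line. Grouping several such lines (and circles) into object outlines is a separate step that this sketch does not attempt.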


I'm using the ORB algorithm to detect the rope crossings shown in the image and get their coordinates; a crossing is marked by the red dot. I want to detect the coordinates of the four points surrounding the crossing, marked by the blue dots. All four points are at the same distance from the red dot.
Any idea how to get their coordinates using the red dot's coordinates?
Thank you in advance.
Although you're using ORB, you're still going to need an algorithm to segment the rope from the background, or at least some technique to identify image chunks that belong to the rope and that are equidistant from the red dot. There are a number of options to explore.
It's important to consider your lighting & imaging as separate problems to be solved if this is meant to be a real-world application. This looks a bit like a problem for a class rather than for an application you'll sell and support, but you should still consider lighting:
Will your algorithm(s) still work when light level is reduced?
How will detection be affected by changes in camera pose relative to the surface where the rope will be located?
If you'll be detecting "black" rope, will the algorithm also be required to detect rope of different colors? dirty rope? rope on different backgrounds?
Since your object of interest is rope, you have to consider a class of algorithms suitable for detection of non-rigid objects. Always consider the simplest solution first!
Connected Components
Connected components labeling is a traditional image processing algorithm and still suitable as the starting point for many applications. The last I knew, the closest thing in OpenCV was findContours(), which traces the outlines of blobs; newer versions (3.0 and later) also provide connectedComponents() for the labeling itself. This can also be called "blob finding" or some variant thereof.
Depending on lighting, you may have to take different steps to binarize the image before running connected components. As a start, convert the color image to grayscale, which will simplify the task significantly.
Try a manual threshold since you can quickly test a number of values to see the effect. Don't be too discouraged if the binarization isn't quite right--this can often be fixed with preprocessing.
If a range of manual thresholds works (e.g. 52 - 76 in an 8-bit grayscale range), then use an algorithm that will automatically calculate the threshold for you: Otsu, entropy-based methods, etc., will all offer comparable performance. Whichever technique works best, the code/algorithm can be tweaked further to optimize for your rope application.
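As an illustration of automatic threshold selection, here is a minimal sketch of Otsu's method on a flat list of 8-bit grayscale values. The function name and the sample pixel values are assumptions made for the example; a library implementation would normally be used instead.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]            # pixels with value <= t (dark class)
        if w_bg == 0:
            continue
        w_fg = total - w_bg        # pixels with value > t (bright class)
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Hypothetical data: dark rope pixels near 40, light background near 200.
pixels = [38, 40, 42, 45] * 50 + [190, 200, 210] * 100
t = otsu_threshold(pixels)
rope_mask = [p <= t for p in pixels]   # True where the pixel is "rope"
```

With a clearly bimodal histogram like this one, any threshold between the two clusters separates them, which is the same situation as a wide range of manual thresholds working.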
If thresholding and binarization don't work--which for your rope application seems unlikely, at least how you've presented it--then switch to thinking in terms of gradient-based (edge-based, energy-based) techniques.
But assuming you can separate the rope from the background, you're still going to need a method to start at the red dot [within the rope] and move equal distances out to the blue points. More about that later after a discussion of other rope segmentation methods.
Note: connected components labeling can work in scenarios beyond just binarizing black & white images. If you can create a texture field or some other 2D representation of the image that makes it possible to distinguish the black rope from the relatively light background, you may be able to use a connected components algorithm. (Finding a "more complicated" or "more modern" algorithm isn't necessarily going to be the right approach.)
In a binarized image, blobs can be nested: on a white background you can have several black blobs, inside of one or more of which are white blobs, inside of which are black blobs, etc. An earlier version of OpenCV handled this reasonably well. (OpenCV is a nice starting point, and a touchpoint for many, but for a number of reasons it doesn't always compare favorably to other open source and commercial packages; popularity notwithstanding, OpenCV has some issues.)
Once you have a "blob" (a 4-connected region of pixels) in a 2D digital image, you can treat the blob as an object, at which point you have a number of options:
Edge tracing: trace around the inside and outside edges of the blob. From what I recall, OpenCV does (or at least should) have some relatively straightforward method to get the edges.
Split the blob into component blobs, each of which can be treated separately
Convert the blob to a polygon
A connected components algorithm should be high on the list of techniques to try if you have a non-rigid object.
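For reference, a minimal sketch of 4-connected labeling on a small binary mask (1 = rope, 0 = background). The function name and the toy mask are made up for illustration, and a real image would call for a library routine rather than nested Python lists.

```python
from collections import deque

def label_components(mask):
    """Label 4-connected regions of truthy pixels; returns (labels, count)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue
            count += 1                      # start a new blob at this seed
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

mask = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
labels, count = label_components(mask)
```

Each blob gets its own integer label, so later steps (edge tracing, splitting, polygon conversion) can work on one blob at a time.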
Boolean Operations
Once you have the rope as a connected component (and possibly even without this), you can use boolean image operations to find the spots at the blue dots in your image:
Create a circular region in data, or even in the image
Find the intersection of the circle (an annulus) and the black region representing the rope. Using your original image, you should have four regions.
Find the center point of the intersection regions.
You could even try this without using connected components at all, but using connected components as part of the solution could make it more robust.
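The three steps above can be sketched as follows, assuming a binary rope mask and a known red-dot position. The function name, the ring thickness, and the synthetic plus-shaped "rope" are all illustrative assumptions, not something taken from your image.

```python
import math
from collections import deque

def ring_rope_centroids(rope, center, radius, thickness=1.5):
    """Intersect a thin ring around `center` with the rope mask and return
    the centroid (row, col) of each 8-connected intersection region."""
    cy, cx = center
    hits = set()
    for r, row in enumerate(rope):
        for c, value in enumerate(row):
            on_ring = abs(math.hypot(r - cy, c - cx) - radius) <= thickness / 2
            if value and on_ring:
                hits.add((r, c))
    centroids = []
    while hits:
        seed = hits.pop()
        region, queue = [seed], deque([seed])
        while queue:                       # flood-fill one intersection region
            y, x = queue.popleft()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    n = (y + dy, x + dx)
                    if n in hits:
                        hits.remove(n)
                        region.append(n)
                        queue.append(n)
        centroids.append((sum(p[0] for p in region) / len(region),
                          sum(p[1] for p in region) / len(region)))
    return centroids

# Synthetic test image: two 3-pixel-wide strands crossing at (20, 20).
size = 41
rope = [[1 if abs(r - 20) <= 1 or abs(c - 20) <= 1 else 0
         for c in range(size)] for r in range(size)]
blue_points = sorted(ring_rope_centroids(rope, (20, 20), radius=12))
```

For a clean crossing this yields four centroids, one per rope arm. If the ring also clips a nearby loop of rope you will get extra regions, which is where having the rope as a connected component helps with filtering.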
Polygon Simplification
If you have a blob, which in your application would be a connected set of black pixels representing the rope on the floor, then you can consider converting this blob to one or more polygons for further processing. There are advantages to working with polygons.
If you consider only the outside boundary of the rope, then you can see that the set of pixels defining the boundary represents a polygon. It's a polygon with a lot of points, and not a convex polygon, but a polygon nonetheless.
To simplify the polygon, you can use an algorithm such as Ramer-Douglas-Peucker (in OpenCV, approxPolyDP() implements it).
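A minimal recursive sketch of the simplification, for an open polyline of (x, y) points. The sample outline and the epsilon value are arbitrary; for a closed boundary you would typically split it at two far-apart points and simplify each half.

```python
import math

def perpendicular_distance(pt, start, end):
    """Distance from pt to the infinite line through start and end."""
    (x, y), (x1, y1), (x2, y2) = pt, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)

def rdp(points, epsilon):
    """Simplify a polyline, keeping only points farther than epsilon
    from the chord joining the current end points."""
    if len(points) < 3:
        return list(points)
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = rdp(points[:index + 1], epsilon)
    right = rdp(points[index:], epsilon)
    return left[:-1] + right           # drop the duplicated split point

outline = [(0, 0), (1, 0.1), (2, -0.1), (3, 0), (4, 3), (5, 0.05), (6, 0)]
simplified = rdp(outline, epsilon=0.5)
```

Small wiggles below epsilon are discarded while the spike at (4, 3) survives, which is the behavior you want when reducing a pixel-level rope boundary to a manageable polygon.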
Once you have a simplified polygon, you can try a few techniques to extract useful data from it:
Angle Bisector Network
Triangulation (e.g. using ear clipping)
Triangulation is typically dependent on initial conditions, so the resulting triangulation can differ for slightly different polygons (that is, for small changes anywhere along rope -> blob -> polygon -> simplified polygon). Even so, in your application it might be useful to triangulate the dark rope region, and then to connect the center of each triangle to the center of the nearest neighboring triangle. You'll also have to deal with crossings, such as the rope overlap. Ultimately this can yield a "skeletonization" of the rope. Speaking of which...
If the rope problem was posed to you as a class exercise, then it may have been a prompt to try skeletonization. You can read about it here:
Skeletonization and thinning have their own problems to solve, but you should dig into them a bit and see those problems for yourself.
The Medial Axis Transform (MAT) is a related concept. Long story there.
Edge-based techniques
There are a number of techniques to generate "edge images" based on edge strength, energy, entropy, etc. Making them robust takes a little effort. If you've had academic training in image processing you've likely heard of Harris, Sobel, Canny, and similar processing methods--none are magic bullets, but they're simple and dependable and will yield data you need.
An "edge image" consists of pixels representing the image gradient strength [and sometimes the gradient direction]. People may call this edge image something else, but it's the concept that matters.
What you then do with the edge data is another subject altogether. But one reason to think of edge images (or at least object borders) is that it reduces the amount of information your algorithm(s) will need to process.
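As a small example of what an "edge image" is, here is a sketch that computes the Sobel gradient magnitude on a tiny grayscale image stored as nested lists. The function name and the synthetic step-edge image are assumptions for illustration only.

```python
import math

def sobel_magnitude(img):
    """Gradient magnitude for interior pixels; border pixels are left at 0."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Horizontal derivative: right column minus left column.
            gx = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r][c - 1] - img[r + 1][c - 1])
            # Vertical derivative: bottom row minus top row.
            gy = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r - 1][c] - img[r - 1][c + 1])
            out[r][c] = math.hypot(gx, gy)
    return out

# Synthetic image: dark on the left, bright on the right (a vertical edge).
img = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_magnitude(img)
```

Only the pixels next to the dark-to-bright transition get a nonzero response; the flat regions stay at zero, which is the reduction in data mentioned above.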
Mean Shift (and related)
To get back to segmentation mentioned in the section on connected components, there are other methods for segmenting figures from a background: K-means, mean shift, and so on. You probably won't need any of those, but they're neat and worth studying.
Stroke Width Transform
This is an intriguing technique used to extract text from noisy backgrounds. Although it's intended for OCR, it could work for rope since the rope width is relatively constant, the rope shape varies, there are crossings, etc.
In short, and simplifying quite a bit, you can think of SWT as a means to find "strokes" (thick lines) by finding gradients antiparallel to each other. On either side of a stroke (or line), the edge gradient points normal to the object edge. The normal on one side of the stroke points opposite the direction of the normal on the other side of the stroke. By filtering for pixel-gradient pairs within a certain distance of each other, you can isolate certain strokes--even automatically. For your example the collection of points representing edge pairs for the rope would be much more common than other point pairs.
Non-Rigid Matching
There are techniques for matching non-rigid shapes, but they would not be worth exploring. If any of the techniques I mentioned above is unfamiliar to you, explore some of those first before you try any fancier algorithms.
CNNs, machine learning, etc.
Just don't even think of these methods as a starting point.
Other Considerations
If this were an application for industry, security, or whatnot, you'd have to determine how well your image processing worked under all environmental considerations. That's not an easy task, and can make all the difference between a setup that "works" in the lab and a setup that actually works in practice.
I hope that's of some help. Feel free to post a reply if I've confused more than helped, or if you want to explore some idea in more detail. Though I tried to touch on some common(ish) techniques, I didn't mention all the different ways of addressing this problem.
And briefly: once you have a skeleton, point network, or whatever representing a reduced data set for the rope and the red dot (the identified feature), a few techniques to find the items at the blue dots:
For a skeleton, trace along each "branch" of the rope outward from the known point (the red dot) until the geodesic distance or straight-line 2D distance equals the distance D that you want.
To use geometry, create a circle of width 1 - 2 pixels. Find the intersection of that circle and the rope. Find the center point of the intersections of circle and rope. (Also described above.)
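The skeleton-tracing variant might look like the sketch below. It assumes a one-pixel-wide skeleton stored as a binary grid with the red dot lying on it; the function name and the plus-shaped test skeleton are hypothetical. Path length is accumulated with Dijkstra's algorithm so that diagonal steps count as sqrt(2).

```python
import heapq
import math

def points_at_geodesic_distance(skeleton, start, target):
    """Walk outward from `start` along skeleton pixels (8-connected) and
    return the first pixel on each branch whose path length reaches `target`."""
    rows, cols = len(skeleton), len(skeleton[0])
    dist = {start: 0.0}
    heap = [(0.0, start)]
    found = []
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if d > dist[(y, x)]:
            continue                      # stale queue entry
        if d >= target:
            found.append((y, x))          # stop here; do not walk further
            continue
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if ((dy or dx) and 0 <= ny < rows and 0 <= nx < cols
                        and skeleton[ny][nx]):
                    nd = d + math.hypot(dy, dx)
                    if nd < dist.get((ny, nx), math.inf):
                        dist[(ny, nx)] = nd
                        heapq.heappush(heap, (nd, (ny, nx)))
    return found

# Hypothetical skeleton: two one-pixel-wide strands crossing at (10, 10).
size = 21
skeleton = [[1 if r == 10 or c == 10 else 0 for c in range(size)]
            for r in range(size)]
blue_points = sorted(points_at_geodesic_distance(skeleton, (10, 10), target=5))
```

Because the distance is measured along the rope rather than in a straight line, this version keeps working when the rope curves sharply near the crossing, where the circle method can land on the wrong strand.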
Good luck!

Is there any method to create a polygon (not a rectangle) around an object in an image for object recognition?
Please refer to the following images:
the result I am looking for
the original image
I am not looking for bounding rectangles like this. I know the concepts of transfer learning, using pre-trained models for object recognition, and other object detection concepts.
The main aim is object detection, but with results given as a tighter-fitting polygon instead of a bounding box. Links to resources or papers would be helpful.
Here is a very simple (and a bit hacky) idea, but it might help: take a per-pixel scene labeling algorithm, e.g. SegNet, and then turn the resulting segmented image into a binary image, where the white pixels are the class of interest (in your example, white for cars and black for the rest). Now compute edges. You can add those edges to the original image to obtain a result similar to what you want.
What you want is called image segmentation, which is different from object detection. The best-performing methods for common object classes (e.g. cars, bikes, people, dogs, ...) do this using trained CNNs, usually called semantic segmentation networks. This will, in theory, give you regions in your image corresponding to the object you want. After that you can fit an enclosing polygon using what is called the convex hull.
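As a sketch of the last step, here is Andrew's monotone chain convex hull applied to the (x, y) coordinates of the pixels a segmentation network labeled as the object. The pixel list is made up for illustration. Note that a convex hull covers concave parts of an object too, so tracing the mask boundary itself gives a tighter polygon when that matters.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in order,
    counter-clockwise in a y-up coordinate frame."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # end points are shared, drop duplicates

# Hypothetical (x, y) coordinates of pixels labeled as the object of interest.
mask_pixels = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2), (3, 2), (2, 0)]
hull = convex_hull(mask_pixels)
```

The returned vertices can be drawn directly as a closed polygon over the original image.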

I am stitching together multiple images with arbitrary 3D views of a planar surface. I have some estimation of which images overlap and a coarse estimate of each pairwise homography between pairs of overlapping images. However, I need to refine my homographies by minimizing the global error across all images.
I have read a few different papers with various methods for doing this, and I think the best way would be to use a non-linear optimization such as Levenberg–Marquardt, ideally in a fast way that is sparse and/or parallel.
Ideally I would like to use an existing library such as sba or pba, but I am really confused as to how to limit the calculation to just estimating the eight parameters of the homography rather than the full 3 dimensions for both camera pose and object position. I also found this handy explanation by Szeliski (see section 5.1 on page 50) but again, the math is all for a rotating camera rather than a flat surface.
How do I use L-M to minimize the global error for a set of homographies? Is there a speedy way to do this with existing bundle adjustment libraries?
Note: I cannot use methods that rely on rotation-only camera motion (such as those in OpenCV) because those cannot accurately estimate camera poses, and I also cannot use full 3D reconstruction methods (such as SfM) because those have too many parameters, which results in non-planar point clouds. I definitely need something specific to a full 8-parameter homography. Camera intrinsics don't really matter because I am already correcting for those in an earlier step.
Thanks for your help!

I am using OpenCV with C++ and am a new user. I am interested in object detection problems. So far I have studied and implemented sparse optical flow (the Lucas-Kanade method) on video from a stationary camera. After trying k-means and background subtraction, I have decided to move on to a more difficult problem: the moving camera.
I have studied some documentation and found that I could use cv::findHomography to find the inliers and outliers across the sequence of frames in my video, and then work out from the returned values which movement is caused by camera motion and which by object motion. In addition, I could use SURF features to track some objects and then decide which of them are good points.
However, I was wondering how I could put this theory into practice. For example, should I use the first frame as ground truth, detect some features using SURF, and then use findHomography for each frame in the rest of the video? Any ideas or help are welcome!
Detecting moving objects from a moving camera is quite a challenging task and requires a solid understanding of multiple view geometry. There is also less information available on this topic (than, for example, on structure from motion), so be warned!
Anyway, a homography matrix will not be a good choice for detecting moving objects (unless you are 100% sure that your background can be represented accurately enough by a flat surface). You should probably use a fundamental matrix or a trifocal tensor.
The fundamental matrix is computed from point correspondences between 2 frames. It associates points in one image with lines in the other image (so-called epipolar lines), and in this way it is independent of scene structure. After you have obtained the F matrix using some robust estimation method, like RANSAC or LMedS (RANSAC seems to be better for this kind of task), you can calculate the reprojection error for each point (in practice, its distance from the corresponding epipolar line). Objects that move independently of the scene will not be accurately described by the F matrix and will have a bigger error. So, outliers of the F matrix calculated from image matches over two frames can be considered moving objects. One note though: objects that move along epipolar lines will not be detected by this approach, since their parallax can also be explained by some depth level.
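To illustrate the outlier test, here is a minimal sketch that measures how far each matched point lies from its epipolar line and flags the ones beyond a threshold. It assumes F has already been estimated (for example with OpenCV's findFundamentalMat); the F matrix, the matches, and the 2-pixel threshold below are made-up values chosen so the result is easy to check by hand.

```python
import math

def epipolar_distance(F, p1, p2):
    """Distance (in pixels) from p2 to the epipolar line F * p1 in image 2."""
    x1 = (p1[0], p1[1], 1.0)
    a, b, c = (sum(F[i][j] * x1[j] for j in range(3)) for i in range(3))
    return abs(a * p2[0] + b * p2[1] + c) / math.hypot(a, b)

def flag_moving_points(F, matches, threshold=2.0):
    """True for matches that violate the epipolar constraint."""
    return [epipolar_distance(F, p1, p2) > threshold for p1, p2 in matches]

# Hypothetical F for a camera translating purely along x: epipolar lines
# are horizontal, so static points keep their y coordinate between frames.
F = [[0, 0, 0],
     [0, 0, -1],
     [0, 1, 0]]
matches = [((10, 20), (14, 20)),     # static: slid along its epipolar line
           ((50, 60), (53, 60.5)),   # static, with a little noise
           ((30, 40), (31, 48))]     # left its epipolar line: likely moving
moving = flag_moving_points(F, matches)
```

A point that moves along its own epipolar line would pass this test no matter how far it moved, which is exactly the ambiguity described above.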
The trifocal tensor does not have the depth/motion ambiguity for objects that move along epipolar lines, but it is harder to estimate and it is not included in OpenCV. It can be calculated from correspondences over 3 frames, and its usage can be conceptually described as triangulating a point from 2 views and then calculating the reprojection error in a third view.
As for the matching, I still think that LK tracking will be better than SURF matching if you work with video sequences, since in that case you don't need to consider very distant points as matches, and tracking is usually faster than detection plus matching.

I'm trying to detect a shape (a cross) in my input video stream with the help of OpenCV. Currently I'm thresholding to get a binary image of my cross, which works pretty well. Unfortunately, my algorithm for deciding whether the extracted blob is a cross doesn't perform very well. As you can see in the image below, not all corners are detected under certain perspectives.
I'm using findContours() and approxPolyDP() to get an approximation of my contour. If I'm detecting 12 corners / vertices in this approximated curve, the blob is assumed to be a cross.
Is there any better way to solve this problem? I thought about SIFT, but the algorithm has to perform in real-time and I read that SIFT is not really suitable for real-time.
I have a couple of suggestions that might provide some interesting results although I am not certain about either.
If the cross is always near the center of your image and always lies on a planar surface you could try to find a homography between the camera and the plane upon which the cross lies. This would enable you to transform a sample image of the cross (at a selection of different in plane rotations) to the coordinate system of the visualized cross. You could then generate templates which you could match to the image. You could do some simple pixel agreement tests to determine if you have a match.
Alternatively you could try to train a Haar-based classifier to recognize the cross. This type of classifier is often used in face detection and detects oriented edges in images, classifying faces by the relative positions of several oriented edges. It has good classification accuracy on faces and is extremely fast. Although I cannot vouch for its accuracy in this particular situation it might provide some good results for simple shapes such as a cross.
Computing the convex hull and then taking advantage of the convexity defects might work.
All crosses should have four convexity defects, making up four sets of two points, or four vectors. Furthermore, if your shape is a cross, then these four vectors will form two pairs of supplementary angles.
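A rough sketch of the defect count, working on an ordered list of polygon vertices such as the output of approxPolyDP(). The helper names, the min_depth value, and the sample cross are assumptions; OpenCV's convexHull() and convexityDefects() provide the same information directly from a contour.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain convex hull (collinear points are dropped)."""
    pts = sorted(set(points))
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for seq, chain in ((pts, lower), (reversed(pts), upper)):
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
    return lower[:-1] + upper[:-1]

def line_distance(p, a, b):
    """Distance from p to the line through a and b."""
    num = abs((b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]))
    return num / math.hypot(b[0] - a[0], b[1] - a[1])

def count_convexity_defects(polygon, min_depth=1.0):
    """Count pockets between consecutive hull vertices deeper than min_depth."""
    hull = set(convex_hull(polygon))
    n = len(polygon)
    start = next(i for i, p in enumerate(polygon) if p in hull)
    defects, i = 0, 0
    while i < n:
        a = polygon[(start + i) % n]           # hull vertex opening the gap
        j, pocket = i + 1, []
        while polygon[(start + j) % n] not in hull:
            pocket.append(polygon[(start + j) % n])
            j += 1
        b = polygon[(start + j) % n]           # hull vertex closing the gap
        if pocket and max(line_distance(p, a, b) for p in pocket) >= min_depth:
            defects += 1
        i = j
    return defects

# Hypothetical 12-vertex cross, listed in order around the outline.
cross_shape = [(2, 0), (4, 0), (4, 2), (6, 2), (6, 4), (4, 4),
               (4, 6), (2, 6), (2, 4), (0, 4), (0, 2), (2, 2)]
is_cross_candidate = count_convexity_defects(cross_shape) == 4
```

Counting defects is more tolerant of a missed corner than requiring exactly 12 vertices, though the angle check on the four defect vectors would still be needed to reject other four-pocket shapes.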