Underlying papers for MEAN SHIFT, OPTICAL FLOW, KALMAN FILTER - computer-vision

I need the three underlying / most seminal papers regarding MEAN SHIFT, OPTICAL FLOW, and KALMAN FILTER.
I've searched IEEE Xplore, but it shows many related papers.
Any idea?
Thanks in advance.

Do you know about CiteSeerX?
For Mean Shift I get "Mean shift: A robust approach toward feature space analysis", which is a very good paper on that topic.
For the other topics I cannot help you, but you generally find good papers by reading papers and looking at the references.

These are old but classic Computer Vision problems, still not fully solved:
Mean Shift
"Mean shift: A robust approach toward feature space analysis" [same as bjoernz suggested], but in practice I would prefer a completely different unsupervised segmentation work from Felzenszwalb et al., "Efficient Graph-Based Image Segmentation" (faster + better)
Optical Flow
Sparse reliable points: "Good Features to Track" is a nice summary of what is called the KLT literature (for Kanade-Lucas-Tomasi ... poor Jianbo Shi). In a nutshell, some points in your images (corners) are easier to track than others (points in uniform regions, for example).
Dense, for each pixel: the historical Horn-Schunck paper, but check out the recent works of Thomas Brox and Jitendra Malik, and also what Ce Liu has published.
Kalman filter: there is the historical paper ("A New Approach to Linear Filtering and Prediction Problems", Kalman 1960), but I do not think it is still cited a lot, because everybody seems to refer to their favorite textbook instead.
For efficient implementations of almost all these nice articles: OpenCV to the rescue!
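To see what that looks like, here is a minimal Python sketch of OpenCV's versions of all three (the C++ API is analogous; the file names and parameter values are just placeholder assumptions):

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (placeholder file names)
prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# KLT, sparse: Shi-Tomasi corners ("Good Features to Track") ...
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=100, qualityLevel=0.3, minDistance=7)
# ... tracked by pyramidal Lucas-Kanade into the next frame
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None)
tracked = p1[status.flatten() == 1]  # keep only successfully tracked points

# Dense flow, one (u, v) vector per pixel (Farneback's algorithm here,
# in the Horn-Schunck spirit of a dense field)
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# Kalman filter with a constant-velocity model for one 2-D point
kf = cv2.KalmanFilter(4, 2)  # state (x, y, vx, vy), measurement (x, y)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-4
prediction = kf.predict()
kf.correct(tracked[0].reshape(2, 1).astype(np.float32))  # assumes >= 1 track

# Mean shift, segmentation flavor, in a single call
segmented = cv2.pyrMeanShiftFiltering(cv2.imread("frame0.png"), sp=21, sr=51)
```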
Caveat: Machine Learning people, who are very trendy in Computer Vision these days, are sometimes confused by the word "features". Indeed, one can distinguish:
Detectors: select sparse points in the image (corners, e.g. Hessian, Harris, ...)
Descriptors: describe these points (and also the whole image, through concatenation)
Feature space: a fancy way to describe their kernel-SVM machinery for recognition
For example, SIFT is both a detector and a descriptor technique, although it is simply called a "feature".
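To make the detector/descriptor distinction concrete, here is how it shows up in OpenCV's SIFT interface (a sketch; assumes OpenCV >= 4.4, where SIFT lives in the main module, and a placeholder image path):

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)  # placeholder test image
sift = cv2.SIFT_create()

kps = sift.detect(img, None)          # detector: where the interest points are
kps, descs = sift.compute(img, kps)   # descriptor: one 128-D vector per point
# ... or both steps at once:
kps, descs = sift.detectAndCompute(img, None)
```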

Related

Pose Estimation Using Associative Embedding technique

Regarding pose estimation using the Associative Embedding technique, I still don't have clarity on how we can group the detected points from heatmaps into individual human poses using the associative embedding layer. Is there any code that clearly illustrates this? I'm using the EfficientHRNet approach for pose estimation.
I have extracted keypoints from the heatmaps and need to group those points into individual poses using the embedding layer output.
From an OpenVINO perspective, we can offer:
This model: human-pose-estimation-0007
This IE demo: Human Pose Estimation Python* Demo
This model utilizes the Associative Embedding technique.
However, if you want to build it from scratch, you'll need to design your own Deep Learning architecture and implement and train the neural network.
This research paper might give you some insight into the things you need to decide (e.g. batch size, optimization algorithm, learning rate, etc.).
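If it helps, here is a deliberately simplified sketch of the grouping step from the Associative Embedding paper (Newell et al.): each detected peak carries a scalar tag sampled from the embedding map at its location, and detections are greedily assigned to the pose whose mean tag is closest. The real implementation does Hungarian matching per joint and mixes tag distance with detection scores; all names and the threshold below are illustrative assumptions.

```python
import numpy as np

def group_poses(peaks_per_joint, tag_threshold=1.0):
    """Greedy grouping, simplified from "Associative Embedding" (Newell et al.).

    peaks_per_joint: list over joint types; each entry is a list of
        (x, y, score, tag) tuples taken from heatmap peaks plus the
        embedding map sampled at the same locations (assumed inputs).
    Returns a list of poses; each pose maps joint index -> (x, y, score).
    """
    poses = []  # each pose: {"joints": {j: (x, y, s)}, "tags": [tag, ...]}
    for j, peaks in enumerate(peaks_per_joint):
        for (x, y, score, tag) in peaks:
            # Compare this detection's tag with the mean tag of each pose.
            best, best_dist = None, tag_threshold
            for pose in poses:
                if j in pose["joints"]:
                    continue  # this pose already has a joint of this type
                dist = abs(tag - np.mean(pose["tags"]))
                if dist < best_dist:
                    best, best_dist = pose, dist
            if best is None:  # nothing close enough in tag space: new person
                best = {"joints": {}, "tags": []}
                poses.append(best)
            best["joints"][j] = (x, y, score)
            best["tags"].append(tag)
    return [p["joints"] for p in poses]
```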

Problems in computer vision that use optimization method in graph theory?

I am supposed to give a presentation on optimization algorithms on graphs. I am also very interested in computer vision, and I hope to combine the two in my presentation. Can you suggest some topics in computer vision which are solved by optimization methods in graph theory (e.g. shortest-path, maximum flow, matching, etc.)? The newer the better.
There was an enormous amount of work done in the late '90s and early '00s using graph-cut methods in Computer Vision. This is a good starting point: https://en.wikipedia.org/wiki/Graph_cuts_in_computer_vision
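As a toy illustration of the graph-cut flavor, here is binary segmentation posed as an s-t minimum cut on a pixel grid. This is only a sketch using networkx to keep it short; real systems use dedicated max-flow solvers (e.g. Boykov-Kolmogorov), and the foreground-probability input is assumed to come from some appearance model:

```python
import networkx as nx
import numpy as np

def graph_cut_segment(img, fg_prob, smoothness=2.0):
    """img: 2-D grayscale array; fg_prob: per-pixel foreground probability
    (assumed given). Returns a boolean foreground mask."""
    h, w = img.shape
    G = nx.DiGraph()
    eps = 1e-6
    for y in range(h):
        for x in range(w):
            p = (y, x)
            # Unary terms: cutting s->p labels p background, p->t foreground,
            # so each capacity is the negative log-likelihood of that label.
            G.add_edge("s", p, capacity=-np.log(1 - fg_prob[y, x] + eps))
            G.add_edge(p, "t", capacity=-np.log(fg_prob[y, x] + eps))
            # Pairwise terms: penalize cutting between similar neighbors
            for q in [(y + 1, x), (y, x + 1)]:
                if q[0] < h and q[1] < w:
                    wgt = smoothness * np.exp(
                        -abs(float(img[p]) - float(img[q])) / 10.0)
                    G.add_edge(p, q, capacity=wgt)
                    G.add_edge(q, p, capacity=wgt)
    _, (src_side, _) = nx.minimum_cut(G, "s", "t")
    mask = np.zeros((h, w), bool)
    for node in src_side:
        if node != "s":
            mask[node] = True  # source side of the cut = foreground
    return mask
```

On real images this pure-Python construction is far too slow; it is only meant to show the reduction of segmentation to max-flow/min-cut.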

SLAM system that uses deep learned features?

Has anybody tried developing a SLAM system that uses deep learned features instead of the classical AKAZE/ORB/SURF features?
Scanning recent Computer Vision conferences, there seem to be quite a few reports of successful use of neural nets to extract features and descriptors, and benchmarks indicate that they may be more robust than their classical computer vision equivalents. I suspect that extraction speed is an issue, but assuming one has a decent GPU (e.g. NVidia 1050), is it even feasible to build a real-time SLAM system running at, say, 30 FPS on 640x480 grayscale images with deep-learned features?
This was a bit too long for a comment, so I'm posting it as an answer.
I think it is feasible, but I don't see how this would be useful. Here is why (please correct me if I'm wrong):
In most SLAM pipelines, precision is more important than long-term robustness. You obviously need your feature detections/matchings to be precise to get reliable triangulation/bundle adjustment (or whatever equivalent scheme you might use). However, the high level of robustness that neural networks provide is only required by systems that do relocalization/loop closure over long time intervals (e.g. needing to relocalize across different seasons, etc.). Even in such scenarios, since you already have a GPU, I think it would be better to use a photometric (or even just geometric) model of the scene for localization.
We don't have any reliable noise models for the features that are detected by neural networks. I know there have been a few interesting works (Gal, Kendall, etc...) on propagating uncertainties in deep networks, but these methods seem a bit immature for deployment in SLAM systems.
Deep learning methods are usually good for initializing a system, and the solution they provide needs to be refined. Their results depend too much on the training dataset and tend to be "hit and miss" in practice. So I think you could trust them for an initial guess, or for some constraints (e.g. in the case of pose estimation: if you have a geometric algorithm that drifts over time, you can use the results of a neural network to constrain it; but the absence of a noise model, as mentioned previously, will make the fusion a bit difficult here...).
So yes, I think that it is feasible, and that you can probably, with careful engineering and tuning, produce a few interesting demos, but I wouldn't trust it in real life.
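To make the last point about fusion concrete, here is a toy 1-D version of "use the network to constrain a drifting geometric estimate", written as a scalar Kalman update. Every number is an assumption; in particular the measurement variance R stands in for exactly the noise model we said is missing:

```python
x, P = 0.0, 0.0  # fused position estimate and its variance
q = 0.01         # assumed per-step drift variance of the geometric odometry
R = 0.5          # *guessed* variance of the network's absolute estimate:
                 # this is the missing noise model mentioned above

def propagate(x, P, dx, q=q):
    """Integrate one step of relative (drifting) geometric motion."""
    return x + dx, P + q  # uncertainty grows without bound: drift

def fuse(x, P, z, R=R):
    """Scalar Kalman update with an absolute NN measurement z."""
    K = P / (P + R)  # gain: trusts the net more as the drift accumulates
    return x + K * (z - x), (1 - K) * P

# Usage: drift for a while, then an occasional NN estimate arrives
for dx in [0.1] * 50:
    x, P = propagate(x, P, dx)
x, P = fuse(x, P, z=5.2)  # the network says "you are at 5.2"
```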

choosing a kernel for digit recognition in C++

I'm trying to classify digits read from images at known positions in C++, using an SVM.
For that, I sample a rectangle at the known position of the digit and train against a ground truth.
I wonder how to choose the kernel of the SVM. I use the default linear kernel, but my intuition tells me that it might not be the best choice.
How should I choose the kernel?
You will need to tune the kernel (if you use a nonlinear one). This guide may be useful for you: A Practical Guide to Support Vector Classification
Unfortunately there is no magic bullet for this, so experimentation is your best friend.
I would probably start with RBF, which tends to work decently in most cases, and I agree with your intuition that linear is probably not the best, although sometimes (especially when you have tons of data) it can give you good surprises :)
The problem I have found with RBF is that it tends to overfit the training set. This stops being an issue if you have a lot of data, but then a new problem arises: RBF tends to scale poorly, with slow training times on big data.
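In practice, "experimentation" usually means a cross-validated grid search over C and gamma, which is exactly what the guide above recommends. A sketch with scikit-learn's digits dataset as a stand-in for your sampled rectangles (OpenCV users can get a similar effect with cv::ml::SVM::trainAuto in C++):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_digits(return_X_y=True)  # stand-in for your digit patches

# Exponentially spaced grids over C (and gamma for RBF), as the guide suggests
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100],
     "gamma": ["scale", 0.001, 0.01, 0.1]},
]
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```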

The pros and cons of BRIEF and ORB compared to SIFT

I am doing some research on local feature representations, so SIFT, SURF and such.
Now, has anybody here ever tried BRIEF and ORB? If so, can you discuss some of their pros and cons with respect to SIFT?
Here is one comparison I have found helpful. Essentially, BRIEF and ORB are much faster. There is no good comparison of scale invariance there, but personally I have found SURF/SIFT to be more scale invariant than BRIEF and ORB. If you are going to use these for a specific use case, I recommend trying both to see which meets your needs best.
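If you want to see the speed difference on your own images, here is a small sketch (assumes OpenCV >= 4.4 for SIFT in the main module, plus a placeholder image path; note that the binary BRIEF/ORB descriptors should be matched with Hamming distance rather than L2):

```python
import time
import cv2

img = cv2.imread("test.png", cv2.IMREAD_GRAYSCALE)  # placeholder test image

for name, det, norm in [("ORB", cv2.ORB_create(), cv2.NORM_HAMMING),
                        ("SIFT", cv2.SIFT_create(), cv2.NORM_L2)]:
    t0 = time.perf_counter()
    kps, descs = det.detectAndCompute(img, None)
    print(f"{name}: {len(kps)} keypoints in "
          f"{(time.perf_counter() - t0) * 1000:.1f} ms")
    # Match with the norm that fits the descriptor type (binary vs. float):
    matcher = cv2.BFMatcher(norm, crossCheck=True)
```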
SURF/SIFT are patented, and the licenses need to be paid for somehow. I'm not up to date on this, but the costs can be significant. So I would go with ORB if possible, except of course if you don't care about money :)
SIFT: The algorithm is patented in the US; the owner is the University of British Columbia. (http://en.wikipedia.org/wiki/Scale-invariant_feature_transform)
SURF: An application of the algorithm is patented in the US. (http://en.wikipedia.org/wiki/SURF)