I need to detect a human in a video in real time. I guess it's not much different from detecting a human in a static image (except that video frames are usually much lower resolution). Can you point me in some direction?
I don't have any experience in the computer vision field, so any link, article, or video that could give me an introduction would be useful.
Any help is appreciated.
Thanks.
One of the most famous methods for human detection is the Histogram of Oriented Gradients (HoG) detector. This has been implemented in the OpenCV library and should be a good starting point.
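For a quick start, here is a minimal sketch (my own, not from the original answer) using OpenCV's built-in HOG descriptor with its default people detector, run frame by frame on a video; the file name and the downscaling factor are placeholders you would tune for your footage:

```python
import cv2

# HOG descriptor with OpenCV's pre-trained linear-SVM people detector
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cap = cv2.VideoCapture("input.mp4")   # placeholder video file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # downscaling the frame speeds up detection considerably on video
    frame = cv2.resize(frame, None, fx=0.5, fy=0.5)
    rects, weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)
    for (x, y, w, h) in rects:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```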
One way is to use HOG features. This first method is time-consuming, but it is a very successful human detection algorithm. The second way is to optimize the HOG algorithm by resizing the image; this yields a more than two-fold speed-up in detecting humans in an image, while sacrificing some detection accuracy, mainly when persons are at the edges. The third way consists in adapting Haar features for human detection; this solution significantly reduces the computational cost at the expense of precision. To evaluate the proposed method, a top-view human database was established; experimental results demonstrated the effectiveness and efficiency of the proposed algorithm, which gives a good accuracy/execution-time ratio.
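As a rough illustration of the Haar-cascade route mentioned above, here is a small sketch using OpenCV's bundled full-body cascade (not the paper's own top-view model, which as far as I know is not public); the frame path is a placeholder:

```python
import cv2

# OpenCV ships a pre-trained full-body Haar cascade with its data files
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_fullbody.xml")

frame = cv2.imread("frame.png")                      # placeholder frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
people = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
for (x, y, w, h) in people:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
```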
In the Pose Estimation Using Associative Embedding technique, I still don't have clarity on how we can group the detected points from heatmaps into individual human poses using the associative embedding layer. Is there any code that clearly gives an idea of this? I'm using the EfficientHRNet approach for pose estimation.
I extracted keypoints from the heatmaps and need to group those points into individual poses using the embedding layer output.
From OpenVINO perspective, we could offer:
This model: human-pose-estimation-0007
This IE demo: Human Pose Estimation Python* Demo
This model utilizes the Associative Embedding technique.
However, if you want to build it from scratch, you'll need to design your own Deep Learning architecture, implement and train the neural network.
This research paper might give you some insight into the things you need to decide (e.g. batch size, optimization algorithm, learning rate, etc.).
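For intuition on the grouping step itself, here is a rough, simplified sketch (not EfficientHRNet's or OpenVINO's actual post-processing) of greedy grouping with associative-embedding tags: keypoints whose tag values are close are assigned to the same person. The function name, data layout, and threshold are illustrative assumptions:

```python
import numpy as np

def group_keypoints(keypoints_per_joint, tag_threshold=1.0):
    """keypoints_per_joint[j] is a list of (x, y, score, tag) tuples for joint j."""
    people = []  # each person: {"tags": [...], "joints": {joint_id: (x, y, score)}}
    for j, candidates in enumerate(keypoints_per_joint):
        for (x, y, score, tag) in candidates:
            # find the existing person whose mean tag is closest to this keypoint's tag
            best, best_dist = None, tag_threshold
            for person in people:
                if j in person["joints"]:
                    continue                      # person already has this joint
                dist = abs(tag - np.mean(person["tags"]))
                if dist < best_dist:
                    best, best_dist = person, dist
            if best is None:                      # no sufficiently close person: start a new one
                best = {"tags": [], "joints": {}}
                people.append(best)
            best["tags"].append(tag)
            best["joints"][j] = (x, y, score)
    return people
```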
Has anybody tried developing a SLAM system that uses deep learned features instead of the classical AKAZE/ORB/SURF features?
Scanning recent Computer Vision conferences, there seem to be quite a few reports of successful usage of neural nets to extract features and descriptors, and benchmarks indicate that they may be more robust than their classical computer vision equivalent. I suspect that extraction speed is an issue, but assuming one has a decent GPU (e.g. NVidia 1050), is it even feasible to build a real-time SLAM system running say at 30FPS on 640x480 grayscale images with deep-learned features?
This was a bit too long for a comment, so that's why I'm posting it as an answer.
I think it is feasible, but I don't see how this would be useful. Here is why (please correct me if I'm wrong):
In most SLAM pipelines, precision is more important than long-term robustness. You obviously need your feature detections/matchings to be precise to get reliable triangulation/bundle adjustment (or whatever equivalent scheme you might use). However, the high level of robustness that neural networks provide is only required by systems that do relocalization/loop closure over long time intervals (e.g. systems that need to relocalize across different seasons). Even in such scenarios, since you already have a GPU, I think it would be better to use a photometric (or even just geometric) model of the scene for localization.
We don't have any reliable noise models for the features detected by neural networks. I know there have been a few interesting works (Gal, Kendall, etc.) on propagating uncertainties in deep networks, but these methods still seem a bit immature for deployment in SLAM systems.
Deep learning methods are usually good for initializing a system, and the solution they provide needs to be refined. Their results depend too much on the training dataset and tend to be "hit and miss" in practice. So I think you could trust them for an initial guess, or for some constraints (e.g. in the case of pose estimation: if you have a geometric algorithm that drifts over time, you can use the results of a neural network to constrain it; but the absence of a noise model mentioned above will make the fusion a bit difficult here).
So yes, I think it is feasible and that, with careful engineering and tuning, you can probably produce a few interesting demos, but I wouldn't trust it in real life.
I'm working on a project related to people detection. I successfully implemented both an HOG+SVM based classifier (with libSVM) and a cascade classifier (with OpenCV). The SVM classifier works really well: I tested it over a number of videos and it correctly detects people with really few false positives and few false negatives. The problem here is the computational time: nearly 1.2-1.3 s over the entire image and 0.2-0.4 s over the foreground patches. Since I'm working on a project that must work in a nearly real-time environment, I switched to the cascade classifier (to get lower computational time).
So I trained many different cascade classifiers with OpenCV (opencv_traincascade). The output is good in terms of computational time (0.2-0.3 s over the entire image, a lot less when run only over the foreground), so I achieved that goal, let's say. The problem here is the quality of detection: I'm getting a lot of false positives and a lot of false negatives. Since the only difference between the two methods is the base classifier used in OpenCV (decision trees or decision stumps; in any case no SVM, as far as I understand), I'm starting to think that my problem could be the base classifier (in some way, HOG features are best separated with hyperplanes, I guess).
Of course, the dataset used in libSVM and OpenCV is exactly the same, both for training and for testing... For the sake of completeness, I used nearly 9 thousand positive samples and nearly 30 thousand negative samples.
Here my two questions:
Is it possible to change the base weak learner in the opencv_traincascade function? If yes, is the SVM one of the possible choices? If both answers are yes, how can I do such a thing? :)
Are there other computer vision or machine learning libraries that implement the SVM as a weak classifier and have some methods to train a cascade classifier? (Are these libraries suitable to be used in conjunction with OpenCV?)
thank you in advance as always!
Marco.
In principle a weak classifier can be anything, but the strength of Adaboost related methods is that they are able to obtain good results out of simple classifiers (they are called “weak” for a reason).
Using an SVM in an AdaBoost cascade is a contradiction: the former has no need to be used in such a framework (it can do its job by itself), and the latter is fast precisely because it takes advantage of weak classifiers.
Furthermore, I don't know of any study about it and OpenCV doesn't support it: you would have to write the code yourself. It is a huge undertaking and you probably won't get any interesting results.
Anyway, if you think that HOG features are better suited for your task, OpenCV's traincascade has an option for them, apart from Haar and LBP.
As to your second question, I’m not sure but quite confident that the answer is negative.
My advice is: try to get the most you can from traincascade, for example try to increase the number of samples if you can and compare the results.
This paper is quite good. It basically says that an SVM can be treated as a weak classifier if you train it on fewer samples (say, less than half of the training set). Samples with higher boosting weights have a higher chance of being used to train the 'weak' SVM.
The source code is unfortunately not widely available. If you want a quick prototype, use Python's scikit-learn and see if you can get desirable results before modifying OpenCV.
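As a rough prototype of the paper's idea (boosting "weak" linear SVMs, each trained on a small subsample drawn according to the current boosting weights), something like the following scikit-learn sketch could be a starting point; all parameter values are illustrative guesses, not values from the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC

def boosted_weak_svms(X, y, n_rounds=10, subsample=0.3, C=0.01, seed=0):
    """Discrete AdaBoost with LinearSVC weak learners; y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.full(n, 1.0 / n)              # boosting weights over the training samples
    ensemble = []
    for _ in range(n_rounds):
        # high-weight (hard) samples are more likely to enter the subsample
        idx = rng.choice(n, size=int(subsample * n), replace=True, p=w)
        clf = LinearSVC(C=C).fit(X[idx], y[idx])   # assumes both classes are drawn
        pred = clf.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err >= 0.5:                   # no better than chance: stop boosting
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        ensemble.append((alpha, clf))
        w *= np.exp(-alpha * y * pred)   # upweight misclassified samples
        w /= w.sum()
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(alpha * clf.predict(X) for alpha, clf in ensemble))
```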
I'm trying to classify digits read from images at known positions in C++, using an SVM.
For that, I sample over a rectangle at the known position of the digit and train with a ground truth.
I wonder how to choose the kernel of the SVM. I use the default linear kernel, but my intuition tells me that it might not be the best choice.
How could I choose the kernel?
You will need to tune the kernel (if you use a nonlinear one). This guide may be useful for you: A practical guide to SVM classification
Unfortunately there is not a magic bullet for this, so experimentation is your best friend.
I would probably start with RBF, which tends to work decently in most cases, and I agree with your intuition that linear is probably not the best, although sometimes (especially when you have tons of data) it can give you good surprises :)
The problem I have found with RBF is that it tends to overfit the training set. This stops being an issue if you have a lot of data, but then a new problem arises, because RBF tends to scale poorly and training becomes slow for big datasets.
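In practice that experimentation usually boils down to a cross-validated grid search over kernels and their parameters. A minimal sketch in Python/scikit-learn (rather than C++, purely to illustrate the search; the bundled digits dataset stands in for your own samples):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# compare a linear and an RBF kernel over a small grid of C / gamma values
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [1, 10, 100], "svc__gamma": ["scale", 0.01, 0.001]},
]
search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```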
I need the 3 foundational / most-cited papers regarding MEAN SHIFT, OPTICAL FLOW, and the KALMAN FILTER.
I've searched IEEE Xplore; it showed many related papers.
Any idea?
Thanks in advance.
Do you know about CiteSeerX?
For Mean Shift I get Mean shift: A robust approach toward feature space analysis, which is a very good paper on that topic.
For the other topics I cannot help you, but you can generally find good papers by reading papers and following their references.
These are old, unsolved, yet classic Computer Vision problems:
Mean Shift
Mean shift: A robust approach toward feature space analysis [same as bjoernz], but in practice I would prefer a completely different unsupervised segmentation work from Felzenszwalb et al., Efficient Graph-Based Image Segmentation (faster + better).
Optical Flow
Sparse (reliable points): Good Features to Track is a nice summary of what is called the KLT literature (for Kanade-Lucas-Tomasi... poor Jianbo Shi). In a nutshell, some points (corners) in your images are easier to track than others, for example those in uniform regions.
Dense (a flow vector for each pixel): the historical Horn-Schunck paper, but check out the recent works of Thomas Brox and Jitendra Malik, and also what Ce Liu has published.
Kalman filter: the historical paper, but I do not think it is still cited a lot, because everybody seems to refer to their favorite textbooks instead.
For efficient implementations of almost all these nice articles: OpenCV to the rescue!
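For example, the KLT pipeline mentioned above (Shi-Tomasi "good features to track" plus pyramidal Lucas-Kanade) is only a few OpenCV calls; a minimal sketch, with the video path as a placeholder:

```python
import cv2

cap = cv2.VideoCapture("video.mp4")                  # placeholder video file
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
# detector: Shi-Tomasi "Good Features to Track" corners
p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # tracker: pyramidal Lucas-Kanade (KLT) sparse optical flow
    p1, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
    good = p1[status.flatten() == 1]
    prev_gray, p0 = gray, good.reshape(-1, 1, 2)
```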
Caveat: Machine Learning people, who are very trendy in Computer Vision these days, are sometimes confused by the word "features". Indeed, one can distinguish:
Detectors: select sparse points in the image (corners, e.g. Hessian, Harris...)
Descriptors: describe these points, and also the image as a whole through concatenation
Feature spaces: a fancy way to describe their kernel-SVM stuff for recognition
For example, SIFT is both a detector and a descriptor technique although it is called a feature.
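To make the detector/descriptor distinction concrete, a tiny sketch with SIFT in OpenCV (in the main package from OpenCV 4.4 on; earlier versions need the contrib build); the image path is a placeholder:

```python
import cv2

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)      # placeholder image
sift = cv2.SIFT_create()
keypoints = sift.detect(img, None)                        # detector: where the sparse points are
keypoints, descriptors = sift.compute(img, keypoints)     # descriptor: a 128-D vector per point
print(len(keypoints), descriptors.shape)
```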