I'm trying to create a neural network in C++ with OpenCV. The aim is to recognize road signs. I have created the network as shown below, but it predicts badly and returns strange results:
Sample images from the training selection look like this:
Can someone help?
void trainNN() {
    char* templates_directory[] = {
        "speed50ver1\\",
        "speed60ver1\\",
        "speed70ver1\\",
        "speed80ver1\\"
    };
    int const numFilesChars[] = { 213, 100, 385, 163 };
    char const strCharacters[] = { '5', '6', '7', '8' };
    Mat trainingData;
    Mat trainingLabels(0, 0, CV_32S);
    int const numCharacters = 4;

    // load images from directory
    for (int i = 0; i != numCharacters; ++i) {
        int numFiles = numFilesChars[i];
        DIR *dir;
        struct dirent *ent;
        char* s1 = templates_directory[i];
        if ((dir = opendir(s1)) != NULL) {
            Size size(80, 80);
            while ((ent = readdir(dir)) != NULL) {
                string s = s1;
                s.append(ent->d_name);
                if (s.substr(s.find_last_of(".") + 1) == "jpg") {
                    Mat img = imread(s, 0);
                    Mat img_mat;
                    resize(img, img_mat, size);
                    Mat new_img = img_mat.reshape(1, 1);
                    trainingData.push_back(new_img);
                    trainingLabels.push_back(i);
                }
            }
            closedir(dir);
        } else {
            /* could not open directory */
            perror("");
        }
    }
    trainingData.convertTo(trainingData, CV_32FC1);

    Mat trainClasses(trainingData.rows, numCharacters, CV_32FC1);
    for (int i = 0; i != trainClasses.rows; ++i) {
        int const labels = *trainingLabels.ptr<int>(i);
        auto train_ptr = trainClasses.ptr<float>(i);
        for (int k = 0; k != trainClasses.cols; ++k) {
            *train_ptr = k != labels ? 0 : 1;
            ++train_ptr;
        }
    }

    int layers_d[] = { trainingData.cols, 10, numCharacters };
    Mat layers(1, 3, CV_32SC1, layers_d);
    ann.create(layers, CvANN_MLP::SIGMOID_SYM, 1, 1);

    CvANN_MLP_TrainParams params = CvANN_MLP_TrainParams(
        // terminate the training after either 1000
        // iterations or a very small change in the
        // network weights below the specified value
        cvTermCriteria(CV_TERMCRIT_ITER + CV_TERMCRIT_EPS, 1000, 0.000001),
        // use backpropagation for training
        CvANN_MLP_TrainParams::BACKPROP,
        // coefficients for backpropagation training
        // (refer to manual)
        0.1,
        0.1);
    int iterations = ann.train(trainingData, trainClasses, cv::Mat(), cv::Mat(), params);

    CvFileStorage* storage = cvOpenFileStorage("neural_network_2.xml", 0, CV_STORAGE_WRITE);
    ann.write(storage, "digit_recognition");
    cvReleaseFileStorage(&storage);
}
void analysis(char* file, bool a) {
    //trainNN(a);
    read_nn();
    // load image
    Mat img = imread(file, 0);
    Size my_size(80, 80);
    resize(img, img, my_size);
    Mat r_img = img.reshape(1, 1);
    r_img.convertTo(r_img, CV_32FC1);

    Mat classOut(1, 4, CV_32FC1);
    ann.predict(r_img, classOut);
    double min1, max1;
    cv::Point min_loc, max_loc;
    minMaxLoc(classOut, &min1, &max1, &min_loc, &max_loc);
    int x = max_loc.x;

    // create windows
    namedWindow("Original Image", CV_WINDOW_AUTOSIZE);
    imshow("Original Image", img);
    waitKey(0); // wait for key press

    img.release();
    r_img.release();
    destroyAllWindows(); // destroy all open windows
}
Strange results: for this input the answer is 3 (I have only 4 classes - speed limit 50, 60, 70, 80 - and index 3 corresponds to the 80 sign), so it's correct for the speed limit 80 sign.
But for the rest of the inputs the results are incorrect. They are the same for the 50, 60 and 70 signs: max1 = min1 = 1.02631... (as in the first picture). It's strange.
I have adapted your code to train a classifier on 4 hand positions (since that's the image data I have). I kept your logic as similar as possible, only changing what was absolutely necessary to make it run on my Windows machine on my images. Long story short, there is nothing fundamentally wrong with your code - I don't see the failure mode you described.
One thing you left out was the code for read_nn(). I assume that just does something like the following:
ann.load("neural_network_2.xml");
Anyway, my suspicion is that either your neural network is not converging at all or it's badly overfitting. Perhaps there's not enough variation in the training data. Are you running analysis() on separate test data that the ANN wasn't trained on? If so, is the ANN able to predict training data properly at least?
EDIT: OK, I just downloaded your image data and tried it out and saw the same behavior. After some analysis, it looks like your ANN is not converging. The training operation exits after only about 250 iterations, even if you specify only CV_TERMCRIT_ITER for the cvTermCriteria. After increasing your hidden layer size from 10 to 20, I saw a marked improvement, with successful classification on the training data for 212, 72, 94, and 143 of the images, respectively, for the classes (50, 60, 70, and 80). That's not very good, but it demonstrates that you're on the right track.
Basically, the network architecture is not expressive enough to adequately model the problem you're trying to solve, so the network weights never converge and it abandons the backprop early. For one class, you may see some success, but I believe that's largely a function of the lack of shuffling of training data. If it stops after having just trained on a couple hundred very similar images, it may be able to manage to classify those correctly.
In short, I would recommend doing the following:
Build a way to test the results - e.g.: create a function to run prediction on all training data, and ideally set aside some images as a validation set in order to also confirm that the model is not overfitting the training data.
Shuffle the training data prior to training; otherwise, backprop will not converge as easily (see the sketch after this list).
Experiment with different architectures such as more than one hidden layer with varying sizes.
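For the shuffling step, here is a minimal sketch (assuming the trainingData and trainClasses Mats built in trainNN() above): permute the row order of both Mats jointly before calling ann.train().
// build a random permutation of row indices (std::random_shuffle is in <algorithm>)
std::vector<int> idx(trainingData.rows);
for (int i = 0; i < trainingData.rows; ++i) idx[i] = i;
std::random_shuffle(idx.begin(), idx.end());
// reorder both Mats with the same permutation so each sample row
// stays aligned with its one-hot class row
Mat shuffledData, shuffledClasses;
for (size_t i = 0; i < idx.size(); ++i) {
    shuffledData.push_back(trainingData.row(idx[i]));
    shuffledClasses.push_back(trainClasses.row(idx[i]));
}
trainingData = shuffledData;
trainClasses = shuffledClasses;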
Really, this is a problem that would benefit dramatically from using a Convolutional Neural Net, but OpenCV's machine learning facilities are pretty limited. Ultimately, if you're serious about creating ANNs, you might want to investigate some more robust tools. I personally use TensorFlow, but I've heard good things about Theano as well.
I've only implemented NN with OpenCV for boolean classification, but I think that for a task where you need to classify more than two distinct classes this might also apply:
"If you are using the default cvANN_MLP::SIGMOID_SYM activation function then the output should be in the range [-1,1], instead of [0,1], for optimal results."
So, where you do:
*train_ptr = k != labels ? 0 : 1;
You might want to try:
*train_ptr = k != labels ? -1 : 1;
Disregard if I'm way off track here.
In my project I am calculating HOG features on the GPU at different levels of the same image. My aim is to detect the following objects:
1. Truck
2. Car
3. Person
The most important question is the selection of the window size in the case of a multi-class object detector. This post provides a very good base, but it does not answer how to select the window size in the multi-class case.
To solve this problem I calculated the HOG features of each positive image at different levels/resolutions, keeping the window size (48x96) the same, but the file for each image is around 600 MB, which is far too large.
Please let me know how to select the window size, block size and cell size in the case of multi-class object detection. Here is the code that I used to calculate the HOG features.
void App::run()
{
unsigned int count = 1;
FileStorage fs;
running = true;
//int width;
//int height;
Size win_size(args.win_width, args.win_width * 2);
Size win_stride(args.win_stride_width, args.win_stride_height);
cv::gpu::HOGDescriptor gpu_hog(win_size, Size(16, 16), Size(8, 8), Size(8, 8), 9,
cv::gpu::HOGDescriptor::DEFAULT_WIN_SIGMA, 0.2, gamma_corr,
cv::gpu::HOGDescriptor::DEFAULT_NLEVELS);
VideoCapture vc("/home/ubuntu/Desktop/getdescriptor/images/image%d.jpg");
Mat frame;
Mat Left;
Mat img_aux, img, img_to_show, img_new;
cv::Mat temp;
gpu::GpuMat gpu_img, descriptors, new_img;
char cbuff[20];
while (running)
{
vc.read(frame);
if (!frame.empty())
{
workBegin();
width = frame.rows;
height = frame.cols;
sprintf (cbuff, "%04d", count);
// Change format of the image
if (make_gray) cvtColor(frame, img_aux, CV_BGR2GRAY);
else if (use_gpu) cvtColor(frame, img_aux, CV_BGR2BGRA);
else Left.copyTo(img_aux);
// Resize image
if (args.resize_src) resize(img_aux, img, Size(args.width, args.height));
else img = img_aux;
img_to_show = img;
gpu_hog.nlevels = nlevels;
hogWorkBegin();
if (use_gpu)
{
gpu_img.upload(img);
new_img.upload(img_new);
fs.open(cbuff, FileStorage::WRITE);
for(int levels = 0; levels < nlevels; levels++)
{
gpu_hog.getDescriptors(gpu_img, win_stride, descriptors, cv::gpu::HOGDescriptor::DESCR_FORMAT_ROW_BY_ROW);
descriptors.download(temp);
//printf("size %d %d\n", temp.rows, temp.cols);
fs <<"level" << levels;
fs << "features" << temp;
cout<<"("<<width<<","<<height<<")"<<endl;
width = round(width/scale);
height = round(height/scale);
if( width < win_size.width || height < win_size.height )
break;
cout<<"Levels "<<levels<<endl;
resize(img,img_new,Size(width,height));
scale *= scale;
}
cout<<count<< " Image feature calculated !"<<endl;
count++;
//width = 640; height = 480;
scale = 1.05;
}
hogWorkEnd();
fs.release();
}
else running = false;
}
}
The window size should be chosen, s.t. the object(s) you want to detect fit into the window. If you want to have different window sizes for different types this might become tricky.
Usually what you do is the following
Take training data for each type of object, and train one model per object type using the features extracted at the known positions of the objects.
Then you take each test image and use a sliding-window approach to extract features at each location. These features are then compared to each model. If one of the models yields a score higher than a certain threshold, you have found this object. If more than one model scores higher than the threshold, simply take the one scoring highest.
If you want to use differently sized detection windows you will get feature vectors of different sizes (by the nature of the HOG features). The tricky thing is that in the testing phase you have to use as many sliding windows as there are object types. This would definitely work, but you have to process each test image several times, leading to higher processing time.
To answer your question about the sizes: there is no value I can give you, it always depends on your images. Using an image pyramid as you mentioned above is a good way to deal with differently scaled objects. The constraints are:
window size: the whole object should fit into it; it has to be divisible by the block size
block size: has to be divisible by the cell size (see the sketch below)
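As a minimal illustration of these constraints (the concrete values are just an example, not a recommendation), the sizes passed to the HOGDescriptor constructor could look like this:
// 48 and 96 are divisible by the block size 16, and 16 is divisible by the cell size 8
cv::Size winSize(48, 96);      // the whole object fits into the window
cv::Size blockSize(16, 16);
cv::Size blockStride(8, 8);
cv::Size cellSize(8, 8);
int nbins = 9;
cv::HOGDescriptor hog(winSize, blockSize, blockStride, cellSize, nbins);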
Sample code for the visualization of HOG features can be found here. It also helps in understanding what the feature vectors look like.
EDIT: I found out the hard way that only cv::Size(8,8) is allowed for the cell size. See the documentation.
I have a dataset of images and I want to group them based on content. What I have tried so far is to compute the median of each image, with the idea of grouping images into clusters based on their median values. How can I do that? Below is what I have tried so far. How can I cluster my images into groups? I Googled a lot about clustering, but the results were about clustering based on colors rather than grouping whole images. Can anyone provide an informative answer? Can I automatically cluster my dataset into groups based on the median, or some other technique?
from PIL import Image
import numpy as np
import os

Median = []
k = []

def get_imlist(path):
    return [os.path.join(path, f) for f in os.listdir(path) if f.endswith('.jpg')]

path = 'D:/Images/dataset'
imlist = get_imlist(path)

for file in imlist:
    head, tail = os.path.split(file)
    im = np.array(Image.open(file).convert('L'))
    m = np.median(im)
    M = [m, tail]
    print '.'
    Median.append(M)

Results = sorted(Median, key=lambda median: median[0])
print Results
k-means is a common method for clustering and is available in OpenCV: http://docs.opencv.org/modules/core/doc/clustering.html.
Before you cluster, it is recommended that you use a representation with a lower number of dimensions than the full n*m set of pixels. This is for two main reasons: robustness to noise, and reduction of the computational cost of the clustering process. The choice of representation may be critical to the perceived quality of the clusters, and will largely depend on your application. My current favorite is the GIST descriptor (C++: http://lear.inrialpes.fr/software, MATLAB: http://people.csail.mit.edu/torralba/code/spatialenvelope/). However, that is not in OpenCV, so here I will use a gray-level histogram, reducing the dimensionality from m*n to b = the number of bins.
Assuming a vector of gray level input images named frames.
//set up histogram
int histSize = 128;
float range[] = { 0, histSize } ;
const float* histRange = { range };
bool uniform = true; bool accumulate = false;
Mat_<float> dataHists;
cv::Mat grayImg;
Mat hist_i;
for(int i=0; i <frames.size(); i++)
{
grayImg =frames[i];
//histogram gray image
calcHist( &grayImg, 1, 0, Mat(), hist_i, 1, &histSize, &histRange, uniform, accumulate );
normalize(hist_i, hist_i, 0, hist_i.rows, NORM_MINMAX, -1, Mat() );
//transpose for feature vector
hist_i = hist_i.t();
//add to feature vectors for k-means
dataHists.push_back(cv::Mat(hist_i));
}
//k-means
int k = 100;
cv::Mat bestLabels;
cv::kmeans(dataHists,k,bestLabels,TermCriteria(),3,KMEANS_PP_CENTERS);
//have a look
vector<cv::Mat> clusterViz(bestLabels.rows);
for(int i=0;i<bestLabels.rows; i++)
{
clusterViz[bestLabels.at<int>(i)].push_back(cv::Mat(frames[i]));
}
namedWindow("clusters", WINDOW_NORMAL );
for(int i=0;i<clusterViz.size(); i++)
{
cv::imshow("clusters",clusterViz[i]);
cv::waitKey();
}
Hope this helps you.
I'm attempting to work with a depth sensor to add positional tracking to the Oculus Rift dev kit. However, I'm having trouble with the sequence of operations producing a usable result.
I'm starting with a 16 bit depth image, where the values sort of (but not really) correspond to millimeters. Undefined values in the image have already been set to 0.
First I'm eliminating everything outside a certain near and far distance by updating a mask image to exclude them.
cv::Mat result = cv::Mat::zeros(depthImage.size(), CV_8UC3);
cv::Mat depthMask;
depthImage.convertTo(depthMask, CV_8U);
for_each_pixel<DepthImagePixel, uint8_t>(depthImage, depthMask,
[&](DepthImagePixel & depthPixel, uint8_t & maskPixel){
if (!maskPixel) {
return;
}
static const uint16_t depthMax = 1200;
static const uint16_t depthMin = 200;
if (depthPixel < depthMin || depthPixel > depthMax) {
maskPixel = 0;
}
});
Next, since the feature I want is likely to be closer to the camera than the overall scene average, I update the mask again to exclude anything that isn't within a certain range of the median value:
const float depthAverage = cv::mean(depthImage, depthMask)[0];
const uint16_t depthMax = depthAverage * 1.0;
const uint16_t depthMin = depthAverage * 0.75;
for_each_pixel<DepthImagePixel, uint8_t>(depthImage, depthMask,
[&](DepthImagePixel & depthPixel, uint8_t & maskPixel){
if (!maskPixel) {
return;
}
if (depthPixel < depthMin || depthPixel > depthMax) {
maskPixel = 0;
}
});
Finally, I zero out everything that's not in the mask, and scale the remaining values to between 10 & 255 before converting the image format to 8 bit
cv::Mat outsideMask;
cv::bitwise_not(depthMask, outsideMask);
// Zero out outside the mask
cv::subtract(depthImage, depthImage, depthImage, outsideMask);
// Within the mask, normalize to the range + X
cv::subtract(depthImage, depthMin, depthImage, depthMask);
double minVal, maxVal;
minMaxLoc(depthImage, &minVal, &maxVal);
float range = depthMax - depthMin;
float scale = (((float)(UINT8_MAX - 10) / range));
depthImage *= scale;
cv::add(depthImage, 10, depthImage, depthMask);
depthImage.convertTo(depthImage, CV_8U);
The results looks like this:
I'm pretty happy with this section of the code, since it produces pretty clear visual features.
I'm then applying a couple of smoothing operations to get rid of the ridiculous amount of noise from the depth camera:
cv::medianBlur(depthImage, depthImage, 9);
cv::Mat blurred;
cv::bilateralFilter(depthImage, blurred, 5, 250, 250);
depthImage = blurred;
cv::Mat result = cv::Mat::zeros(depthImage.size(), CV_8UC3);
cv::insertChannel(depthImage, result, 0);
Again, the features look pretty clear visually, but I wonder if they couldn't be sharpened somehow:
Next I'm using canny for edge detection:
cv::Mat canny_output;
{
cv::Canny(depthImage, canny_output, 20, 80, 3, true);
cv::insertChannel(canny_output, result, 1);
}
The lines I'm looking for are there, but not well represented towards the corners:
Finally I'm using probabilistic Hough to identify lines:
std::vector<cv::Vec4i> lines;
cv::HoughLinesP(canny_output, lines, pixelRes, degreeRes * CV_PI / 180, hughThreshold, hughMinLength, hughMaxGap);
for (size_t i = 0; i < lines.size(); i++)
{
cv::Vec4i l = lines[i];
glm::vec2 a(l[0], l[1]);
glm::vec2 b(l[2], l[3]);
float length = glm::length(a - b);
cv::line(result, cv::Point(l[0], l[1]), cv::Point(l[2], l[3]), cv::Scalar(0, 0, 255), 3, CV_AA);
}
This results in this image
At this point I feel like I've gone off the rails, because I can't find a good set of parameters for Hough to produce a reasonable number of candidate lines in which to search for my shape, and I'm not sure if I should be fiddling with Hough or looking at improving the outputs of the prior steps.
Is there a good way of objectively validating my results at each stage, as opposed to just fiddling with the input values until I think it 'looks good'? Is there a better approach to finding the rectangle given the starting image (and given that it won't necessarily be oriented in a particular direction)?
Very cool project!
Though, I feel like your approach does not use all the information you could get from the depth map (e.g. 3D points, normals, etc.), which would help a lot.
The Point Cloud Library (PCL), which is a C++ library dedicated to the processing of RGB-D data, has a tutorial on plane segmentation using RANSAC which could inspire you. You might not want to use PCL in your program, due to the numerous dependencies, however as it is open-source, you can find the algorithm implementation on Github (PCL SAC segmentation). However, RANSAC might be slow and produce unwanted results depending on the scene.
You could also try the approach presented in "Real-Time Plane Segmentation using RGB-D Cameras" by Holz, Holzer, Rusu and Behnke, 2011 (PDF), which suggests fast normal estimation using integral images followed by plane detection using clustering of normals.
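For reference, here is a minimal sketch of PCL's RANSAC plane segmentation (this assumes the depth image has already been converted into a pcl::PointCloud<pcl::PointXYZ>::Ptr named cloud; the distance threshold is only an example value):
#include <pcl/point_types.h>
#include <pcl/ModelCoefficients.h>
#include <pcl/segmentation/sac_segmentation.h>

pcl::ModelCoefficients::Ptr coefficients(new pcl::ModelCoefficients);
pcl::PointIndices::Ptr inliers(new pcl::PointIndices);
pcl::SACSegmentation<pcl::PointXYZ> seg;
seg.setModelType(pcl::SACMODEL_PLANE);
seg.setMethodType(pcl::SAC_RANSAC);
seg.setDistanceThreshold(0.01);      // in the cloud's units, e.g. meters
seg.setInputCloud(cloud);            // 'cloud' is your converted depth data
seg.segment(*inliers, *coefficients);
// 'inliers' now indexes the points on the dominant plane and
// 'coefficients' holds the plane equation ax + by + cz + d = 0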
This is my code for training a dataset of, for example, vehicles. When it is fully trained, I want it to predict the data (a vehicle) from a video (.avi). How do I predict the trained data from the video, and how do I add that part to the code? I want it so that when a vehicle appears in the video it counts it as 1 and prints that the object is detected, and when a second vehicle comes it increments the count to 2.
IplImage *img2;
cout<<"Vector quantization..."<<endl;
collectclasscentroids();
vector<Mat> descriptors = bowTrainer.getDescriptors();
int count=0;
for(vector<Mat>::iterator iter=descriptors.begin();iter!=descriptors.end();iter++)
{
count += iter->rows;
}
cout<<"Clustering "<<count<<" features"<<endl;
//choosing cluster's centroids as dictionary's words
Mat dictionary = bowTrainer.cluster();
bowDE.setVocabulary(dictionary);
cout<<"extracting histograms in the form of BOW for each image "<<endl;
Mat labels(0, 1, CV_32FC1);
Mat trainingData(0, dictionarySize, CV_32FC1);
int k = 0;
vector<KeyPoint> keypoint1;
Mat bowDescriptor1;
//extracting histogram in the form of bow for each image
for(j = 1; j <= 4; j++)
for(i = 1; i <= 60; i++)
{
sprintf( ch,"%s%d%s%d%s","train/",j," (",i,").jpg");
const char* imageName = ch;
img2 = cvLoadImage(imageName, 0);
detector.detect(img2, keypoint1);
bowDE.compute(img2, keypoint1, bowDescriptor1);
trainingData.push_back(bowDescriptor1);
labels.push_back((float) j);
}
//Setting up SVM parameters
CvSVMParams params;
params.kernel_type = CvSVM::RBF;
params.svm_type = CvSVM::C_SVC;
params.gamma = 0.50625000000000009;
params.C = 312.50000000000000;
params.term_crit = cvTermCriteria(CV_TERMCRIT_ITER, 100, 0.000001);
CvSVM svm;
printf("%s\n", "Training SVM classifier");
bool res = svm.train(trainingData, labels, cv::Mat(), cv::Mat(), params);
cout<<"Processing evaluation data..."<<endl;
Mat groundTruth(0, 1, CV_32FC1);
Mat evalData(0, dictionarySize, CV_32FC1);
k = 0;
vector<KeyPoint> keypoint2;
Mat bowDescriptor2;
Mat results(0, 1, CV_32FC1);
for(j = 1; j <= 4; j++)
for(i = 1; i <= 60; i++)
{
sprintf( ch, "%s%d%s%d%s", "eval/", j, " (",i,").jpg");
const char* imageName = ch;
img2 = cvLoadImage(imageName,0);
detector.detect(img2, keypoint2);
bowDE.compute(img2, keypoint2, bowDescriptor2);
evalData.push_back(bowDescriptor2);
groundTruth.push_back((float) j);
float response = svm.predict(bowDescriptor2);
results.push_back(response);
}
//calculate the number of unmatched classes
double errorRate = (double) countNonZero(groundTruth- results) / evalData.rows;
The question is: this code is not predicting from video. I want to know how to predict from video; that is, I want to detect the vehicles in a movie, and it should show 1 when it finds a vehicle in the movie.
For those who didn't understand the question:
I want to play a movie in the above code:
VideoCapture cap("movie.avi"); // movie.avi has the background removed
Suppose I have trained data that contains vehicles, and "movie.avi" contains 5 vehicles; the code should detect those vehicles in movie.avi and give me 5 as output.
How do I do this part in the above code?
From looking at your code setup
params.svm_type = CvSVM::C_SVC;
it appears that you train your classifier with more than two classes. A typical example in a traffic scenario could be cars/pedestrians/bikes/... However, you were asking for a way to detect cars only. Without a description of your training data and your video it's hard to tell whether your idea makes sense. I guess what the previous answers are assuming is the following:
You loop through each frame and want to output the number of cars in that frame. Thus, a frame may contain multiple cars, say 5. If you take the whole frame as input to the classifier, it might respond "car", even if the setup might be a little off, conceptually. You cannot retrieve the number of cars reliably with this approach.
Instead, the suggestion is to try a sliding-window approach. This means, for example, that you loop over each pixel of the frame and take the region around the pixel (called a sub-window or region of interest) as input to the classifier. Assuming a fixed scale, the sub-window could have a size of 150x50 px, matching your training data. You might fix the scale of the cars in your training data, but in real-world videos the cars will be of different sizes. In order to find a car at a different scale, let's say twice as large as in the training data, the typical approach is to scale the image (say by a factor of 2) and repeat the sliding-window approach.
By repeating this for all relevant scales you end up with an algorithm that gives you, for each pixel location and each scale, the result of your classifier. This means you have three loops, or, in other words, there are three dimensions (image width, image height, scale). This is best understood as a three-dimensional pyramid. "Why a pyramid?" you might ask. Because each time the image is scaled (say by 2) the image gets smaller (or larger) and the next scale is an image of a different size (for example half the size).
The pixel locations indicate the position of the car and the scale indicates its size. Now, if you have an N-class classifier, each slot in this pyramid will contain a number (1,...,N) indicating the class. If you had a binary classifier (car/no car), then each slot would contain 0 or 1. Even in this simple case, where you would be tempted to simply count the number of 1s and output that count as the number of cars, you still have the problem that there could be multiple responses for the same car. Thus, it would be better if you had a car detector that gives continuous responses between 0 and 1, and then you could find maxima in this pyramid. Each maximum would indicate a single car. This kind of detection is successfully used with corner features, where you detect corners of interest in a so-called scale-space pyramid.
To summarize, no matter if you are simplifying the problem to a binary classification problem ("car"/"no car"), or if you are sticking to the more difficult task of distinguishing between multiple classes ("car"/"animal"/"pedestrian"/...), you still have the problem of scale and location in each frame to solve.
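As a minimal sketch of this sliding-window / pyramid search (classifyWindow() and CAR_CLASS are hypothetical stand-ins for your BOW-descriptor + SVM prediction on a region and for the car label; the window size, stride and scale step are only examples):
std::vector<cv::Rect> detections;
cv::Size window(150, 50);                      // matches the size of the training samples
for (double scale = 1.0; ; scale *= 1.5) {
    cv::Mat scaled;
    cv::resize(frame, scaled, cv::Size(), 1.0 / scale, 1.0 / scale);
    if (scaled.cols < window.width || scaled.rows < window.height)
        break;                                 // pyramid level is smaller than the window: stop
    for (int y = 0; y + window.height <= scaled.rows; y += 8) {
        for (int x = 0; x + window.width <= scaled.cols; x += 8) {
            cv::Mat roi = scaled(cv::Rect(x, y, window.width, window.height));
            if (classifyWindow(roi) == CAR_CLASS)      // hypothetical classifier call
                detections.push_back(cv::Rect(cvRound(x * scale), cvRound(y * scale),
                                              cvRound(window.width * scale),
                                              cvRound(window.height * scale)));
        }
    }
}
// 'detections' still needs grouping / non-maximum suppression before counting distinct cars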
The code you have for using images is written with OpenCV's C interface, so it's probably easiest to stick with that rather than use the C++ video interface.
In which case something along these lines should work:
CvCapture *capture = cvCaptureFromFile("movie.avi");
IplImage *img = 0;
while(img = cvQueryFrame(capture))
{
// Process image
...
}
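Expanding that loop into a per-frame prediction pass could look roughly like this (a sketch assuming the detector, bowDE and svm objects from the question's training code are still in scope; it classifies each whole frame, so counting several vehicles inside one frame would additionally need the sliding-window approach from the other answers):
CvCapture *capture = cvCaptureFromFile("movie.avi");
IplImage *frame = 0;
int detectionCount = 0;
while ((frame = cvQueryFrame(capture)))
{
    cv::Mat img(frame);                        // wrap the IplImage without copying
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat bowDescriptor;
    detector.detect(img, keypoints);
    bowDE.compute(img, keypoints, bowDescriptor);
    if (!bowDescriptor.empty()) {
        float response = svm.predict(bowDescriptor);
        if (response == 1.0f) {                // assuming label 1 was used for the vehicle class
            ++detectionCount;
            std::cout << "Vehicle detected, count = " << detectionCount << std::endl;
        }
    }
}
cvReleaseCapture(&capture);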
You should implement a sliding-window approach. In each window, you should apply the SVM to get candidates. Then, once you've done it for the whole image, you should merge the candidates (if you detected an object, it is very likely that you'll detect it again at a shift of a few pixels - that's the meaning of candidates).
Take a look at the V&J code in OpenCV or at the latentSVM code (detection by parts) to see how it's done there.
By the way, I would use the LatentSVM code (detection by parts) to detect vehicles. It has trained models for cars and for buses.
Good luck.
You need a detector, not a classifier. Take a look at Haar cascades, LBP cascades, latentSVM (as mentioned before) or the HOG detector.
I'll explain why. A detector usually scans the image with a sliding window, line by line, at several scales. In every window the detector solves the problem "object / not object". It may give you rough results, but very fast. Classifiers such as BOW work very slowly for this task. You should then apply classifiers to the regions found by the detector.
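A minimal sketch of that detector-then-classifier pipeline (the cascade file "cars.xml" and the grayscale frame frameGray are placeholders, not something shipped with OpenCV):
cv::CascadeClassifier carDetector;
if (!carDetector.load("cars.xml"))             // a cascade you trained or obtained elsewhere
    return;

std::vector<cv::Rect> candidates;
// fast, rough pass: detectMultiScale slides a window over several scales internally
carDetector.detectMultiScale(frameGray, candidates, 1.1, 3);
for (size_t i = 0; i < candidates.size(); ++i) {
    cv::Mat roi = frameGray(candidates[i]);
    // slower, more accurate pass: refine each candidate region with your BOW + SVM classifier
}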
I have a project in which I want to detect objects in images; my aim is to use HOG features. Using the OpenCV SVM implementation, I could find the code for detecting people, and I read some papers about tuning the parameters in order to detect objects instead of people. Unfortunately, I couldn't do that, for a few reasons: first of all, I am probably tuning the parameters incorrectly; second, I am not a good programmer in C++, but I have to do it with C++/OpenCV... Here you can find the code for detecting HOG features of people using C++/OpenCV.
Let's say that I want to detect the object in this image. Now, I will show you what I have tried to change in the code but it didn't work out with me.
The code that I tried to change:
HOGDescriptor hog;
hog.setSVMDetector(HOGDescriptor::getDefaultPeopleDetector());
I tried to replace getDefaultPeopleDetector() with the following parameters, but it didn't work:
(Size(64, 128), Size(16, 16), Size(8, 8), Size(8, 8), 9, 0,-1, 0, 0.2, true, cv::HOGDescriptor::DEFAULT_NLEVELS)
I then tried to make a vector, but when I wanted to print the results, it seemed to be empty.
vector<float> detector;
HOGDescriptor hog(Size(64, 128), Size(16, 16), Size(8, 8), Size(8, 8), 9, 0,-1, 0, 0.2, true, cv::HOGDescriptor::DEFAULT_NLEVELS);
hog.setSVMDetector(detector);
Please, I need help solving this problem.
In order to detect arbitrary objects using OpenCV HOG descriptors and an SVM classifier, you need to first train the classifier. Playing with the parameters will not help here, sorry :( .
In broad terms, you will need to complete the following steps:
Step 1) Prepare some training images of the objects you want to detect (positive samples). You will also need to prepare some images with no objects of interest (negative samples).
Step 2) Compute HOG features of the training samples and use these features to train an SVM classifier (also provided in OpenCV).
Step 3) Use the coefficients of the trained SVM classifier in the HOGDescriptor::setSVMDetector() method.
Only then can you use the peopledetector.cpp sample code to detect the objects you want to detect.
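A minimal sketch of steps 1 and 2 (sampleFiles and sampleIsPositive are hypothetical lists you would fill yourself, the 64x128 window is just an example, and a linear kernel is chosen so that the model can later be reduced to the single detecting vector needed in step 3):
// compute one HOG feature vector per sample image and train a linear CvSVM on them
cv::HOGDescriptor hog(cv::Size(64, 128), cv::Size(16, 16), cv::Size(8, 8), cv::Size(8, 8), 9);
cv::Mat trainData, trainLabels;
for (size_t i = 0; i < sampleFiles.size(); ++i) {
    cv::Mat img = cv::imread(sampleFiles[i], 0);
    cv::resize(img, img, hog.winSize);         // every sample must match the window size
    std::vector<float> desc;
    hog.compute(img, desc);
    trainData.push_back(cv::Mat(desc).t());    // one row per sample
    trainLabels.push_back(sampleIsPositive[i] ? 1.0f : -1.0f);
}
CvSVMParams params;
params.svm_type = CvSVM::C_SVC;
params.kernel_type = CvSVM::LINEAR;            // linear, so w and b can be extracted later
CvSVM svm;
svm.train(trainData, trainLabels, cv::Mat(), cv::Mat(), params);
// step 3 then means extracting the primal weight vector (w, b) from this linear SVM and
// passing it to hog.setSVMDetector(); the SVMlight-based answer below shows one way to get it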
I've been dealing with the same problem and, surprised by the lack of clean C++ solutions, I have created ~> this wrapper of SVMLight <~, a static library that provides the classes SVMTrainer and SVMClassifier, which simplify the training to something like:
// we are going to use HOG to obtain feature vectors:
HOGDescriptor hog;
hog.winSize = Size(32,48);
// and feed SVM with them:
SVMLight::SVMTrainer svm("features.dat");
then for each training sample:
// obtain feature vector describing sample image:
vector<float> featureVector;
hog.compute(img, featureVector, Size(8, 8), Size(0, 0));
// and write feature vector to the file:
svm.writeFeatureVectorToFile(featureVector, true); // true = positive sample
till the features.dat file contains feature vectors for all samples and at the end you just call:
std::string modelName("classifier.dat");
svm.trainAndSaveModel(modelName);
Once you have a file with model (or features.dat that you can just train the classifier with):
SVMLight::SVMClassifier c(classifierModelName);
vector<float> descriptorVector = c.getDescriptorVector();
hog.setSVMDetector(descriptorVector);
...
vector<Rect> found;
Size padding(Size(0, 0));
Size winStride(Size(8, 8));
hog.detectMultiScale(segment, found, 0.0, winStride, padding, 1.01, 0.1);
just check the documentation of HOGDescriptor for more info :)
I have done similar things to what you did: collect positive and negative sample images, use HOG to extract features of cars, train the feature set using a linear SVM (I use SVMlight), then use the model to detect cars with HOG's detectMultiScale function.
I got a lot of false positives, so I retrained using the positive samples plus the false positives and negative samples. The resulting model was then tested again. The detection improved (fewer false positives), but the result is still not satisfying (on average a 50% hit rate and 50% false positives). Tuning the detectMultiScale parameters improved the result, but not by much (10% fewer false positives and an increase in the hit rate).
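The retraining step can be sketched roughly as follows (negativeImages is a placeholder for your list of images known to contain no cars, and hog is assumed to already carry the first trained detector):
// every detection on a car-free image is by definition a false positive:
// crop it and add it to the negative training set for the next training round
std::vector<cv::Rect> found;
for (size_t i = 0; i < negativeImages.size(); ++i) {
    cv::Mat img = cv::imread(negativeImages[i], 1);
    hog.detectMultiScale(img, found);
    for (size_t j = 0; j < found.size(); ++j) {
        cv::Rect r = found[j] & cv::Rect(0, 0, img.cols, img.rows);   // clip to the image borders
        cv::Mat falsePositive;
        cv::resize(img(r), falsePositive, hog.winSize);
        // compute its HOG feature vector, append it as a negative sample, then retrain the SVM
    }
}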
Edit
I can share the source code with you if you'd like, and I am very open to discussion, as I have not gotten satisfactory results using HOG. Anyway, I think the code can be a good starting point for using HOG for training and detection.
Edit: adding code
static void calculateFeaturesFromInput(const string& imageFilename, vector<float>& featureVector, HOGDescriptor& hog)
{
Mat imageData = imread(imageFilename, 1);
if (imageData.empty()) {
featureVector.clear();
printf("Error: HOG image '%s' is empty, features calculation skipped!\n", imageFilename.c_str());
return;
}
// Check for mismatching dimensions
if (imageData.cols != hog.winSize.width || imageData.rows != hog.winSize.height) {
featureVector.clear();
printf("Error: Image '%s' dimensions (%u x %u) do not match HOG window size (%u x %u)!\n", imageFilename.c_str(), imageData.cols, imageData.rows, hog.winSize.width, hog.winSize.height);
return;
}
vector<Point> locations;
hog.compute(imageData, featureVector, winStride, trainingPadding, locations);
imageData.release(); // Release the image again after features are extracted
}
...
int main(int argc, char** argv) {
// <editor-fold defaultstate="collapsed" desc="Init">
HOGDescriptor hog; // Use standard parameters here
hog.winSize.height = 128;
hog.winSize.width = 64;
// Get the files to train from somewhere
static vector<string> tesImages;
static vector<string> positiveTrainingImages;
static vector<string> negativeTrainingImages;
static vector<string> validExtensions;
validExtensions.push_back("jpg");
validExtensions.push_back("png");
validExtensions.push_back("ppm");
validExtensions.push_back("pgm");
// </editor-fold>
// <editor-fold defaultstate="collapsed" desc="Read image files">
getFilesInDirectory(posSamplesDir, positiveTrainingImages, validExtensions);
getFilesInDirectory(negSamplesDir, negativeTrainingImages, validExtensions);
/// Retrieve the descriptor vectors from the samples
unsigned long overallSamples = positiveTrainingImages.size() + negativeTrainingImages.size();
// </editor-fold>
// <editor-fold defaultstate="collapsed" desc="Calculate HOG features and save to file">
// Make sure there are actually samples to train
if (overallSamples == 0) {
printf("No training sample files found, nothing to do!\n");
return EXIT_SUCCESS;
}
/// #WARNING: This is really important; some libraries (e.g. ROS) seem to set the system locale, which uses decimal commas instead of points and causes the file input parsing to fail
setlocale(LC_ALL, "C"); // Do not use the system locale
setlocale(LC_NUMERIC,"C");
setlocale(LC_ALL, "POSIX");
printf("Reading files, generating HOG features and save them to file '%s':\n", featuresFile.c_str());
float percent;
/**
* Save the calculated descriptor vectors to a file in a format that can be used by SVMlight for training
* #NOTE: If you split these steps into separate steps:
* 1. calculating features into memory (e.g. into a cv::Mat or vector< vector<float> >),
* 2. saving features to file / directly inject from memory to machine learning algorithm,
* the program may consume a considerable amount of main memory
*/
fstream File;
File.open(featuresFile.c_str(), ios::out);
if (File.good() && File.is_open()) {
File << "# Use this file to train, e.g. SVMlight by issuing $ svm_learn -i 1 -a weights.txt " << featuresFile.c_str() << endl; // Remove this line for libsvm which does not support comments
// Iterate over sample images
for (unsigned long currentFile = 0; currentFile < overallSamples; ++currentFile) {
storeCursor();
vector<float> featureVector;
// Get positive or negative sample image file path
const string currentImageFile = (currentFile < positiveTrainingImages.size() ? positiveTrainingImages.at(currentFile) : negativeTrainingImages.at(currentFile - positiveTrainingImages.size()));
// Output progress
if ( (currentFile+1) % 10 == 0 || (currentFile+1) == overallSamples ) {
percent = ((currentFile+1) * 100 / overallSamples);
printf("%5lu (%3.0f%%):\tFile '%s'", (currentFile+1), percent, currentImageFile.c_str());
fflush(stdout);
resetCursor();
}
// Calculate feature vector from current image file
calculateFeaturesFromInput(currentImageFile, featureVector, hog);
if (!featureVector.empty()) {
/* Put positive or negative sample class to file,
* true=positive, false=negative,
* and convert positive class to +1 and negative class to -1 for SVMlight
*/
File << ((currentFile < positiveTrainingImages.size()) ? "+1" : "-1");
// Save feature vector components
for (unsigned int feature = 0; feature < featureVector.size(); ++feature) {
File << " " << (feature + 1) << ":" << featureVector.at(feature);
}
File << endl;
}
}
printf("\n");
File.flush();
File.close();
} else {
printf("Error opening file '%s'!\n", featuresFile.c_str());
return EXIT_FAILURE;
}
// </editor-fold>
// <editor-fold defaultstate="collapsed" desc="Pass features to machine learning algorithm">
/// Read in and train the calculated feature vectors
printf("Calling SVMlight\n");
SVMlight::getInstance()->read_problem(const_cast<char*> (featuresFile.c_str()));
SVMlight::getInstance()->train(); // Call the core SVMlight training procedure
printf("Training done, saving model file!\n");
SVMlight::getInstance()->saveModelToFile(svmModelFile);
// </editor-fold>
// <editor-fold defaultstate="collapsed" desc="Generate single detecting feature vector from calculated SVM support vectors and SVM model">
printf("Generating representative single HOG feature vector using svmlight!\n");
vector<float> descriptorVector;
vector<unsigned int> descriptorVectorIndices;
// Generate a single detecting feature vector (v1 | b) from the trained support vectors, for use e.g. with the HOG algorithm
SVMlight::getInstance()->getSingleDetectingVector(descriptorVector, descriptorVectorIndices);
// And save the precious to file system
saveDescriptorVectorToFile(descriptorVector, descriptorVectorIndices, descriptorVectorFile);
// </editor-fold>
// <editor-fold defaultstate="collapsed" desc="Test detecting vector">
cout << "Test Detecting Vector" << endl;
hog.setSVMDetector(descriptorVector); // Set our custom detecting vector
cout << "descriptorVector size: " << sizeof(descriptorVector) << endl;
getFilesInDirectory(tesSamplesDir, tesImages, validExtensions);
namedWindow("Test Detector", 1);
for( size_t it = 0; it < tesImages.size(); it++ )
{
cout << "Process image " << tesImages[it] << endl;
Mat image = imread( tesImages[it], 1 );
detectAndDrawObjects(image, hog);
for(;;)
{
int c = waitKey();
if( (char)c == 'n')
break;
else if( (char)c == '\x1b' )
exit(0);
}
}
// </editor-fold>
return EXIT_SUCCESS;
}