I have been trying to implement an ensemble of exemplar SVMs using OpenCV. The general idea behind it is that, given K exemplars for training (e.g., images of cars), one can train K SVMs, where each SVM is trained using one positive sample and K-1 negatives. At testing time, the ensemble then fires K times, with the highest score being the best prediction.
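For reference, a rough sketch of the test-time step I have in mind (not code I have run: the ensamble_svm container and EnsambleSVMElement struct follow the training code further below, testSample is assumed to be a 1xN CV_32F row of normalised HOG features, and the sign convention of the raw SVM output should be double-checked):
// Sketch only: score a test sample against every exemplar SVM and keep the best.
int bestExemplar = -1;
float bestScore = -std::numeric_limits<float>::max(); // needs <limits>
for (size_t k = 0; k < ensamble_svm.size(); ++k)
{
    // RAW_OUTPUT returns the raw decision value instead of the class label.
    float raw = ensamble_svm[k].svm->predict(testSample, cv::noArray(),
                                             cv::ml::StatModel::RAW_OUTPUT);
    if (raw > bestScore) { bestScore = raw; bestExemplar = (int)k; }
}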
To implement this, I am using OpenCV's SVM implementation (current Git master, 3.0 at the time of writing) and the C++ interface. As features, I am using HOGs, with a response of ~7000 values, so each image has a 7000-dimensional feature vector.
The problem I am facing is that the various SVMs do not train properly. Actually, they do not train at all! The training in fact executes extremely fast and always returns 1 support vector per SVM, with alpha=1.0. I am not sure if this is due to the fact that I have a single positive versus many (>900) negatives, or if it's simply a bug in my code. However, after looking at it several times, I cannot spot any obvious mistakes.
This is how I set up my problem (assuming we have the HOG responses for the entire dataset in a std::vector<std::vector<float>> trainingData). Please note that EnsambleSVMElement is a struct holding the SVM plus a bunch of other info.
In brief: I set up a training matrix, where each row contains the HOG response for a particular sample. I then train each SVM separately. For each training iteration, I create a label vector, where each entry is set to -1 (negative sample), except the entry associated with the current SVM I am training, which is set to 1 (so if I am training entry 100, the only positive label will be at labels[100]).
Training code
int ensambles = trainingData.size();
if(ensambles>1)
{
    //get params to normalise the data in [0-1]
    std::vector<float> mins(trainingData.size());
    std::vector<float> maxes(trainingData.size());
    for(int i=0; i<trainingData.size(); ++i)
    {
        mins[i] = *std::min_element(trainingData[i].begin(), trainingData[i].end());
        maxes[i] = *std::max_element(trainingData[i].begin(), trainingData[i].end());
    }
    float min_val = *std::min_element(mins.begin(), mins.end());
    float max_val = *std::min_element(maxes.begin(), maxes.end());
    int featurevector_size = trainingData[0].size();
    if(featurevector_size>0)
    {
        //set-up training data. i-th row contains HOG response for sample i
        cv::Mat trainingDataMat(ensambles, featurevector_size, CV_32FC1);
        for(int i=0; i<trainingDataMat.rows; ++i)
            for(int j=0; j<trainingDataMat.cols; ++j)
                trainingDataMat.at<float>(i, j) = (trainingData.at(i).at(j)-min_val)/(max_val-min_val); //make sure data are normalised in [0-1] - libSVM constraint
        for(int i=0; i<ensambles; ++i)
        {
            std::vector<int> labels(ensambles, -1);
            labels[i] = 1; //one positive only, and is the current sample
            cv::Mat labelsMat(ensambles, 1, CV_32SC1, &labels[0]);
            cv::Ptr<cv::ml::SVM> this_svm = cv::ml::StatModel::train<SVM>(trainingDataMat, ROW_SAMPLE, labelsMat, svmparams);
            ensamble_svm.push_back(EnsambleSVMElement(this_svm));
            Mat sv = ensamble_svm[i].svm->getSupportVectors();
            std::cout << "SVM_" << i << " has " << ensamble_svm[i].svm->getSupportVectors().rows << " support vectors." << std::endl;
        }
    }
    else
        std::cout <<"You passed empty feature vectors!" << std::endl;
}
else
    std::cout <<"I need at least 2 SVMs to create an ensamble!" << std::endl;
The cout always prints "SVM_i has 1 support vectors".
For completeness, these are my SVM parameters:
cv::ml::SVM::Params params;
params.svmType = cv::ml::SVM::C_SVC;
params.C = 0.1;
params.kernelType = cv::ml::SVM::LINEAR;
params.termCrit = cv::TermCriteria(cv::TermCriteria::MAX_ITER, (int)1e4, 1e-6);
Varying C between 0.1 and 1.0 doesn't affect the results. Neither does setting up weights for the samples, as read here. Just for reference, this is how I am setting up the weights (big penalty for negatives):
cv::Mat1f class_weights(1,2);
class_weights(0,0) = 0.01;
class_weights(0,1) = 0.99;
params.classWeights = class_weights;
There is clearly something wrong either in my code or in my formulation of the problem. Can anyone spot it?
Thanks!
Have you made any progress?
My guess is that your C parameter is too low; try bigger values (10, 100, 1000).
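For what it's worth, a quick way to check is to sweep C and watch the support-vector count. A rough sketch using the SVM::create()-style 3.0 API (which may differ slightly from the Params-based interface in the question; trainingDataMat and labelsMat as set up there):
for (double c : {0.1, 1.0, 10.0, 100.0, 1000.0})
{
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::LINEAR);
    svm->setC(c);
    svm->setTermCriteria(cv::TermCriteria(cv::TermCriteria::MAX_ITER, (int)1e4, 1e-6));
    svm->train(trainingDataMat, cv::ml::ROW_SAMPLE, labelsMat);
    std::cout << "C=" << c << " -> " << svm->getSupportVectors().rows
              << " support vectors" << std::endl;
}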
Another important aspect is that in the Exemplar SVM framework the training phase is not this simple. The author alternates between a training step and a hard negative mining step, in order to make the training phase more effective.
If you need more details than are reported in the Exemplar-SVM article, you can look at Malisiewicz's PhD thesis: http://people.csail.mit.edu/tomasz/papers/malisiewicz_thesis.pdf
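For illustration only, the alternation described there looks roughly like the sketch below (heavily simplified; sampleInitialNegatives() and trainLinearSvm() are placeholder names for steps you would implement yourself, negativePool is a large set of negative HOG windows, and the margin threshold is arbitrary):
// Rough sketch of exemplar training with hard-negative mining (not the author's code).
std::vector<cv::Mat> negatives = sampleInitialNegatives(negativePool); // placeholder
cv::Ptr<cv::ml::SVM> svm;
for (int round = 0; round < 5; ++round)
{
    svm = trainLinearSvm(positive, negatives); // placeholder: 1 positive vs current negatives
    std::vector<cv::Mat> hard;
    for (const cv::Mat& neg : negativePool)
    {
        // Raw decision value; negatives inside the margin (or misclassified) are "hard".
        float raw = svm->predict(neg, cv::noArray(), cv::ml::StatModel::RAW_OUTPUT);
        if (raw > -1.0f)
            hard.push_back(neg);
    }
    if (hard.empty()) break; // no more margin violations
    negatives.insert(negatives.end(), hard.begin(), hard.end());
}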
I think you have a small error here: on the second line, it should be
*std::max_element(...), not *std::min_element(...):
float min_val = *std::min_element(mins.begin(), mins.end());
float max_val = *std::min_element(maxes.begin(), maxes.end());
Related
I am trying to develop an Artificial Neural Network in C++ using OpenCV 3.4.1 with the aim of being able to recognise 33 different characters, including both numbers and letters, but the results I am obtaining are always wrong.
I have tested my code with different parameter values, such as the alpha and beta of the sigmoid function I am using for training, the backpropagation parameters, or the number of hidden nodes, but although the result sometimes varies, it normally tends to be a vector of the following shape:
Classification result:
[20.855789, -0.033862107, -0.0053131776, 0.026316155, -0.0032050854,
0.036046479, -0.025410429, -0.017537225, 0.015429396, -0.023276867, 0.013653283, -0.025660357, -0.051959664, -0.0032470606, 0.032143779, -0.011631044, 0.022339549, 0.041757714, 0.04414707, -0.044756029, 0.042280547, 0.012204648, 0.026924053, 0.016814215, -0.028257577, 0.05190875, -0.0070033628, -0.0084492415, -0.040644459, 0.00022287761, -0.0376678, -0.0021550131, -0.015310903]
That is, independently of which character I test, it is always predicting that the analysed character is the one in the first position of the characters vector, which corresponds to number '1'.
The training data is obtained from an XML file I have created, which contains 474 samples (rows) with 265 attributes each (cols). As for the training classes, following some advice I found in a previous question on this forum, they are obtained from another XML file that contains 474 rows, one for each training sample, and 33 columns, one for each character/class.
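For clarity, building that 474x33 response matrix from per-sample class indices usually looks something like the sketch below (labelIdx is a hypothetical vector holding the class index 0..32 of each sample; whether the "off" entries should be 0 or -1 depends on how the symmetric sigmoid targets are scaled):
// Sketch: one row per training sample, one column per class, CV_32F for ANN_MLP.
const int numSamples = 474, numClasses = 33;
Mat classes(numSamples, numClasses, CV_32F, Scalar(0.f));
for (int i = 0; i < numSamples; ++i)
    classes.at<float>(i, labelIdx[i]) = 1.f; // labelIdx[i] in [0, 32], hypothetical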
I attach the code below so that you can perhaps spot what I am doing wrong. Thanks in advance for any help you can offer! :)
//Create the Neural Network
Mat_<int> layerSizes(1, 3);
layerSizes(0, 0) = numFeaturesPerSample;
layerSizes(0, 1) = nlayers;
layerSizes(0, 2) = numClasses;
//Set ANN params
Ptr<ANN_MLP> network = ANN_MLP::create();
network->setLayerSizes(layerSizes);
network->setActivationFunction(ANN_MLP::SIGMOID_SYM, 0.6, 1);
network->setTrainMethod(ANN_MLP::BACKPROP, 0.1, 0.1);
Ptr<TrainData> trainData = TrainData::create(TrainingData, ROW_SAMPLE, classes);
network->train(trainData);
//Predict
if (network->isTrained())
{
    trained = true;
    Mat results;
    cout << "Predict:" << endl;
    network->predict(features, results);
    cout << "Prediction done!" << endl;
    cout << endl << "Classification result: " << endl << results << endl;
    // We need to know where the max value is in the output; the x (column) is the class.
    Point maxLoc;
    double maxVal;
    minMaxLoc(results, 0, &maxVal, 0, &maxLoc);
    return maxLoc.x;
}
Basically, I want to detect a fault in an image using logistic regression. I'm hoping to get some feedback on my approach, which is as follows:
For training:
Take small sections of the image marked "bad" and "good"
Greyscale them, then break them up into a series of 5*5 pixel segments
Calculate the histogram of pixel intensities for each of these segments (a sketch of this step is shown after this list)
Pass the histograms along with the labels to the Logistic Regression class for training
Break the whole image into 5*5 segments and predict "good"/"bad" for each segment.
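A rough sketch of the kind of per-segment histogram computation I mean (not my actual code; assuming a single-channel 8-bit image; my feature arrays are 255 elements, but 256 bins are used here purely for illustration):
// Sketch: normalized intensity histogram of one 5x5 greyscale segment.
std::vector<float> segmentHistogram(const cv::Mat& gray, int x, int y)
{
    std::vector<float> hist(256, 0.f);
    cv::Mat block = gray(cv::Rect(x, y, 5, 5)); // 5x5 segment
    for (int r = 0; r < block.rows; ++r)
        for (int c = 0; c < block.cols; ++c)
            hist[block.at<unsigned char>(r, c)] += 1.f;
    for (float& h : hist)
        h /= 25.f; // normalize by the pixel count of the segment
    return hist;
}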
Using the sigmoid function, the logistic regression hypothesis is:
1 / (1 + e^(-xθ))
where x is the vector of input values and theta (θ) is the weight vector. I use gradient descent to learn the weights. My code for this is:
void LogisticRegression::Train(float **trainingSet, float *labels, int m)
{
    float tempThetaValues[m_NumberOfWeights];
    for (int iteration = 0; iteration < 10000; ++iteration)
    {
        // Reset the temp values for theta.
        memset(tempThetaValues, 0, m_NumberOfWeights*sizeof(float));
        float error = 0.0f;
        // For each training example in the set
        for (int trainingExample = 0; trainingExample < m; ++trainingExample)
        {
            float * x = trainingSet[trainingExample];
            float y = labels[trainingExample];
            // Partial derivative of the cost function.
            float h = Hypothesis(x) - y;
            for (int i = 0; i < m_NumberOfWeights; ++i)
            {
                tempThetaValues[i] += h*x[i];
            }
            float cost = h-y; //Actual J(theta), Cost(x,y), keeps giving NaN use MSE for now
            error += cost*cost;
        }
        // Update the weights using batch gradient descent.
        for (int theta = 0; theta < m_NumberOfWeights; ++theta)
        {
            m_pWeights[theta] = m_pWeights[theta] - 0.1f*tempThetaValues[theta];
        }
        printf("Cost on iteration[%d] = %f\n", iteration, error);
    }
}
Where sigmoid and the hypothesis are calculated using:
float LogisticRegression::Sigmoid(float z) const
{
    return 1.0f/(1.0f+exp(-z));
}

float LogisticRegression::Hypothesis(float *x) const
{
    float z = 0.0f;
    for (int index = 0; index < m_NumberOfWeights; ++index)
    {
        z += m_pWeights[index]*x[index];
    }
    return Sigmoid(z);
}
And the final prediction is given by:
int LogisticRegression::Predict(float *x)
{
    return Hypothesis(x) > 0.5f;
}
As we are using a histogram of intensities, the input and weight arrays are 255 elements. My hope is to use it on something like a picture of an apple with a bruise and use it to identify the bruised parts. The (normalized) histograms for the bruised and good apple training sets look something like this:
For the "good" sections of the apple (y=0):
For the "bad" sections of the apple (y=1):
I'm not 100% convinced that using the intensities alone will produce the results I want, but even so, using it on a clearly separable data set isn't working either. To test it I passed it a labeled, completely white image and a completely black image. I then ran it on the small image below:
Even on this image it fails to identify any segments as being black.
Using MSE I see that the cost converges downwards to a point where it remains: for the black-and-white test it starts at a cost of about 250 and settles at 100. The apple chunks start at about 4000 and settle at 1600.
What I can't tell is where the issues are.
Is the approach sound but the implementation broken? Is logistic regression the wrong algorithm for this task? Is gradient descent not robust enough?
I forgot to answer this... Basically the problem was in my histograms, which weren't being memset to 0 when generated. As to the overall question of whether logistic regression with greyscale images was a good solution, the answer is no. Greyscale just didn't provide enough information for good classification. Using all colour channels was a bit better, but I think the complexity of the problem I was trying to solve (bruises in apples) was a bit much for simple logistic regression on its own. You can see the results on my blog here.
I want to perform template matching with a mask. In general, template matching can be made faster by converting the image from the spatial domain into the frequency domain. But is there any method I can apply if I want to perform the same with a mask? I'm using OpenCV C++. Is there any matching function already in OpenCV for this task?
My current approach:
Bitwise XOR image A and image B with the mask.
Count the non-zero pixels.
Fill the resultant matrix with this count.
Search for maxima.
A few parameters I'm guessing at now are:
Skip the tile position if the matches are less than 25%.
Skip the tile position if the previous tile has matches of less than 50%.
My question: is there any algorithm to do this matching already? Is there any mathematical operation which can speed up this process?
With binary images, you can directly use Hu moments and the Mahalanobis distance to determine whether image A is similar to image B. If the distance tends to 0, then the images are the same.
Of course you can also use feature detectors to see what matches, but for pictures like these, Hu moments and feature detectors will give approximately the same results, and Hu moments are more efficient.
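A minimal sketch of the Hu-moment comparison (for a proper Mahalanobis distance you would also need a covariance matrix estimated from a set of samples; here a simple log-scaled difference is used instead, which is essentially what cv::matchShapes computes):
// Sketch: compare two binary images by their Hu moments.
double huDistance(const cv::Mat& binA, const cv::Mat& binB)
{
    double huA[7], huB[7];
    cv::HuMoments(cv::moments(binA, /*binaryImage=*/true), huA);
    cv::HuMoments(cv::moments(binB, /*binaryImage=*/true), huB);
    double d = 0.0;
    for (int i = 0; i < 7; ++i)
    {
        // Log scaling keeps the widely differing magnitudes comparable.
        double a = std::copysign(std::log10(std::abs(huA[i]) + 1e-30), huA[i]);
        double b = std::copysign(std::log10(std::abs(huB[i]) + 1e-30), huB[i]);
        d += std::abs(a - b);
    }
    return d; // tends to 0 for identical shapes
}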
Using findContours, you can extract the black regions inside the white star and fill them, in order to have image A = image B.
Another approach: using findContours on your mask and applying the result to image A (extracting the region of interest), you can extract what's inside the star and count how many black pixels you have (the mismatching ones).
I have the same requirement and have tried almost the same approach. As in the image, I want to match the castle. The castle has a different shield image, a variable-length clan name, and also a grass background (the image comes from the game Clash of Clans). The normal OpenCV matchTemplate does not work, so I wrote my own.
I follow the approach of matchTemplate to create a result image, but with a different algorithm.
The core idea is to count the matched pixels under the mask. The code follows; it is simple.
This works fine, but the time cost is high. As you can see, it costs 457 ms.
Now I am working on the optimization.
The source and template images are both CV_8UC3; the mask image is CV_8U. Matching one channel is OK and is faster, but the cost is still high.
Mat tmp(matTempl.cols, matTempl.rows, matTempl.type());
int matchCount = 0;
float maxVal = 0;
double areaInvert = 1.0 / countNonZero(matMask);
for (int j = 0; j < resultRows; j++)
{
    float* data = imgResult.ptr<float>(j);
    for (int i = 0; i < resultCols; i++)
    {
        Mat matROI(matSource, Rect(i, j, matTempl.cols, matTempl.rows));
        tmp.setTo(Scalar(0));
        bitwise_xor(matROI, matTempl, tmp);
        bitwise_and(tmp, matMask, tmp);
        data[i] = 1.0f - float(countNonZero(tmp) * areaInvert);
        if (data[i] > matchingDegree)
        {
            SRect rc;
            rc.left = i;
            rc.top = j;
            rc.right = i + imgTemplate.cols;
            rc.bottom = j + imgTemplate.rows;
            rcOuts.push_back(rc);
            if (data[i] > maxVal)
            {
                maxVal = data[i];
                maxIndex = rcOuts.size() - 1;
            }
            if (++matchCount == maxMatchs)
            {
                Log_Warn("Too many matches, stopped at: " << matchCount);
                return true;
            }
        }
    }
}
It says I don't have enough reputation to post images, so here is a link:
http://i.stack.imgur.com/mJrqU.png
Update:
I successfully optimized the algorithm by using key points. Calculating all the points is costly, but it is much faster to calculate only several key points. As the picture shows, the cost decreases greatly; it is now about 7 ms.
I still cannot post images; please visit: http://i.stack.imgur.com/ePcD9.png
There is a formulation for template matching with a mask in the OpenCV documentation, which works well. It can be used by calling cv::matchTemplate, and its source code is also available under the Intel license.
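A minimal sketch of that call (as of OpenCV 3.x the optional mask parameter is supported for the TM_SQDIFF and TM_CCORR_NORMED methods; image, templ and mask are assumed to be already loaded, with mask the same size as templ):
// Sketch: masked template matching followed by peak extraction.
cv::Mat result;
cv::matchTemplate(image, templ, result, cv::TM_CCORR_NORMED, mask);
double minVal, maxVal;
cv::Point minLoc, maxLoc;
cv::minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc);
cv::Rect bestMatch(maxLoc, templ.size()); // maxLoc is the best location for TM_CCORR_NORMED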
I'm fairly new to OpenCV and C++ (learning them now after doing a fair share of image processing in MATLAB and LabVIEW).
I'm having a weird issue I wanted to ask your opinion about.
I'm trying to do a fairly simple thing: a 1x9 moving-window standard deviation on a grayscale image (~4500x2000 px).
Here is the heart of the code:
Mat src = imread("E:\\moon project\\Photos\\Skyline testing\\IMGP6043 sourse.jpg");
Scalar roi_mean, roi_stdev;
Mat stdev_map(src.rows, src.cols, CV_64FC1,Scalar(0));
cvtColor(src, src_gray, CV_BGR2GRAY);
int t = clock();
for (int i = 0; i < src_gray.cols - 1; i++)
{
    for (int j = 0; j < src_gray.rows - 8; j++)
    {
        meanStdDev(src_gray.col(i).rowRange(j, j+9), roi_mean, roi_stdev);
        stdev_map.at<double>(j, i) = roi_stdev[0];
    }
}
t = clock() - t;
cout << "stdev calc : " << t << " msec" << endl;
Now, on the aforementioned image it takes 35 seconds to run the double loop (the delta t value), and even if I throw away the meanStdDev call and just assign a constant to stdev_map.at<double>(j, i), it still takes 14 seconds to run the double loop.
I'm pretty sure I'm doing something wrong, since in LabVIEW it takes only 2.5 seconds to chew this baby with the exact same math.
Please help me.
To answer your question and some of the comments: compiling the library in release mode will surely reduce the computation time; by how much depends on your setup, for example, if you are using Eigen it will probably speed things up a lot.
If you really want to do the loop yourself, consider getting the row pointer to the data directly with mat.data or mat.ptr<cv::Vec3b>().
If you want to speed up the task of computing mean/stdDev on any part of your image, then use integral images. The doc is pretty clear about it, and I'm pretty sure it will take less than 2.5s probably even in debug mode.
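A sketch of the integral-image route for the 1x9 column window from the question (cv::integral gives a running sum and a running sum of squares, from which the mean and standard deviation of any rectangle follow in constant time; variable names match the question's code):
// Sketch: per-window std-dev via integral images (window = 1 column x 9 rows).
cv::Mat sum, sqsum;
cv::integral(src_gray, sum, sqsum, CV_64F); // both are (rows+1) x (cols+1)
const int n = 9;
cv::Mat stdev_map(src_gray.rows, src_gray.cols, CV_64FC1, cv::Scalar(0));
for (int i = 0; i < src_gray.cols; i++)
{
    for (int j = 0; j + n <= src_gray.rows; j++)
    {
        double s  = sum.at<double>(j + n, i + 1) - sum.at<double>(j, i + 1)
                  - sum.at<double>(j + n, i)     + sum.at<double>(j, i);
        double sq = sqsum.at<double>(j + n, i + 1) - sqsum.at<double>(j, i + 1)
                  - sqsum.at<double>(j + n, i)     + sqsum.at<double>(j, i);
        double mean = s / n;
        stdev_map.at<double>(j, i) = std::sqrt(std::max(sq / n - mean * mean, 0.0));
    }
}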
The point of the application is to recognize an image from a preset list of images. The images in the list have had their SIFT descriptors extracted and saved in files. Nothing interesting here:
std::vector<cv::KeyPoint> detectedKeypoints;
cv::Mat objectDescriptors;
// Extract data
cv::SIFT sift;
sift.detect(image, detectedKeypoints);
sift.compute(image, detectedKeypoints, objectDescriptors);
// Save the file
cv::FileStorage fs(file, cv::FileStorage::WRITE);
fs << "descriptors" << objectDescriptors;
fs << "keypoints" << detectedKeypoints;
fs.release();
Then the device takes a picture. SIFT descriptors are extracted in the same way. The idea now was to compare the descriptors to the ones from the files. I am doing that using the FLANN matcher from OpenCV. I am trying to quantify the similarity, image by image. After going through the whole list I should have the best match.
const cv::Ptr<cv::flann::IndexParams>& indexParams = new cv::flann::KDTreeIndexParams(1);
const cv::Ptr<cv::flann::SearchParams>& searchParams = new cv::flann::SearchParams(64);
// Match using Flann
cv::Mat indexMat;
cv::FlannBasedMatcher matcher(indexParams, searchParams);
std::vector< cv::DMatch > matches;
matcher.match(objectDescriptors, readDescriptors, matches);
After matching I understand that I get a list of the closest found distances between the feature vectors. I find the minimum distance and, using it I can count "good matches" and even get a list of the respective points:
// Count the number of matches where the distance is less than 2 * min_dist
int goodCount = 0;
for (int i = 0; i < objectDescriptors.rows; i++)
{
    if (matches[i].distance < 2 * min_dist)
    {
        ++goodCount;
        // Save the points for the homography calculation
        obj.push_back(detectedKeypoints[matches[i].queryIdx].pt);
        scene.push_back(readKeypoints[matches[i].trainIdx].pt);
    }
}
I'm showing the easy parts of the code just to make this easier to follow; I know some of it doesn't need to be here.
Continuing, I was hoping that simply counting the number of good matches like this would be enough, but it turned out mostly to just point me to the image with the most descriptors. What I tried after this was computing the homography. The aim was to compute it and see whether it's a valid homography or not. The hope was that a good match, and only a good match, would have a homography that is a good transformation. Creating the homography was done simply using cv::findHomography on obj and scene, which are std::vector<cv::Point2f>. I checked the validity of the homography using some code I found online:
bool niceHomography(cv::Mat H)
{
    std::cout << H << std::endl;
    const double det = H.at<double>(0, 0) * H.at<double>(1, 1) - H.at<double>(1, 0) * H.at<double>(0, 1);
    if (det < 0)
    {
        std::cout << "Homography: bad determinant" << std::endl;
        return false;
    }
    const double N1 = sqrt(H.at<double>(0, 0) * H.at<double>(0, 0) + H.at<double>(1, 0) * H.at<double>(1, 0));
    if (N1 > 4 || N1 < 0.1)
    {
        std::cout << "Homography: bad first column" << std::endl;
        return false;
    }
    const double N2 = sqrt(H.at<double>(0, 1) * H.at<double>(0, 1) + H.at<double>(1, 1) * H.at<double>(1, 1));
    if (N2 > 4 || N2 < 0.1)
    {
        std::cout << "Homography: bad second column" << std::endl;
        return false;
    }
    const double N3 = sqrt(H.at<double>(2, 0) * H.at<double>(2, 0) + H.at<double>(2, 1) * H.at<double>(2, 1));
    if (N3 > 0.002)
    {
        std::cout << "Homography: bad third row" << std::endl;
        return false;
    }
    return true;
}
I don't understand the math behind this so, while testing, I sometimes replaced this function with a simple check whether the determinant of the homography was positive. The problem is that I kept having issues here. The homographies were either all bad, or good when they shouldn't have been (when I was checking only the determinant).
I figured I should actually use the homography: for a number of points, compute their position in the destination image from their position in the source image, then compare those to the actual matched positions and average the distances; ideally I would get an obviously smaller average distance for the correct image. This did not work at all. All the distances were colossal. I thought I might have used the homography the other way around to calculate the right position, but switching obj and scene with each other gave similar results.
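For reference, the reprojection check I was attempting looks roughly like this (only a sketch of my intent, not working code from the project; H comes from cv::findHomography(obj, scene, ...), so it maps obj points into scene coordinates):
// Sketch: mean reprojection error of the matched points under H.
std::vector<cv::Point2f> projected;
cv::perspectiveTransform(obj, projected, H); // obj -> scene direction
double meanErr = 0.0;
for (size_t i = 0; i < obj.size(); ++i)
{
    cv::Point2f d = projected[i] - scene[i];
    meanErr += std::sqrt((double)d.x * d.x + (double)d.y * d.y);
}
meanErr /= std::max<size_t>(obj.size(), 1);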
Other things I tried were SURF descriptors instead of SIFT, BFMatcher (brute force) instead of FLANN, getting the n smallest distances for every image instead of a number depending on the minimum distance, or getting distances depending on a global maximum distance. None of these approaches gave me definite good results, and I feel stuck now.
My only next strategy would be to sharpen the images, or even turn them into binary images using some local threshold or some segmentation algorithm. I am looking for any suggestions or mistakes anyone can see in my work.
I don't know whether this is relevant, but I added some of the images I am testing this on. Many times in the test images, most of the SIFT vectors come from the frame (higher contrast) rather than the painting. This is why I'm thinking sharpening the images might work, but I don't want to go deeper in case something I did previously is wrong.
The gallery of images is here, with descriptions in the titles. The images are quite high resolution; please take a look in case they give some hints.
You can test whether, when matching, the lines between the source image and the target image are relatively parallel. If it's not a correct match, you'd have a lot of noise and the lines won't be parallel.
See the attached image which shows a correct match (using SURF and BF) - all the lines are mostly parallel (though I should point out that this is an easy example).
You are going the correct way.
First, use the second-nearest-neighbour ratio test instead of your "good match by 2*min_dist" criterion (a sketch is below): https://stackoverflow.com/a/23019889/1983544.
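A sketch with the FLANN matcher already used in the question (the 0.8 threshold is the commonly used value from Lowe's paper; adjust to taste):
// Sketch: Lowe's second-nearest-neighbour ratio test instead of 2*min_dist.
std::vector<std::vector<cv::DMatch>> knnMatches;
matcher.knnMatch(objectDescriptors, readDescriptors, knnMatches, 2);
std::vector<cv::DMatch> goodMatches;
for (size_t i = 0; i < knnMatches.size(); ++i)
{
    if (knnMatches[i].size() == 2 &&
        knnMatches[i][0].distance < 0.8f * knnMatches[i][1].distance)
        goodMatches.push_back(knnMatches[i][0]);
}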
Second, use the homography a different way. When you find a homography, you get not only the H matrix but also the number of correspondences consistent with it. Check whether that is some reasonable number, say >= 15 (a sketch is below). If it is less, the object is not matched.
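A sketch of that check (cv::RANSAC is spelled CV_RANSAC in the 2.4-style API the question appears to use; the 3-pixel reprojection threshold and the >= 15 cutoff are just the values mentioned here):
// Sketch: accept a match only if RANSAC finds enough consistent correspondences.
cv::Mat inlierMask;
cv::Mat H = cv::findHomography(obj, scene, cv::RANSAC, 3.0, inlierMask);
int inliers = H.empty() ? 0 : cv::countNonZero(inlierMask);
bool matched = (inliers >= 15);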
Third, if you have a big viewpoint change, SIFT and SURF are unable to match the images. Try using MODS instead (http://cmp.felk.cvut.cz/wbs/ has Windows and Linux binaries, as well as a paper describing the algorithm) or ASIFT (much slower and matches much worse, but open source): http://www.ipol.im/pub/art/2011/my-asift/
Or at least use the MSER or Hessian-Affine detector instead of SIFT (retaining SIFT as the descriptor).