There are two parameters to tune when using RBF kernels with Support Vector Machines: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify a good (C, γ) pair so that the classifier can accurately predict unknown data (i.e., testing data).
weka.classifiers.meta.GridSearch is a meta-classifier for tuning a pair of parameters. It takes ages to finish, however, when the dataset is rather large. What would you suggest doing to bring down the time this search requires?
According to A User's Guide to Support Vector Machines:
C: the soft-margin constant. A smaller value of C allows the classifier to ignore points close to the boundary, and increases the margin.
γ > 0 is a parameter that controls the width of the Gaussian.
Hastie et al.'s SVMPath explores the entire regularization path for C at about the same computational cost as training a single SVM model. From their paper:
Our R function SvmPath computes all 632 steps in the mixture example (n+ = n− = 100, radial kernel, γ = 1) in 1.44(0.02) secs on a pentium 4, 2Ghz linux machine; the svm function (using the optimized code libsvm, from the R library e1071) takes 9.28(0.06) seconds to compute the solution at 10 points along the path. Hence it takes our procedure about 50% more time to compute the entire path, than it costs libsvm to compute a typical single solution.
They released a GPL-licensed implementation of the algorithm in R, which you can download from CRAN.
Using SVMPath should let you find a good C value for any given γ quickly. You would still need separate training runs for different γ values, but this should be much faster than a separate run for every (C, γ) pair.
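If you are working in Python rather than R, the same per-γ workflow can be sketched with scikit-learn (an illustration of the structure only; scikit-learn has no SVMPath equivalent, so the inner C search here is an ordinary grid rather than a cheap path computation):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# One run per gamma; within each run, search C only. With SVMPath the
# inner step would cost roughly as much as a single SVM fit.
for gamma in np.logspace(-3, 1, 5):
    search = GridSearchCV(SVC(kernel="rbf", gamma=gamma),
                          param_grid={"C": np.logspace(-2, 3, 6)}, cv=3)
    search.fit(X, y)
    print("gamma=%g: best C=%g (CV accuracy %.3f)"
          % (gamma, search.best_params_["C"], search.best_score_))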
I'm using the RFECV module in sklearn to find the optimal number of features, i.e. the one yielding the highest cross-validation score on 2 folds. I am using a ridge regressor as my estimator.
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
rfecv = RFECV(estimator=ridge, step=1, cv=KFold(n_splits=2))
rfecv.fit(df, y)
I have 5 features in my dataset, which I have standardized using StandardScaler.
I'll run the RFECV on my data, and it'll say that 2 features is optimal. But when I remove one of the features with the lowest regression coefficient and rerun the RFECV, it now says that 3 features is optimal.
When I step through all the features one at a time (as the recursive elimination should do), I find that 3 is in fact the optimal number.
I've tested this with other datasets, and have found that the optimal number of features changes as I remove features one at a time and rerun RFECV.
I might be missing something, but isn't that what RFECV is supposed to solve?
Any additional insights on RFECV are appreciated.
This actually makes sense. RFECV recommends a number of features based on the data it is given; when you remove a feature, you change the set of candidate subsets being scored.
From the docs:
# Determine the number of subsets of features by fitting across
# the train folds and choosing the "features_to_select" parameter
# that gives the least averaged error across all folds.
...
n_features_to_select = max(
    n_features - (np.argmax(scores) * step),
    n_features_to_select)
n_features_to_select determines how many features RFE should keep in any particular iteration (under the hood of RFECV).
rfe = RFE(estimator=self.estimator,
          n_features_to_select=n_features_to_select,
          step=self.step, verbose=self.verbose)
And so this is directly connected to the number of features you include in your initial rfecv.fit() step.
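To see this behaviour in isolation, here is a minimal sketch on synthetic data (a hypothetical setup, not the asker's dataset): the optimum RFECV reports can shift once a feature is dropped, because the candidate subsets and their CV scores change with the input.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, n_informative=3,
                       noise=10.0, random_state=0)

rfecv = RFECV(estimator=Ridge(), step=1, cv=KFold(n_splits=2))
rfecv.fit(X, y)
print("optimal with all 5 features:", rfecv.n_features_)

# Drop the worst-ranked feature and refit: the reported optimum may differ.
X_reduced = np.delete(X, np.argmax(rfecv.ranking_), axis=1)
rfecv2 = RFECV(estimator=Ridge(), step=1, cv=KFold(n_splits=2))
rfecv2.fit(X_reduced, y)
print("optimal after dropping one feature:", rfecv2.n_features_)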
Also, removing the feature with the lowest regression coefficient is not the best way to trim features. The coefficient reflects a feature's impact on the dependent variable, not necessarily the model's accuracy.
I've been training Inception V3 on a modest machine with a single GPU (GeForce GTX 980 Ti, 6GB). The maximum batch size appears to be around 40.
I've used the default learning rate settings specified in the inception_train.py file: initial_learning_rate = 0.1, num_epochs_per_decay = 30 and learning_rate_decay_factor = 0.16. After a couple of weeks of training, the best accuracy I was able to achieve is as follows (about 500K-1M iterations):
2016-06-06 12:07:52.245005: precision @ 1 = 0.5767 recall @ 5 = 0.8143 [50016 examples]
2016-06-09 22:35:10.118852: precision @ 1 = 0.5957 recall @ 5 = 0.8294 [50016 examples]
2016-06-14 15:30:59.532629: precision @ 1 = 0.6112 recall @ 5 = 0.8396 [50016 examples]
2016-06-20 13:57:14.025797: precision @ 1 = 0.6136 recall @ 5 = 0.8423 [50016 examples]
I've tried fiddling with the settings towards the end of the training session, but couldn't see any improvements in accuracy.
I've started a new training session from scratch with num_epochs_per_decay = 10 and learning_rate_decay_factor = 0.001 based on some other posts in this forum, but I'm sort of grasping in the dark here.
Any recommendations on good defaults for a small hardware setup like mine?
TL;DR: There is no known method for training an Inception V3 model from scratch in a tolerable amount of time on a modest hardware setup. I would strongly suggest retraining a pre-trained model on your desired task.
On a small hardware setup like yours, it will be difficult to achieve maximum performance. Generally speaking, CNNs perform best with the largest batch sizes possible, which means that training is often limited by the maximum batch size that fits in GPU memory.
The Inception V3 model available for download here was trained with an effective batch size of 1600 across 50 GPUs, where each GPU ran a batch size of 32.
Given your modest hardware, my number one suggestion would be to download the pre-trained model from the link above and retrain it for the individual task you have at hand. This will make your life much happier.
As a thought experiment (hardly practical), if you feel especially compelled to exactly match the training performance of the pre-trained model by training from scratch, you could run the following insane procedure on your single GPU (sketched in code after the list):
Run with a batch size of 32
Store the gradients from the run
Repeat this 50 times.
Average the gradients from the 50 batches.
Update all variables with the gradients.
Repeat
I am only mentioning this to give you a conceptual sense of what would need to be accomplished to achieve the exact same performance. Given the speed numbers you mentioned, this procedure would take months to run. Hardly practical.
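In today's terms this procedure is just gradient accumulation. A minimal sketch, assuming a TensorFlow 2-style model and optimizer (the original inception_train.py predates this API; model, loss_fn and dataset_iter below are placeholders):

import tensorflow as tf

ACCUM_STEPS = 50  # 50 micro-batches of 32 ~= effective batch size of 1600

def accumulated_step(model, optimizer, loss_fn, dataset_iter):
    # Accumulate gradients over ACCUM_STEPS micro-batches...
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    for _ in range(ACCUM_STEPS):
        x, y = next(dataset_iter)  # one micro-batch of 32
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        accum = [a + g for a, g in zip(accum, grads)]
    # ...average them, then apply one update for the effective big batch.
    accum = [a / ACCUM_STEPS for a in accum]
    optimizer.apply_gradients(zip(accum, model.trainable_variables))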
More realistically, if you are still strongly interested in training from scratch and doing the best you can, here are some general guidelines:
Always run with the largest batch size possible. It looks like you are already doing that. Great.
Make sure that you are not CPU-bound. That is, make sure the input processing queues are always modestly full, as displayed on TensorBoard. If not, increase the number of preprocessing threads or use a different CPU if available.
Re: learning rate. If you are always running synchronous training (which must be the case if you only have 1 GPU), then the higher the batch size, the higher the tolerable learning rate. I would try a series of several quick runs (e.g. a few hours each) to identify the highest learning rate that does not lead to NaNs. Once you find such a learning rate, knock it down by, say, 5-10% and run with that.
As for num_epochs_per_decay and decay_rate, there are several strategies. The one implied by 10 epochs per decay with a 0.001 decay factor is to hammer the model for as long as possible until the eval accuracy asymptotes, and only then lower the learning rate. This is a nice, simple strategy. I would verify in your model monitoring that the eval accuracy has indeed asymptoted before you allow the learning rate to decay. Finally, the decay factor is a bit ad hoc, but lowering by, say, a power of 10 seems to be a good rule of thumb.
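For reference, the decay those flags describe is a staircase exponential schedule, which is easy to sanity-check on paper (a sketch assuming the semantics of the inception_train.py flags):

# lr(epoch) = initial_lr * decay_factor ** floor(epoch / num_epochs_per_decay)
initial_learning_rate = 0.1
learning_rate_decay_factor = 0.16
num_epochs_per_decay = 30

def learning_rate_at(epoch):
    steps = epoch // num_epochs_per_decay
    return initial_learning_rate * learning_rate_decay_factor ** steps

for epoch in (0, 29, 30, 60, 90):
    print(epoch, learning_rate_at(epoch))  # 0.1, 0.1, 0.016, 0.00256, 0.00041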
Note again that these are general guidelines and others might even offer differing advice. The reason we cannot give you more specific guidance is that CNNs of this size are just not often trained from scratch on a modest hardware setup.
Excellent tips.
There is precedent for training using a setup similar to yours.
Check this out - http://3dvision.princeton.edu/pvt/GoogLeNet/
These people trained GoogLeNet, but using Caffe. Still, studying their experience would be useful.
I need to measure the message-decoding latency (3 to 5 µs) of a low-latency application.
I used the following method (sketched in Python below):
1. Get time T1.
2. Decode the data.
3. Get time T2.
4. L1 = T2 - T1.
5. Store L1 in an array (size = 100000).
6. Repeat the same steps 100000 times.
7. Print the array.
8. Get the 99th and 95th percentiles of the data set.
But I got fluctuations between each test run. Can someone explain the reason for this?
Could you suggest an alternative method?
Note: the application runs in a tight loop (occupying 100% of a CPU) and is bound to a core via the taskset command.
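In Python terms, the procedure looks roughly like this (a sketch only; the real application is presumably C/C++ pinned to a core, and decode_data is a placeholder for the actual decoding work):

import time

def decode_data():
    pass  # placeholder for the real message decoding

N = 100_000
latencies_ns = [0] * N
for i in range(N):
    t1 = time.perf_counter_ns()  # step 1: get T1
    decode_data()                # step 2: decode
    t2 = time.perf_counter_ns()  # step 3: get T2
    latencies_ns[i] = t2 - t1    # steps 4-5: store L1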
There are a number of different ways that performance metrics can be gathered, either using code profilers or using existing system calls.
NC State University has a good resource on the different types of timers and profilers that are available as well as the appropriate case for using each and some examples on their HPC website here.
Fluctuations will inevitably occur on most modern systems. Certain BIOS settings related to hyper-threading and frequency scaling can have a significant impact on the performance of certain applications, as can power-consumption and cooling/environmental settings.
Looking at the distribution of results as a histogram, and/or fitting them to a Gaussian, will also help determine how normal the distribution is and whether the fluctuations are ordinary statistical noise or serious outliers. Running additional tests would also be beneficial.
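A sketch of that analysis, assuming latencies_ns holds the 100000 per-message timings collected above (synthetic stand-in data here):

import numpy as np

rng = np.random.default_rng(0)
latencies_ns = rng.lognormal(mean=8.1, sigma=0.15, size=100_000)  # stand-in

p95, p99 = np.percentile(latencies_ns, [95, 99])
print("p95 = %.0f ns, p99 = %.0f ns" % (p95, p99))

# Histogram of the distribution: a roughly bell-shaped body suggests normal
# statistical noise; isolated far-out bins point to serious outliers.
counts, edges = np.histogram(latencies_ns, bins=100)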
I am working on a vision project using C++ and OpenCV.
I need to classify a vector of 5 doubles. Is there a function in OpenCV to classify a vector of doubles?
If no such function exists, what is the easiest way to classify a vector of doubles in C++?
I extracted 5 points from the edges of the human body (head, hands and feet), and I need to train a neural network in order to identify whether the object is a human being or not.
For that purpose a Viola-Jones classifier would be better, I think. However, OpenCV provides a Multi-Layer Perceptron (MLP) which you can easily use for this.
You have to create a big (>1000 items) training set which contains five doubles for each item. Then, each time, hold out 5% or 10% of that set as a test set.
See Multi-Layer Perceptron here for more information about theory and implementation.
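For concreteness, here is a minimal sketch of OpenCV's MLP on 5-double feature vectors, shown via the Python bindings (the C++ cv::ml::ANN_MLP API is analogous; the training data below is a random stand-in, not real body-point features):

import numpy as np
import cv2

# Stand-in training data: rows of 5 doubles; labels +1 (human) / -1 (not).
samples = np.random.rand(1000, 5).astype(np.float32)
labels = np.where(samples.sum(axis=1) > 2.5, 1.0, -1.0).astype(np.float32)
labels = labels.reshape(-1, 1)

mlp = cv2.ml.ANN_MLP_create()
mlp.setLayerSizes(np.array([5, 10, 1], dtype=np.int32))  # 5 in, 10 hidden, 1 out
mlp.setActivationFunction(cv2.ml.ANN_MLP_SIGMOID_SYM)
mlp.setTrainMethod(cv2.ml.ANN_MLP_BACKPROP)
mlp.train(samples, cv2.ml.ROW_SAMPLE, labels)

_, out = mlp.predict(samples[:5])
print(out.ravel())  # threshold at 0 to decide human / not human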
However, I warn you that with such a classifier you probably won't get good results, as 5 points are probably not sufficient and you may get many false positives.
I'm very new to image processing, and my first assignment is to make a working program that can recognize faces and their names.
So far, I have successfully made a project that detects a face, crops the detected image, applies a Sobel filter, and translates the result into an array of floats.
But I'm very confused about how to implement a backpropagation MLP to learn the image so it can recognize the correct name for the detected face.
It would be a great honor if the experts on Stack Overflow could give me some examples of how to feed the image array into a network trained with backpropagation.
This is a standard machine learning task. You have a number of arrays of floats (instances in ML terms, or observations in statistics) and corresponding names (labels, class tags), one per array. This is enough for use in most ML algorithms. Specifically, in an ANN the elements of your array (i.e. features) are the inputs of the network and the labels (names) are its outputs.
If you are looking for a theoretical description of backpropagation, take a look at Stanford's ml-class lectures (ANN section). If you need a ready implementation, read this question.
You haven't specified what the elements of your arrays are. If you use just the pixels of the original image, this should work, but not very well. If you need a production-level system (though still using an ANN), try to extract higher-level features (e.g. the Haar-like features that OpenCV itself uses).
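As a concrete sketch of that mapping (my own illustration; the answer does not prescribe a library): flattened image arrays as rows, names as labels, and a backprop-trained MLP via scikit-learn:

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 64 * 64))  # stand-in: 200 flattened Sobel face images
names = rng.choice(["alice", "bob", "carol"], size=200)  # one label per array

# MLPClassifier trains a feed-forward network with backpropagation.
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
clf.fit(X, names)
print(clf.predict(X[:3]))  # predicted names for the first three faces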
Have you tried writing your feature vectors to an ARFF file and feeding them to Weka, just to see if your approach might work at all?
Weka has a lot of classifiers integrated, including MLP.
From what I understand so far, I suspect the features and the classifier you have chosen will not work.
To your original question: have you made any attempt to implement a neural network on your own? If so, where did you get stuck? Note that this is not the place to request a complete working implementation from the audience.
To provide a general answer on a general question:
Usually an MLP has nodes: input nodes, output nodes, and hidden nodes, strictly organized in layers, with the input layer at the bottom, the output layer at the top, and hidden layers in between. The nodes are connected in a simple feed-forward fashion (output connections are allowed to the next higher layer only).
Then you connect each of your floats to a single input node and feed the feature vectors to your network. For backpropagation you need to supply an error signal at the output nodes. If you have n names to distinguish, you may use n output nodes (i.e. one for each name); for example, have each return 1 in case of a match and 0 otherwise. You could also use a single output node and let it return n different values for the names. It would probably even be best to use n completely separate perceptrons, i.e. one per name, to avoid some side effects (catastrophic interference).
Note that the output of each node is a number, not a name. Therefore you need some sort of threshold to get a number-to-name mapping.
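A toy numpy sketch of exactly that layout (one hidden layer, n output nodes, sigmoid units, plain backpropagation; sizes and data are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 16, 8, 3  # e.g. 3 names to distinguish

W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.random((60, n_in))                     # stand-in feature vectors
T = np.eye(n_out)[rng.integers(0, n_out, 60)]  # one-hot name targets

lr = 0.5
for _ in range(2000):
    H = sigmoid(X @ W1)             # forward: input -> hidden
    Y = sigmoid(H @ W2)             # forward: hidden -> output
    dY = (Y - T) * Y * (1 - Y)      # error signal at the output nodes
    dH = (dY @ W2.T) * H * (1 - H)  # backpropagated to the hidden layer
    W2 -= lr * H.T @ dY / len(X)    # gradient step on each weight matrix
    W1 -= lr * X.T @ dH / len(X)

pred = np.argmax(sigmoid(sigmoid(X @ W1) @ W2), axis=1)  # index -> name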
Also note that you need a lot of training data to train a large network (because of the curse of dimensionality). It would be interesting to know the size of your float array.
Indeed, for a complex decision you may need a larger number of hidden nodes or even more hidden layers.
Further note that you may need to do a lot of evaluation (i.e. cross-validation) to find the optimal configuration (number of layers, number of nodes per layer), or even to find any working configuration at all.
Good luck, anyway!