Unable to train/fine-tune with PReLU in caffe - computer-vision

I am working on face recognition with a deep neural network. I am using the CASIA-webface database of 10575 classes to train a deep CNN (the one used by CASIA, see the paper for details) of 10 convolution, 5 pooling and 1 fully connected layer. For the activation it uses the "ReLU" function. I was able to train it successfully using Caffe and obtained the desired performance.
My problem is that I am unable to train/fine-tune the same CNN using the "PReLU" activation. At first, I thought that simply replacing "ReLU" with "PReLU" would do the job. However, neither fine-tuning (from the caffemodel that was learned with "ReLU") nor learning from scratch worked.
To simplify the learning problem, I reduced the training dataset significantly, to only 50 classes. Even then, the CNN was unable to learn with "PReLU", whereas it was able to learn with "ReLU".
To verify that my Caffe build works fine with "PReLU", I ran simple networks (with both "ReLU" and "PReLU") on the cifar10 data, and they worked.
I would like to know from the community if anyone has made similar observations, or can suggest a way to overcome this problem.

The main difference between "ReLU" and "PReLU" activation is that the latter activation function has a non-zero slope for negative values of input, and that this slope can be learned from the data. It was observed that these properties make the training more robust to the random initialization of the weights.
I used "PReLU" activation for fine-tuning nets that were trained originally with "ReLU"s and I experienced faster and more robust convergence.
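As a quick illustration of that difference (a plain-NumPy sketch, not Caffe code), PReLU keeps a slope a on the negative side and reduces exactly to ReLU when a = 0:

```python
import numpy as np

def prelu(x, a):
    """PReLU: identity for positive inputs, slope `a` for negative inputs."""
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
out_relu_like = prelu(x, 0.0)    # identical to ReLU: max(x, 0)
out_leaky = prelu(x, 0.25)       # negative inputs scaled by 0.25
```

During training, Caffe treats the slope as an ordinary parameter and learns it by backprop (one slope per channel unless channel_shared is set).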
My suggestion is to replace "ReLU" with the following configuration
layer {
  name: "prelu"
  type: "PReLU"
  bottom: "my_bottom"
  top: "my_bottom" # you can make it "in-place" to save memory
  param { lr_mult: 1 decay_mult: 0 }
  prelu_param {
    filler { type: "constant" value: 0 }
    channel_shared: false
  }
}
Note that by initializing the negative slope to 0, the "PReLU" activations are in fact the same as "ReLU", so you start the fine-tuning from exactly the same spot as your original net.
Also note that I explicitly set the learning rate and decay rate coefficients (1 and 0 respectively); you might need to tweak these params a bit, though I believe setting decay_mult to any value other than zero is not wise.

I was able to train fine using PReLU for my network, albeit with slightly lower accuracy than with ReLU. And yes, I simply swapped out ReLU for PReLU as well.
However, I have fairly consistently noticed that PReLU converges much faster than ReLU. So maybe you need to lower your learning rate?

Two basic observations:
PReLU is not guaranteed to produce results more accurate than those with ReLU. It worked better with AlexNet on ImageNet, but this merely suggests further research and refinement; it doesn't necessarily transfer to other applications.
CIFAR, ImageNet, and CASIA-webface are not identical applications.
You've already done the proper first step, changing the learning rate. Next, I would try tweaking the command-line arguments: change the convergence epsilon, momentum, weight decay, or other internal tuning parameters. Sometimes, it takes a tweak there to take advantage of even a minor topology change.
Change the input batch size. Are you allowed to alter the topology in other ways, such as altering a convolution layer? You might see what you get with a different approach to the CONV2 filters.

Related

Neural network and image classification

I have built an experimental neural network - the idea being that it can look at a JPEG image and identify which parts of the image are musical notation.
To train the network I have used various images of pages cut into 100 x 100 boxes which can either be valued at 1.0 (ie contains notation) or 0.0 (does not contain notation).
On training the network, though, it seems to have settled into delivering a result of roughly 0.5 every time (giving a squared error of 0.25). The sigmoid (logistic) function is used for activation.
The network has 10,000 input neurons (for each pixel of the 100 x 100 image), 2000 hidden neurons (each input is attached to both a 'row' and a 'column' hidden neuron).
There is one output neuron.
Would I get better results with two output neurons? (ie one which activates for 'is music' and one which activates for 'is not music').
(You can see the C++ source for this here: https://github.com/mcmenaminadrian/musonet - though at any given time what is in the public repo may not be exactly what I am using on the machine.)
FWIW, the actual problem was a sign error in the code, as described in the comment: the two layers were fighting one another and, as you might expect, converged towards the middle.
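That collapse-to-the-middle is easy to check with toy numbers (hypothetical, assuming the 1.0/0.0 training boxes are roughly balanced): a network stuck at a constant output minimizes its squared error by predicting the label mean, which here is 0.5, giving exactly the 0.25 error observed:

```python
# Hypothetical balanced labels: half "contains notation" (1.0), half not (0.0).
targets = [1.0] * 50 + [0.0] * 50

def mse(constant, ys):
    """Mean squared error of a network that outputs `constant` everywhere."""
    return sum((constant - y) ** 2 for y in ys) / len(ys)

mean_label = sum(targets) / len(targets)   # 0.5
err_at_mean = mse(mean_label, targets)     # 0.25, the reported error
```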
But I based my code on a book from the 1990s, the much-cited "Practical Neural Network Recipes in C++". There is nothing wrong with the book as such (though the C++ reflects that era's coding style, with no use of STL classes and so on), but it also comes from a time when neural nets were not as well understood/engineered as today, so the basic design was quite flawed.
I'm now thinking about how best to implement a many layered convolutional network - not something discussed in the book at all (indeed it dismisses the idea of many layered networks relying instead on the fact that a single hidden layer NN is a general approximator).
I got some interesting results with the single hidden layer NN, but it's not really all that useful for image processing.

Training Tensorflow Inception-v3 Imagenet on modest hardware setup

I've been training Inception V3 on a modest machine with a single GPU (GeForce GTX 980 Ti, 6GB). The maximum batch size appears to be around 40.
I've used the default learning rate settings specified in the inception_train.py file: initial_learning_rate = 0.1, num_epochs_per_decay = 30 and learning_rate_decay_factor = 0.16. After a couple of weeks of training the best accuracy I was able to achieve is as follows (About 500K-1M iterations):
2016-06-06 12:07:52.245005: precision @ 1 = 0.5767 recall @ 5 = 0.8143 [50016 examples]
2016-06-09 22:35:10.118852: precision @ 1 = 0.5957 recall @ 5 = 0.8294 [50016 examples]
2016-06-14 15:30:59.532629: precision @ 1 = 0.6112 recall @ 5 = 0.8396 [50016 examples]
2016-06-20 13:57:14.025797: precision @ 1 = 0.6136 recall @ 5 = 0.8423 [50016 examples]
I've tried fiddling with the settings towards the end of the training session, but couldn't see any improvements in accuracy.
I've started a new training session from scratch with num_epochs_per_decay = 10 and learning_rate_decay_factor = 0.001 based on some other posts in this forum, but I'm sort of grasping in the dark here.
Any recommendations on good defaults for a small hardware setup like mine?
TL;DR: There is no known method for training an Inception V3 model from scratch in a tolerable amount of time on a modest hardware setup. I would strongly suggest retraining a pre-trained model on your desired task.
On a small hardware setup like yours, it will be difficult to achieve maximum performance. Generally speaking for CNNs, the best performance comes with the largest batch sizes possible. This means that for CNNs the training procedure is often limited by the maximum batch size that can fit in GPU memory.
The Inception V3 model available for download here was trained with an effective batch size of 1600 across 50 GPUs, where each GPU ran a batch size of 32.
Given your modest hardware, my number one suggestion would be to download the pre-trained model from the link above and retrain it for the individual task you have at hand. This would make your life much happier.
As a thought experiment (but hardly practical), if you feel especially compelled to exactly match the training performance of the pre-trained model by training from scratch, you could run the following procedure on your single GPU:
Run with a batch size of 32.
Store the gradients from the run.
Repeat this 50 times.
Average the gradients from the 50 batches.
Update all variables with the gradients.
Repeat.
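That procedure is gradient accumulation. A framework-agnostic sketch (plain Python with a toy quadratic loss; the function names are mine, not TensorFlow's):

```python
def grad(w, batch):
    """Gradient of a toy loss 0.5*(w - x)^2 averaged over one micro-batch."""
    return sum(w - x for x in batch) / len(batch)

def accumulated_step(w, micro_batches, lr):
    # Steps 1-3: run each micro-batch (size 32 above) and store its gradient.
    grads = [grad(w, b) for b in micro_batches]
    # Step 4: average the gradients from the 50 runs.
    avg = sum(grads) / len(grads)
    # Step 5: one update of the variables with the averaged gradient.
    return w - lr * avg
```

With 50 equally sized micro-batches this is mathematically the same update as one batch of 1600, which is why it would match the multi-GPU result, only vastly slower.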
I am only mentioning this to give you a conceptual sense of what would need to be accomplished to achieve the exact same performance. Given the speed numbers you mentioned, this procedure would take months to run. Hardly practical.
More realistically, if you are still strongly interested in training from scratch and doing the best you can, here are some general guidelines:
Always run with the largest batch size possible. It looks like you are already doing that. Great.
Make sure that you are not CPU bound. That is, make sure that the input processing queues are always modestly full as displayed on TensorBoard. If not, increase the number of preprocessing threads or use a different CPU if available.
Re: learning rate. If you are always running synchronous training (which must be the case if you only have 1 GPU), then the higher the batch size, the higher the tolerable learning rate. I would try a series of several quick runs (e.g. a few hours each) to identify the highest learning rate which does not lead to NaNs. After you find such a learning rate, knock it down by say 5-10% and run with that.
As for num_epochs_per_decay and decay_rate, there are several strategies. The strategy highlighted by 10 epochs per decay and a 0.001 decay factor is to hammer the model for as long as possible until the eval accuracy asymptotes, and then lower the learning rate. This is a nice, simple strategy. I would verify this in your model monitoring: check that the eval accuracy has indeed asymptoted before you allow the model to decay the learning rate. Finally, the decay factor is a bit ad-hoc, but lowering by say a power of 10 seems to be a good rule of thumb.
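If I read the inception_train.py parameters correctly (worth double-checking against the source), the two knobs define a staircase exponential decay:

```python
def staircase_lr(epoch, initial_lr=0.1, epochs_per_decay=30, decay_factor=0.16):
    """Multiply the learning rate by decay_factor once every
    epochs_per_decay epochs (staircase exponential decay)."""
    return initial_lr * decay_factor ** (epoch // epochs_per_decay)

# With the defaults: 0.1 for epochs 0-29, then 0.016, then 0.00256, ...
```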
Note again that these are general guidelines and others might even offer differing advice. The reason why we can not give you more specific guidance is that CNNs of this size are just not often trained from scratch on a modest hardware setup.
Excellent tips.
There is precedent for training with a similar setup to yours.
Check this out - http://3dvision.princeton.edu/pvt/GoogLeNet/
These people trained GoogLeNet, but using Caffe. Still, studying their experience would be useful.

Neural Network gives same output for different inputs, doesn't learn

I have a neural network written in standard C++11 which I believe follows the back-propagation algorithm correctly (based on this). If I output the error in each step of the algorithm, however, it seems to oscillate without dampening over time. I've tried removing momentum entirely and choosing a very small learning rate (0.02), but it still oscillates at roughly the same amplitude per network (with each network having a different amplitude within a certain range).
Further, all inputs result in the same output (a problem I found posted here before, although for a different language. The author also mentions that he never got it working.)
The code can be found here.
To summarize how I have implemented the network:
Neurons hold the current weights to the neurons ahead of them, previous changes to those weights, and the sum of all inputs.
Neurons can have their value (sum of all inputs) accessed, or can output the result of passing said value through a given activation function.
NeuronLayers act as Neuron containers and set up the actual connections to the next layer.
NeuronLayers can send the actual outputs to the next layer (instead of pulling from the previous).
FFNeuralNetworks act as containers for NeuronLayers and manage forward-propagation, error calculation, and back-propagation. They can also simply process inputs.
The input layer of an FFNeuralNetwork sends its weighted values (value * weight) to the next layer. Each neuron in each layer afterwards outputs the weighted result of the activation function unless it is a bias, or the layer is the output layer (biases output the weighted value, the output layer simply passes the sum through the activation function).
Have I made a fundamental mistake in the implementation (a misunderstanding of the theory), or is there some simple bug I haven't found yet? If it would be a bug, where might it be?
Why might the error oscillate by the amount it does (around ±(0.2 ± learning rate)) even with a very low learning rate? Why might all the outputs be the same, no matter the input?
I've gone over most of it so much that I might be skipping over something, but I think I may have a plain misunderstanding of the theory.
It turns out I was just staring at the FFNeuralNetwork parts too much and accidentally used the wrong input set to confirm the correctness of the network. It actually does work correctly with the right learning rate, momentum, and number of iterations.
Specifically, in main, I was using inputs instead of a smaller array in to test the outputs of the network.

machine learning in c++

I am working on a vision project using C++ and OpenCV.
I need to classify a vector of 5 doubles. Is there a function in OpenCV to classify a vector of doubles?
And if no such function exists, what is the easiest way to classify a vector of doubles in C++?
I extracted 5 points from the edges of the human body (head, hands
and feet) and I need to train a neural network in order to identify
whether the object is a human being or not
For that purpose it would be better to use a Viola-Jones classifier, I think. However, OpenCV provides a Multi-Layer Perceptron (MLP) which you can easily use for this.
You have to create a big (>1000) training set which contains five doubles for each item. Then you have to hold out 5% or 10% of the elements of that set each time to create a test set.
See Multi-Layer-Perceptron here for more information about theory and implementation.
However, I warn you that with such a classifier you probably won't get good results, as 5 points are probably not sufficient and you may have many false positives.
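The 5-10% hold-out mentioned above can be sketched in a few lines (plain Python; the data here is random and purely illustrative):

```python
import random

def train_test_split(dataset, test_fraction=0.1, seed=0):
    """Shuffle and hold out test_fraction of the items as a test set."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]   # (train, test)

# e.g. 1200 items of five doubles each, labelled human (1) / not human (0)
data = [([random.random() for _ in range(5)], i % 2) for i in range(1200)]
train, test = train_test_split(data, test_fraction=0.1)
```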

Face Recognition Using Backpropagation Neural Network?

I'm very new in image processing and my first assignment is to make a working program which can recognize faces and their names.
So far, I have successfully made a project that detects faces, crops the detected image, applies a Sobel filter, and translates the result to an array of floats.
But I'm very confused about how to implement the backpropagation MLP to learn the image so it can recognize the correct name for the detected face.
I would be very grateful if the experts on Stack Overflow could give me some examples of how to feed the image array to a network trained with backpropagation.
This is a standard machine learning setup. You have a number of arrays of floats (instances in ML terms, or observations in statistics terms) and corresponding names (labels, class tags), one per array. This is enough for use in most ML algorithms. Specifically in ANNs, the elements of your array (i.e. features) are the inputs of the network and the labels (names) are its outputs.
If you are looking for theoretical description of backpropagation, take a look at Stanford's ml-class lectures (ANN section). If you need ready implementation, read this question.
You haven't specified what are elements of your arrays. If you use just pixels of original image, this should work, but not very well. If you need production level system (though still with the use of ANN), try to extract more high level features (e.g. Haar-like features, that OpenCV uses itself).
Have you tried writing your feature vectors to an arff file and to feed them to weka, just to see if your approach might work at all?
Weka has a lot of classifiers integrated, including MLP.
As far as I understand it so far, I suspect that the features and the classifier you have chosen do not work.
To your original question: have you made any attempts to implement a neural network on your own? If so, where did you get stuck? Note that this is not the place to request a complete working implementation from the audience.
To provide a general answer on a general question:
Usually you have nodes in an MLP. Specifically input nodes, output nodes, and hidden nodes. These nodes are strictly organized in layers. The input layer at the bottom, the output layer on the top, hidden layers in between. The nodes are connected in a simple feed-forward fashion (output connections are allowed to the next higher layer only).
Then you connect each of your floats to a single input node and feed the feature vectors to your network. For backpropagation you need to supply an error signal that you specify for the output nodes. So if you have n names to distinguish, you may use n output nodes (i.e. one for each name). Make them, for example, return 1 in case of a match and 0 otherwise. Alternatively, you could use one output node and let it return n different values for the names. It would probably even be best to use n completely separate perceptrons, i.e. one for each name, to avoid some side-effects (catastrophic interference).
Note, that the output of each node is a number, not a name. Therefore you need to use some sort of thresholds, to get a number-name relation.
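That thresholding step can be as simple as taking the highest-scoring output node (a sketch; the names here are hypothetical):

```python
def predict_name(outputs, names, threshold=0.5):
    """Map n output activations to a name: pick the strongest node and
    reject the match if its activation is below the threshold."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return names[best] if outputs[best] >= threshold else None

names = ["alice", "bob", "carol"]                # hypothetical labels
match = predict_name([0.1, 0.9, 0.2], names)     # "bob"
no_match = predict_name([0.1, 0.2, 0.3], names)  # None: no confident node
```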
Also note that you need a lot of training data to train a large network (i.e. to overcome the curse of dimensionality). It would be interesting to know the size of your float array.
Indeed, for a complex decision you may need a larger number of hidden nodes or even hidden layers.
Further note, that you may need to do a lot of evaluation (i.e. cross validation) to find the optimal configuration (number of layers, number of nodes per layer), or to find even any working configuration.
Good luck, anyway!