I have a problem that I would like to solve using neural networks. I have a basic understanding of how cascade correlation networks work, but I am not sure whether I can extend one with new examples without completely retraining it.
For example say I want to train a XOR example, but I only have the first three triplets of inputs/outputs:
0 0 0
0 1 1
1 0 1
I understand how to train the network for these inputs/outputs, but say I want to add a fourth triplet:
1 1 0
without completely retraining the whole network. If I understand the algorithm correctly it should be possible, but I haven't found an appropriate C++ library or MATLAB toolbox that implements this.
I don't know of any implementations, but people have made online versions of cascade correlation, meaning the network is updated continually as new training data comes in rather than trained once on a static dataset.
I'm not sure exactly how those work; I believe they just add new neurons every so often. You could also backpropagate through the whole thing as if it were a normal neural network.
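To make the idea concrete, here is a minimal, hedged sketch (not based on any particular library) of the incremental step that makes cascade correlation attractive for this: installed hidden units keep their frozen input weights and act as fixed feature detectors, so when the fourth triplet arrives, only the linear output weights need re-training. All names and the delta-rule details are my own illustrative assumptions.

// Sketch: re-train only the output weights on the enlarged dataset.
// If the error then stalls, the full algorithm would train and install
// one new candidate hidden unit rather than retraining everything.
#include <cmath>
#include <vector>

struct Sample { std::vector<double> in; double target; };
struct Unit { std::vector<double> w; double bias; };  // frozen hidden unit

static double sigmoid(double s) { return 1.0 / (1.0 + std::exp(-s)); }

static double unitOut(const Unit& u, const std::vector<double>& x) {
    double s = u.bias;
    for (size_t i = 0; i < x.size(); ++i) s += u.w[i] * x[i];
    return sigmoid(s);
}

// Delta-rule re-training of the output weights only; hidden weights untouched.
void retrainOutput(const std::vector<Unit>& hidden, std::vector<double>& outW,
                   double& outBias, const std::vector<Sample>& data,
                   double lr = 0.5, int epochs = 2000) {
    for (int e = 0; e < epochs; ++e)
        for (const Sample& s : data) {
            double net = outBias;
            std::vector<double> h(hidden.size());
            for (size_t j = 0; j < hidden.size(); ++j) {
                h[j] = unitOut(hidden[j], s.in);
                net += outW[j] * h[j];
            }
            double out = sigmoid(net);
            double d = (s.target - out) * out * (1.0 - out);
            for (size_t j = 0; j < hidden.size(); ++j) outW[j] += lr * d * h[j];
            outBias += lr * d;
        }
}
// Usage: append {{1, 1}, 0} to the existing three XOR samples and call
// retrainOutput again -- no full retrain of the hidden layer is needed.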
I have built an experimental neural network - the idea being that it can look at a JPEG image and identify which parts of the image are musical notation.
To train the network I have used various images of pages cut into 100 x 100 pixel boxes, each labelled either 1.0 (i.e. contains notation) or 0.0 (does not contain notation).
On training the network, though, it settles into delivering a result of more or less 0.5 every time (giving a squared error of 0.25). The sigmoid (logistic) function is used for activation.
The network has 10,000 input neurons (one for each pixel of the 100 x 100 image) and 2,000 hidden neurons (each input is attached to both a 'row' and a 'column' hidden neuron).
There is one output neuron.
Would I get better results with two output neurons? (i.e. one which activates for 'is music' and one for 'is not music').
(You can see the C++ source for this here: https://github.com/mcmenaminadrian/musonet - though at any given time what is in the public repo may not be exactly what I am using on the machine.)
FWIW, the actual problem was a sign error in the code, as described in the comment: the two layers were fighting one another and, as you might expect, the outputs converged towards the middle.
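For anyone hitting the same symptom, here is a hedged sketch of the kind of output-layer update the fix amounts to; the function and variable names are illustrative, not taken from the musonet sources.

#include <cmath>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One weight update for an output neuron with sigmoid activation.
void updateWeight(double& weight, double input_activation, double net_input,
                  double target, double learning_rate) {
    double out   = sigmoid(net_input);
    double delta = (target - out) * out * (1.0 - out);
    weight += learning_rate * delta * input_activation;  // note: +=, not -=
    // Flipping this sign in one layer but not the other makes the layers
    // fight each other, and the outputs drift towards the middle (0.5).
}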
But ... I based my code on a book from the 1990s, the much-cited "Practical Neural Network Recipes in C++". There is nothing wrong with the book as such (though the C++ reflects the coding style of the time, with no use of STL classes and so on), but it comes from an era when neural nets were not as well understood or engineered as they are today, and so the basic design was quite flawed.
I'm now thinking about how best to implement a many-layered convolutional network, something not discussed in the book at all (indeed, it dismisses the idea of many-layered networks, relying instead on the fact that a single-hidden-layer NN is a universal approximator).
I got some interesting results with the single hidden layer NN, but it's not really all that useful for image processing.
I convert my image data to Caffe's DB format (LevelDB, LMDB) using C++, following the ImageNet conversion code as an example.
Does the data need to be shuffled? Can I write all my positives and then all my negatives to the DB, so the labels look like 00000000111111111, or should the data be shuffled so the labels look like 010101010110101011010?
How does Caffe sample data from the DB? Is it true that it uses a random subset of all the data with size = batch_size?
Should you shuffle the samples? Think about the learning process if you don't shuffle: Caffe sees only 0 samples at first. What do you expect the algorithm to deduce? Simply predict 0 all the time and everything is cool. If it sees plenty of 0s before it hits the first 1, Caffe will become very confident in always predicting 0, and it will be very difficult to move the model away from that point.
On the other hand, if it constantly sees a mix of 0s and 1s, it learns meaningful features for separating the classes from the beginning.
Bottom line: it is very advantageous to shuffle the training samples, especially when using SGD-based approaches.
AFAIK, Caffe does not randomly sample batch_size samples; rather, it goes over the input DB sequentially, batch_size samples at a time.
TL;DR
shuffle.
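If you are writing the DB yourself in C++, a single std::shuffle of the (path, label) list before the write loop is enough; here is a minimal sketch (the function name and fixed seed are my own choices):

#include <algorithm>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Shuffle (image path, label) pairs once before writing them to LMDB/LevelDB,
// so Caffe's sequential reads still see a well-mixed label stream.
void shuffleBeforeWrite(std::vector<std::pair<std::string, int>>& lines) {
    std::mt19937 rng(1234);  // fixed seed keeps the DB order reproducible
    std::shuffle(lines.begin(), lines.end(), rng);
    // ...then iterate over `lines` and write each entry to the DB exactly as
    // in the ImageNet conversion code you are already using.
}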
I am working on face recognition with deep neural networks. I am using the CASIA-WebFace database of 10,575 classes to train a deep CNN (the one used by CASIA, see the paper for details) with 10 convolution, 5 pooling and 1 fully connected layer. It uses the "ReLU" function for activation. I was able to train it successfully using Caffe and obtained the desired performance.
My problem is that I am unable to train/fine-tune the same CNN using the "PReLU" activation. At first, I thought that simply replacing "ReLU" with "PReLU" would do the job. However, neither fine-tuning (from the caffemodel that was learned with "ReLU") nor training from scratch worked.
To simplify the learning problem, I reduced the training dataset significantly, to only 50 classes. Even then, the CNN was unable to learn with "PReLU", whereas it was able to learn with "ReLU".
To confirm that my Caffe build works fine with "PReLU", I verified it by running simple networks (with both "ReLU" and "PReLU") on the CIFAR-10 data, and it worked.
I would like to know from the community if anyone has made similar observations, or if anyone can suggest a way to overcome this problem.
The main difference between the "ReLU" and "PReLU" activations is that the latter has a non-zero slope for negative input values, and that this slope can be learned from the data. It has been observed that these properties make the training more robust to the random initialization of the weights.
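Concretely, for input x and a learnable negative slope a (one per channel here, since channel_shared is false in the configuration below):

f(x) = x        if x > 0
f(x) = a * x    otherwise

so a = 0 reproduces "ReLU" exactly, which is what the initialization below exploits.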
I used "PReLU" activation for fine-tuning nets that were trained originally with "ReLU"s and I experienced faster and more robust convergence.
My suggestion is to replace "ReLU" with the following configuration
layer {
  name: "prelu"
  type: "PReLU"
  bottom: "my_bottom"
  top: "my_bottom"  # you can make it "in-place" to save memory
  # learned negative-slope parameter: full learning rate, no weight decay
  param { lr_mult: 1 decay_mult: 0 }
  prelu_param {
    filler { type: "constant" value: 0 }
    channel_shared: false
  }
}
Note that by initializing the negative slope to 0, the "PReLU" activations are in fact the same as "ReLU", so you start the fine-tuning from exactly the same spot as your original net.
Also note that I explicitly set the learning rate and decay rate coefficients (1 and 0 respectively); you might need to tweak these params a bit, though I believe setting decay_mult to any value other than zero is not wise.
I was able to train fine using PReLU for my network, albeit with a bit lower accuracy than using ReLU. And yes, I too simply swapped ReLU out for PReLU.
However, I have noticed almost consistently that PReLU converges much faster than ReLU. So maybe you need to lower your learning rate?
Two basic observations:
PReLU is not guaranteed to produce results more accurate than those with ReLU. It worked better with AlexNet on ImageNet, but this merely suggests further research and refinement; it doesn't necessarily transfer to other applications.
CIFAR, ImageNet, and CASIA-webface are not identical applications.
You've already done the proper first step, changing the learning rate. Next, I would try tweaking the solver settings: change the convergence epsilon, momentum, weight decay, or other internal tuning parameters. Sometimes it takes a tweak there to take advantage of even a minor topology change.
Change the input batch size. Are you allowed to alter the topology in other ways, such as altering a convolution layer? You might see what you get with a different approach to the CONV2 filters.
I am working on a vision project using C++ and OpenCV.
I need to classify a vector of 5 doubles. Is there a function in OpenCV to classify a vector of doubles?
And if no such function exists, what is the easiest way to classify a vector of doubles in C++?
I extracted 5 points from the edges of the human body (head, hands
and feet) and I need to train a neural network in order to identify
whether the object is a human being or not.
For that purpose a Viola-Jones classifier would be better, I think. However, OpenCV provides a Multi-Layer Perceptron (MLP) which you can easily use for this.
You have to create a big (>1000) training set which contains five doubles for each item. Then you hold out 5% or 10% of the elements of that set each time to create a test set.
See OpenCV's Multi-Layer Perceptron documentation for more information about the theory and implementation.
However, I warn you that with such a classifier you probably won't get good results, as 5 points are probably not sufficient and you may get many false positives.
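For completeness, here is a minimal sketch of what that might look like with OpenCV 3's cv::ml::ANN_MLP; the layer sizes, training parameters and the tiny inline dataset are placeholder assumptions, not recommendations.

#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>

int main() {
    // Each row is one sample of 5 features; ANN_MLP requires CV_32F.
    cv::Mat samples = (cv::Mat_<float>(4, 5) <<
        0.1f, 0.9f, 0.3f, 0.2f, 0.8f,   // human
        0.2f, 0.8f, 0.4f, 0.1f, 0.7f,   // human
        0.9f, 0.1f, 0.8f, 0.9f, 0.2f,   // not human
        0.8f, 0.2f, 0.9f, 0.8f, 0.1f);  // not human
    // One output neuron: 1 = human, -1 = not human (SIGMOID_SYM is symmetric).
    cv::Mat responses = (cv::Mat_<float>(4, 1) << 1.f, 1.f, -1.f, -1.f);

    cv::Ptr<cv::ml::ANN_MLP> mlp = cv::ml::ANN_MLP::create();
    cv::Mat layers = (cv::Mat_<int>(1, 3) << 5, 8, 1);  // 5 in, 8 hidden, 1 out
    mlp->setLayerSizes(layers);
    mlp->setActivationFunction(cv::ml::ANN_MLP::SIGMOID_SYM, 1.0, 1.0);
    mlp->setTrainMethod(cv::ml::ANN_MLP::BACKPROP, 0.1, 0.1);

    mlp->train(cv::ml::TrainData::create(samples, cv::ml::ROW_SAMPLE, responses));

    cv::Mat query = (cv::Mat_<float>(1, 5) << 0.15f, 0.85f, 0.35f, 0.15f, 0.75f);
    cv::Mat output;
    mlp->predict(query, output);  // output > 0 -> "human" under this encoding
    return 0;
}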
I have recently implemented a typical 3 layer neural network (input -> hidden -> output) and I'm using the sigmoid function for activation. So far, the host program has 3 modes:
Creation, which seems to work fine. It creates a network with a specified number of input, hidden and output neurons, and initializes the weights to either random values or zero.
Training, which loads a dataset, computes the output of the network, then backpropagates the error and updates the weights. As far as I can tell, this works OK: the weights change, though not dramatically, after training on the dataset.
Processing, which seems to work OK. However, the output for the dataset used for training, or for any other dataset, is very bad. It's usually either a continuous stream of 1s with an occasional 0.999999, or every output value for every input is 0.9999 with only the last digits differing between inputs. As far as I could tell, there was no correlation between those last two digits and what was supposed to be output.
How should I go about figuring out what's not working right?
You need to find a set of parameters (number of neurons, learning rate, number of iterations for training) that works well for classifying previously unseen data. People often achieve this by separating their data into three groups: training, validation and testing.
Whatever you decide to do, just remember that it really doesn't make sense to test on the same data with which you trained, because any classification method that is close to reasonable should get essentially everything right under such a setup.
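A minimal sketch of such a three-way split (the 60/20/20 ratio and the Sample type are arbitrary placeholders):

#include <algorithm>
#include <random>
#include <vector>

struct Sample { /* inputs and target for one training example */ };

// Shuffle the data once, then carve it into training/validation/test sets.
void split(std::vector<Sample> data,
           std::vector<Sample>& train,
           std::vector<Sample>& val,
           std::vector<Sample>& test) {
    std::mt19937 rng(42);
    std::shuffle(data.begin(), data.end(), rng);
    size_t nTrain = data.size() * 60 / 100;
    size_t nVal   = data.size() * 20 / 100;
    train.assign(data.begin(),                data.begin() + nTrain);
    val.assign(data.begin() + nTrain,         data.begin() + nTrain + nVal);
    test.assign(data.begin() + nTrain + nVal, data.end());
    // Tune neuron count / learning rate / iteration count on `val`;
    // report the final accuracy once, on `test`.
}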