Deep Learning: Using a pretrained network's earlier activations - computer-vision

I have around 25k images belonging to 14 different classes (different kinds of neck-lines, e.g. v-neck, round neck, etc). The images mainly contain the top part of the apparel and/or the face of the model. Here are some examples:
In order to classify them, I thought of extracting the features after the 1st block of VGG16 (pretrained on ImageNet), because the feature maps of the earlier blocks will be capturing things like lines, shapes, etc. Here is the model.summary():
Layer (type) Output Shape Param #
=================================================================
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 802816) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 3288338432
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
predictions (Dense) (None, 16) 65552
=================================================================
Total params: 3,305,224,016
Trainable params: 3,305,224,016
Non-trainable params: 0
The problem is that the total number of parameters is huge. Can you please advise, considering my specific dataset, how to reduce that?

The problem is that a Dense layer creates one weight per input.
So, because you've got an almost full-resolution feature map and haven't reduced its size significantly, adding Flatten + Dense results in this absurd number of weights: block1_pool outputs 112 × 112 × 64 = 802,816 values, so fc1 alone needs 802,816 × 4096 + 4096 ≈ 3.29 billion parameters, which is almost your entire total.
I'm not sure I understand why you want only the first block. That block identifies very rudimentary little shapes without considering their relations with each other. And of course there are no blocks earlier than block 1.
I do recommend that you use more blocks, both to identify more elaborate features and to reduce the image size. The best option is simply to take the entire VGG16 model with include_top=False and trainable=False, and add your own trainable top layers.
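A minimal sketch of that approach; the Dense size (256), the pooling choice and the 14-class softmax head are assumptions for this dataset, not fixed choices:

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

# Frozen VGG16 convolutional base + small trainable head
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                      # freeze all convolutional blocks

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),        # avoids a huge Flatten + Dense
    layers.Dense(256, activation='relu'),
    layers.Dense(14, activation='softmax'), # 14 neck-line classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])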
Now, if you really, really want to use so few blocks and not reduce the image size, then you can try adding a GlobalMaxPooling2D layer. This keeps only the maximum value of each feature map over the entire image. It may or may not be useful, depending on whether what you want can be identified with so few convolutions. (Another option is GlobalAveragePooling2D, but I believe it is even less effective in this case.) Either way, more blocks will lead to better results.
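For completeness, a sketch of that few-blocks-plus-global-pooling variant, cutting at block1_pool as in your summary (the head is again an assumption):

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

vgg = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
# Keep only block 1, as in the question's summary
block1 = keras.Model(vgg.input, vgg.get_layer('block1_pool').output)
block1.trainable = False

model = keras.Sequential([
    block1,
    layers.GlobalMaxPooling2D(),            # (None, 64) instead of (None, 802816)
    layers.Dense(14, activation='softmax'),
])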

You have to make your own feature extractor by removing all the dense layers from the original VGG and building your own dense layers instead. I suggest adding two dense layers: fc1 with 1024 nodes and fc2 with 512 nodes. Of course you also have to add a third dense layer that acts as a temporary classifier so you can train the feature extractor. Then train only these three layers, keeping the rest of the VGG at trainable=False; that will reduce the parameters. After training these layers, remove the last one, and you have your feature extractor. Now for every image you get 512 features that you can feed into a simple NN or an SVM of your choice as the classifier.
You should have a GPU with at least 8 GB of memory.
In the Keras blog you can find how to fine-tune your last layers in order to build your own feature extractor, which is more or less what you are looking for: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
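A possible sketch of that procedure (the 14-class temporary head and the ReLU activations are assumptions; the 1024/512 sizes are as suggested above):

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False                                   # keep the VGG blocks frozen

x = layers.Flatten()(base.output)
x = layers.Dense(1024, activation='relu', name='fc1')(x)
features = layers.Dense(512, activation='relu', name='fc2')(x)
output = layers.Dense(14, activation='softmax', name='tmp_classifier')(features)

full = keras.Model(base.input, output)
full.compile(optimizer='adam', loss='categorical_crossentropy')
# full.fit(train_images, train_labels, ...)              # trains only fc1, fc2 and the temporary classifier

# After training, drop the classifier to obtain a 512-dimensional feature extractor
extractor = keras.Model(base.input, features)
# feats = extractor.predict(images)                      # feed these into an SVM or a small NN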
Hope it helps!!

You're on the right track, but you've been confused (as I once was) by an inconsistency in the way people describe neural nets. In Keras docs, the "top" layer is what other docs call the "bottom", "last", "final", or "deepest" layer. It's the layer that calculates the final probabilities. To implement transfer learning, you freeze the early layers (where images enter the network and are convolved) and replace or re-train the final layer (where the answer comes out). Keras calls that final layer "top". So in Keras-talk, you either instantiate the model with include_top=False, or you remove a layer with model.pop().
I hope that helps.
The response from Eric is correct. I second his recommendation that you read this blog post by François Chollet, the creator of Keras: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

If you are using a pre-trained model, you don't have to re-train the lower layers; just keep the last few classification layers trainable. For example, if you'd like to freeze the first 6 layers you can call:
for idx, layer in enumerate(model.layers[:6]):
    print('Make layer {} {} untrainable.'.format(idx, layer.name))
    layer.trainable = False
Then if you call model.summary(), you'll see that you have far fewer trainable parameters, which not only makes training faster but usually also gives better results when you do not alter the pre-trained convolutional layers.

Related

Reshaping the input layer to a single channel and multiple images

In some reference code I have picked up, there is:
net_->input_blobs()[0]->Reshape(1, 3, height, width);
My prototxt has:
input_shape {
  dim: 1
  dim: 3
  dim: 260
  dim: 347
}
I have been indirectly informed that the model provided has been tuned for greyscale (we have both a colour and a greyscale prototxt), and the currently-used Python code uses a greyscaled input with three identical channels.
Now I want to do two things, either together or separately: process 4 images in a single call to net_->Forward(), and pass these four images in as one-channel greyscale. So, first, choosing a single channel:
net_->input_blobs()[0]->Reshape(1, 1, height, width);
What are the repercussions of changing the number of channels? How do all my layers react? Will it work? If it works, will a one-channel net be faster?
Second, choosing four images:
net_->input_blobs()[0]->Reshape(4, 3, height, width);
I have a feeling that won't work, and I should be looking at increasing the number of input_blobs, but how to do that? Or what is the correct approach?
Working with a single channel rather than three identical ones should be faster (fewer multiply-add operations). Since this is done at the finest scale, it might even have a noticeable impact on run time.
Feeding 4 images as a single batch is usually faster than processing each image separately as a batch of one (due to internal optimization of the computation to work with batches).
Bottom line: you should get better run time running a single batch of four images. If the input is three identical channels, it is better to modify the model to work with only one.
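If it helps, the same reshape can be sketched through pycaffe rather than the C++ API (the blob name 'data', the file names and the preprocessing are assumptions; the first convolution layer must also be redefined, or its weights summed over the channel axis, before it will accept one channel):

import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

height, width = 260, 347
net.blobs['data'].reshape(4, 1, height, width)   # batch of 4 single-channel images
net.reshape()                                    # propagate the new shape through the net

# imgs: four preprocessed greyscale images of shape (height, width)
imgs = [np.zeros((height, width), dtype=np.float32) for _ in range(4)]
net.blobs['data'].data[...] = np.stack(imgs)[:, np.newaxis, :, :]
out = net.forward()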

OpenCV Neural network for images processing

I am new to the AI world and am trying to get some practice.
It looks like I need some third-party experience.
Let's say I need to get rid of image defects (the actual task is more tricky).
I hope that a trained NN will be able to interpolate the defect area.
For these reasons I tried to create a simple neural network.
Its input is a grayscale image with a defect (72*54), and the target is the same image with no defect.
The hidden layer has 2*72*54 neurons.
Main piece of code:
cv::Ptr<cv::ml::ANN_MLP> ann = cv::ml::ANN_MLP::create();
int inputsCount = imageSizes.width * imageSizes.height;
std::vector<int> layerSizes = { inputsCount, inputsCount * 2, inputsCount};
ann->setLayerSizes(layerSizes);
ann->setActivationFunction(cv::ml::ANN_MLP::SIGMOID_SYM);
cv::TermCriteria tc(cv::TermCriteria::MAX_ITER + cv::TermCriteria::EPS, 50, 0.1);
ann->setTermCriteria(tc);
ann->setTrainMethod(cv::ml::ANN_MLP::BACKPROP, 0.0001);
std::cout << "Result : " << ann->train(trainData, cv::ml::ROW_SAMPLE, resData) << std::endl;
ann->predict(trainData, predicted);
My training dataset looks like this (example image pairs not shown here).
Trained on a 10-item dataset, the NN gives bad results on these (same) inputs. I tried different parameters.
But trained on only 2 images, the NN gets close to the expected output (on the training data).
I suppose the approach itself is not inappropriate and the solution is just not so easy.
Maybe someone has some advice about the parameters, the neural network architecture, or the whole approach.
It seems that the termination criteria were fine for just two samples but were not good enough when training with a larger number of samples. Do try adjusting them, and also the learning rate.
Judging by the quality of the pixels that have been restored properly, the network architecture seems to be fine for this task. Once the network works well on 10 samples, I strongly recommend adding more training samples.
The chief problem is that you have way too little data for the given network.
Your NN is fully connected. The weights for pixel (0,0) are entirely separate from those of pixel (1,0), and pixel (0,1) has yet another set of weights. And with so many nodes you have a lot of weights: roughly 3888 inputs × 7776 hidden neurons plus 7776 × 3888 outputs, i.e. about 60 million weights, while 10 images supply only about 39,000 target pixels. So while you have plenty of pixels in 10 images, you have nowhere near enough data for all those weights.
A Convolutional Neural Network has far fewer weights, as many of its weights are reused. That means that in training, each of those weights is trained by many pixels from every training image.
Not that I'd expect this to work well with just 10 images. The human expectation is derived from years of human vision, literally billions of images.
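For illustration, a minimal convolutional sketch of the same task in Keras (OpenCV's ANN_MLP has no convolutional layers, so this assumes switching frameworks; the layer sizes are not tuned):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Small fully convolutional net mapping a 72x54 defective image to a repaired one
inp = keras.Input(shape=(54, 72, 1))          # height x width x 1 greyscale
x = layers.Conv2D(16, 3, padding='same', activation='relu')(inp)
x = layers.Conv2D(16, 3, padding='same', activation='relu')(x)
out = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)

model = keras.Model(inp, out)
model.compile(optimizer='adam', loss='mse')
# x_defect, x_clean: arrays of shape (num_samples, 54, 72, 1), scaled to [0, 1]
# model.fit(x_defect, x_clean, epochs=50, batch_size=4)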

Tensorflow for audio signal processing - detecting feature intensities and delays

For my studies I need to train a deep NN to identify certain sounds and their delays. We have 1x25K sample points (microphone output) and need quantification of the events and their intensity.
In order to make the model look more like the MNIST training procedure, for now we use classification for the quantification (if there are two events with intensities of 5 and 3, the output would be 8, plus the delays vector).
We tried feeding the data [trainNum, 25000] into a 3-layer NN with 250, 100 and 50 neurons and the Adam optimizer, with a three-class one-hot output (100 / 010 / 001, shape [trainNum, 3]). The cost does not drop below 400 and accuracy is 30%.
I would appreciate any help and comments.
Additional information: 2700 samples, 270 batches, 10 epochs. We used the following tutorial and changed the data from MNIST to our sound data - https://pythonprogramming.net/tensorflow-neural-network-session-machine-learning-tutorial/
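A minimal sketch of the setup described above (layer sizes as stated; the activations, loss and other details are assumptions), written in Keras rather than the low-level TensorFlow API used in the tutorial:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(250, activation='relu', input_shape=(25000,)),  # one 25k-sample waveform per row
    layers.Dense(100, activation='relu'),
    layers.Dense(50, activation='relu'),
    layers.Dense(3, activation='softmax'),                       # one-hot classes 100 / 010 / 001
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=10, epochs=10)          # 2700 samples / 270 batches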
Thank you in advance
All the best,
AA

Scikit-learn RandomForestClassifier() feature selection, just select the train set?

I'm using scikit-learn for machine learning.
I have 800 samples with 2048 features, so I want to reduce the number of features to hopefully get better accuracy.
It is a multiclass problem (class 0-5), and the features consists of 1's and 0's: [1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0....,0]
I'm using the ensemble method, RandomForestClassifier().
Should I perform feature selection on just the training data?
Is it enough if I'm using this code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = RandomForestClassifier(n_estimators=200,
                             warm_start=True,
                             criterion='gini',
                             max_depth=13)
clf.fit(X_train, y_train).transform(X_train)
predicted = clf.predict(X_test)
expected = y_test
confusionMatrix = metrics.confusion_matrix(expected, predicted)
But the accuracy didn't get any higher. Is everything OK in the code, or am I doing something wrong?
I'll be very grateful for your help.
I'm not sure I understood your question correctly, so I'll answer what I think I understood =)
First, reducing the dimension of your features (from 2048 to 500, for example) might not give you better results. It all depends on the capacity of your model to capture the geometry of your data. For example, you can get much better results with a linear model if you reduce dimension through non-linear methods that capture a particular geometry and 'linearize' it, instead of applying the linear model directly to the raw data. But that is because your data is intrinsically non-linear, and the linear model is therefore not good at capturing this geometry in the original space (think of a circle in 2D).
In the code you gave, you did not reduce dimension, though; you split the data into two datasets (the feature dimension is still 2048, only the number of samples changed). Training on a smaller dataset most of the time results in worse accuracy (data = information; when you leave some out, you lose information). But splitting the data lets you test for overfitting in particular, which is very important. And once the best parameters are chosen (see cross-validation), you should train on all the data you have!
Given your 0.7*800=560 samples, I think a depth of 13 is pretty big and you might overfit. You may want to play with this parameter first if you want to improve your accuracy!
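One possible way to tune max_depth (and check for overfitting) with cross-validation; the grid values below are illustrative only:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 7, 9, 11, 13, None]}
search = GridSearchCV(RandomForestClassifier(n_estimators=200), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)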
1) Often reducing the feature space does not help with accuracy, and using a regularized classifier leads to better results.
2) To do feature selection, you need two methods: one to reduce the set of features, another that does the actual supervised task (classification here).
Have you tried just using the standard classifiers? Clearly you tried the RF, but I'd also try a linear method like LinearSVC/LogisticRegression or a kernel SVC.
If you want to do feature selection, what you need to do is something like this:
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

feature_selector = LinearSVC(penalty='l1', dual=False)  # or maybe start with SelectKBest()
feature_selector.fit(X_train, y_train)
# (in recent scikit-learn versions, wrap the LinearSVC in SelectFromModel to get transform())
X_train_reduced = feature_selector.transform(X_train)
X_test_reduced = feature_selector.transform(X_test)
classifier = RandomForestClassifier().fit(X_train_reduced, y_train)
prediction = classifier.predict(X_test_reduced)
Or you use a pipeline, as here: http://scikit-learn.org/dev/auto_examples/feature_selection/feature_selection_pipeline.html
Maybe we should add a version without the pipeline to the examples?
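A rough sketch of the pipeline variant, here using SelectKBest as the selector (k=100 is an arbitrary choice):

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('select', SelectKBest(chi2, k=100)),             # chi2 works here since the features are 0/1
    ('rf', RandomForestClassifier(n_estimators=200)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))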
[cross-posted from the mailing list where this was originally asked]
Dimensionality reduction or feature selection is definitely advisable if you have more features than samples. You could look into Principal Component Analysis and other modules in sklearn.decomposition to reduce the number of features. There is also a useful section on Feature Selection in the scikit-learn documentation.
After fitting sklearn.decomposition.PCA, you could inspect the explained_variance_ratio_ to determine an advisable number of features (n_components) to reduce to (the point of PCA here is to find a reduced number of features that captures most of the variance in your original feature space). Some might like to retain features that have a cumulative explained_variance_ratio_ above 0.9, 0.95 etc, some like to drop features beyond which the explained_variance_ratio_ drops suddenly. Then refit the PCA with the n_components you like, transform your X_train and X_test, and fit your classifier as above.
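A sketch of that PCA workflow (the 0.95 variance threshold is just one possible choice):

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pca = PCA().fit(X_train)
print(pca.explained_variance_ratio_.cumsum())    # inspect this to pick n_components

pca = PCA(n_components=0.95).fit(X_train)        # keep components explaining 95% of the variance
clf = RandomForestClassifier(n_estimators=200).fit(pca.transform(X_train), y_train)
print(clf.score(pca.transform(X_test), y_test))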

Check for similarity on different size images

I have a video source that produces many streams for different devices (such as HD televisions, pads, smart phones, etc.), and each of them has to be checked against the others for similarity. Each video stream delivers 50 images per second, one image every 20 milliseconds.
Let's take for instance img1 coming from stream1 at time ts1=1, img2 coming from stream2 at ts2=1, and img1.1 taken from stream1 at ts=2 (20 milliseconds later than ts=1); the comparison result should look something like this:
compare(img1, img1) = 1 same image same size
compare(img1, img2) = 0.9 same image different size
compare(img1, img1.1) = 0.8 different images same size
Ideally this should be done in real time, i.e. within 20 milliseconds. The goal is to understand whether the streams are out of synchronization. I have already implemented some comparison methods (none of them works for this case yet):
1) histogram (SSE and OpenCV CUDA), result: compare(img1, img2) ~= compare(img1, img1.1)
2) PSNR (SSE and OpenCV CUDA), result: compare(img1, img2) < compare(img1, img1.1)
3) SSIM (SSE and OpenCV CUDA), same result as PSNR
Maybe I get bad results because of the resize interpolation method?
Is it possible to realize a comparison method that fulfill my requirements, any ideas?
I'm afraid that you're running into a Real Problem (TM). This is not a trivial let's-give-it-to-the-intern problem.
The main challenge is that you can't do a brute-force comparison. HD images are 3 MB or more, and you're talking about O(N*M) comparisons (in time and across streams).
What you essentially need is a fingerprint that's robust against resizing but time-variant. And since you didn't realize that (the histogram idea, for instance, is quite time-stable), you didn't include the necessary information in this question.
So this isn't a C++ question, really. You need to understand your inputs.
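One illustrative (untested for this workload) fingerprint along those lines, sketched in Python/OpenCV: downscale each frame to a tiny fixed-size greyscale thumbnail, so the fingerprint is independent of the stream's resolution but still changes from frame to frame. The thumbnail size is an arbitrary choice:

import cv2
import numpy as np

def fingerprint(frame, size=(16, 9)):
    grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    thumb = cv2.resize(grey, size, interpolation=cv2.INTER_AREA)
    return thumb.astype(np.float32) / 255.0

def similarity(fp_a, fp_b):
    # 1.0 for identical thumbnails, lower as they diverge
    return 1.0 - float(np.mean(np.abs(fp_a - fp_b)))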