I would like to know if anyone has had success labeling images with a continuous variable in the AutoML Vision platform.
Specifically, I would like to predict the height of a sandcastle from a bird's-eye view photograph. I would train the model on bird's-eye view photographs of sandcastles labeled with the height in inches. I have 10,000 images in my dataset. The range of heights in my dataset is 1 cm to 110 cm, so the variable is continuous but bounded.
Is this achievable through Google AutoML Vision?
Thank you!
As of now, there is no feature in AutoML Vision that lets you annotate images with continuous variables. As far as I understand, the labels are treated individually, and the model requires roughly "1000 training images per label. The minimum per label is 10, or 50 for advanced models".
A feasible workaround would be to discretize the heights into ranges (1 cm-20 cm, 21 cm-50 cm, etc.) and treat each range as its own label.
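A minimal sketch of that binning step, assuming the heights live in a CSV next to the Cloud Storage URIs of the images (the file and column names here are made up):

```python
import pandas as pd

# Hypothetical input: one row per image with its measured height in cm.
df = pd.read_csv("sandcastles.csv")  # assumed columns: gcs_uri, height_cm

# Bin the continuous height into a handful of discrete ranges that
# AutoML Vision can treat as ordinary class labels.
bins = [0, 20, 50, 80, 110]
labels = ["1-20cm", "21-50cm", "51-80cm", "81-110cm"]
df["label"] = pd.cut(df["height_cm"], bins=bins, labels=labels)

# AutoML Vision's import CSV is essentially: gs://bucket/image.jpg,label
df[["gcs_uri", "label"]].to_csv("automl_import.csv", index=False, header=False)
```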
Related
Currently training a YOLO object detection model. I have 2 versions of the same dataset:
Contains full images, bounding boxes and labels
Contains segmented instances and labels
Which version is better to use? I'm inclined to go with the second, but I'm worried that the pixels around the object that still fall within the bounding box could be important.
I've created a number of models using Google AutoML and I want to make sure I'm interpreting the output data correctly. This is for a linear regression model predicting website conversion rates on any given day.
First, the model reports a model-level feature importance once training has completed. This seems to tell me which feature was most important to the predictions overall, but not necessarily whether that feature drives the largest changes in the predicted value?
Secondly, we have a set of local feature weights, which I think give the contribution each feature made to an individual prediction. So if bounce rate has a local weight of -0.002, can we say that the bounce rate for that row decreased the prediction by 0.002? Is there a correct way to aggregate these weights across rows, or is it just the range?
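For what it's worth, one common way to turn local attributions into a global summary is the mean absolute attribution per feature, alongside the signed mean and the range. A sketch, assuming the local weights can be exported to a table with one column per feature (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical export of local attributions: one row per prediction,
# one column per feature, values are the signed contribution of that feature.
local_attr = pd.read_csv("local_attributions.csv")

summary = pd.DataFrame({
    "mean_signed": local_attr.mean(),      # average direction of the effect
    "mean_abs": local_attr.abs().mean(),   # typical magnitude, a common global-importance proxy
    "range": local_attr.max() - local_attr.min(),
}).sort_values("mean_abs", ascending=False)

print(summary)
```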
I have a dataset of around 20K images that are human labelled. Labels are as follows:
Label = 1 if the image is sharp and well lit, and
Label = 0 for blurry/out-of-focus/grainy images.
The images are of documents such as Identity cards.
I want to build a Computer Vision model that can do the classification task.
I tried using VGG-16 for transfer learning on this task, but it did not give good results (precision = 0.65 and recall = 0.73). My sense is that VGG-16 is not suitable for this task: it is trained on ImageNet, whose low-level features are very different from document images. Interestingly, the model is under-fitting.
We also tried EfficientNet-B7. Although that model performed decently on the training and validation sets, test performance remained poor.
Can someone suggest a more suitable model to try for this task?
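For reference, a transfer-learning baseline of the kind described above might look roughly like the following; this is only a sketch in Keras, and the classification head, input size, and hyperparameters are assumptions rather than the setup actually used:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Frozen ImageNet backbone; only the small head below is trained.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # 1 = sharp/well lit, 0 = blurry/grainy
])

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```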
I think your problem with VGG and other NNs is the resizing of the images:
VGG expects a 224x224 input image. I assume your dataset has a much larger resolution, so you significantly downscale the input images before feeding them to your network.
What happens to blur/noise when you downscale an image?
Blurry and noisy images become sharper and cleaner as you decrease the resolution. Therefore, in many of your training examples the net sees a perfectly good image that is nevertheless labeled "corrupt". This is not good for training.
An interesting experiment would be to see which types of degradation your net classifies correctly and which it fails on. You report 65% precision / 73% recall; can you look at the classified images at that operating point and group them by degradation type?
That is, what is the precision/recall for only blurry images? What is it for noisy images? What about grainy images?
What can you do?
Do not resize the images at all! If the network needs a fixed-size input, crop rather than resize (a small cropping sketch follows below).
Taking advantage of this "resizing" effect, you can approach the problem with a "discriminator": train a network to discriminate between an image and its downscaled version. If the image is sharp and clean, the discriminator will find the task difficult; for blurred/noisy images the task should be rather easy.
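Here is a minimal sketch of the crop-instead-of-resize idea from the first suggestion, assuming images are loaded as numpy arrays at least as large as the crop size (the crop size and number of crops are arbitrary choices):

```python
import numpy as np

def random_crops(image, crop_size=224, n_crops=4, rng=None):
    """Take fixed-size crops so blur/noise statistics are preserved,
    instead of downscaling the whole image to the network's input size."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    assert h >= crop_size and w >= crop_size, "image smaller than crop size"
    crops = []
    for _ in range(n_crops):
        top = rng.integers(0, h - crop_size + 1)
        left = rng.integers(0, w - crop_size + 1)
        crops.append(image[top:top + crop_size, left:left + crop_size])
    return np.stack(crops)  # stack of fixed-size crops for the network
```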
For this task, I think using OpenCV is sufficient to solve the issue. In fact, comparing the variance of the Laplacian of the image with a threshold (cv2.Laplacian(image, cv2.CV_64F).var()) will give a decision on whether an image is blurred or not.
You can find an explanation of the method and the code in the following tutorial: detection with opencv
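A self-contained version of that check might look like the following; the threshold of 100 is just a commonly used starting point and would need tuning on your documents:

```python
import cv2

def is_blurry(image_path, threshold=100.0):
    """Return True if the image looks blurry, based on the variance of its Laplacian."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    focus_measure = cv2.Laplacian(gray, cv2.CV_64F).var()
    return focus_measure < threshold

print(is_blurry("id_card.jpg"))  # hypothetical file name
```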
I think that training a classifier that takes the output of one of your neural network models and the variance of the Laplacian as features would improve the classification results.
I also recommend experimenting with ResNet and DenseNet.
I would look at the change in color between adjacent pixels and then rank the photos by the median delta between pixels. A sharp change from RGB (0,0,0) to (255,255,255) across adjoining pixels would give the maximum possible score; the more blur there is, the lower the score.
I have done this in the past trying to estimate areas of fields with success.
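A rough sketch of that ranking idea, assuming grayscale images loaded as numpy arrays (here using the median absolute difference between horizontally and vertically adjacent pixels):

```python
import numpy as np

def sharpness_score(gray):
    """Median absolute difference between adjacent pixels;
    higher values suggest sharper images, lower values suggest more blur."""
    gray = gray.astype(np.float32)
    dx = np.abs(np.diff(gray, axis=1))  # horizontal neighbor deltas
    dy = np.abs(np.diff(gray, axis=0))  # vertical neighbor deltas
    return float(np.median(np.concatenate([dx.ravel(), dy.ravel()])))

# Rank a list of (name, image) pairs from sharpest to blurriest:
# ranked = sorted(images, key=lambda item: sharpness_score(item[1]), reverse=True)
```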
I am trying to train a pedestrian detector using dlib and the INRIA Person Dataset.
So far I have used 27 images; the training is fast, but the results are unsatisfying (pedestrians are rarely recognized in other images). Here is the result of my training using the train_object_detector program that comes with dlib (in the /examples directory):
Saving trained detector to object_detector.svm
Testing detector on training data...
Test detector (precision,recall,AP): 1 0.653061 0.653061
Parameters used:
threads: 4
C: 1
eps: 0.01
target-size: 6400
detection window width: 47
detection window height: 137
upsample this many times : 0
I am aware that more images need to be added to the training set in order to get better results, but before doing that I want to be sure of the meaning of every value printed in the result (precision, recall, AP, C, eps, ...). I am also wondering if you have any recommendations regarding the training: what images should I choose? How many images are needed? Do I need to annotate every object in the image? Do I need to ignore some regions in the image?
One last question: is there any pre-trained detector (svm file) that I can use to compare my results against?
Thank you for your answers
I am not familiar with dlib in particular, but let me tell you that you will not get good results with 27 images. In order to generalize well, your classifier needs to see many images with a variety of data. It won't do you any good to supply it with 10,000 images of the same person, wearing the same outfit. You want different people, clothing, settings, angles, and lighting. The INRIA dataset should cover most of those.
Your detection window dimensions and upsampling settings determine how large people must appear in the image for your trained classifier to detect them reliably. Your current settings will only detect people at a single scale, where they are around 137 pixels tall and 47 pixels wide. If you upsample even once, you'll be able to detect people at a smaller scale (upsampling makes a person look bigger than they are). I suggest you use a larger dataset and increase the upsampling number (how much each upsampling step magnifies is another discussion; that appears to be built into the library). Things will take longer, but that is the nature of training classifiers: tweak parameters, retrain, compare the results.
As for precision and recall, I'll refer you to this Wikipedia article. These are not parameters but results of your classifier; you want both to be as close to 1 as possible.
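If you end up using dlib's Python bindings instead of the example program, the same knobs from the printout (C, eps, upsampling) are exposed through the training options. A sketch, assuming a recent dlib and annotations in dlib's imglab XML format (the file names and parameter values are placeholders):

```python
import dlib

options = dlib.simple_object_detector_training_options()
options.C = 5                # soft-margin parameter; larger fits the training data more tightly
options.epsilon = 0.01       # "eps" in the printout: SVM stopping tolerance
options.num_threads = 4
options.upsample_limit = 1   # upsample images so smaller pedestrians become detectable
options.be_verbose = True

dlib.train_simple_object_detector("training.xml", "object_detector.svm", options)

# Prints precision, recall and average precision, as in the output above.
print(dlib.test_simple_object_detector("testing.xml", "object_detector.svm"))
```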
Currently I am following the Caffe ImageNet example, but applying it to my own training dataset. My dataset has about 2000 classes with roughly 10-50 images per class. I am classifying vehicle images; the images were cropped to the front of the vehicle, so the images within each class have the same size and (almost) the same viewing angle.
I've tried the ImageNet schema, but it doesn't seem to work well: after about 3000 iterations the accuracy dropped to 0. So I am wondering, is there a practical guide on how to tune the schema?
You can delete the last layer of the ImageNet network, add your own last layer with a different name (sized to fit your number of classes), give it a higher learning rate, and set a lower overall learning rate. There is an official example here: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
However, if the accuracy dropped to 0, you should check the model parameters first; it may be a numerical overflow (for example, from too high a learning rate).
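A sketch of that fine-tuning recipe through pycaffe, assuming the last InnerProduct layer in your train prototxt has already been renamed and resized for your classes (the file names here are placeholders); copy_from only copies weights for layers whose names match the pretrained model, so the renamed last layer starts from scratch:

```python
import caffe

caffe.set_mode_gpu()

# solver.prototxt points at a net in which the final InnerProduct layer has
# been renamed (e.g. "fc8_vehicles"), its num_output set to your 2000 classes,
# and its lr_mult raised relative to the rest of the network; base_lr is lowered.
solver = caffe.SGDSolver("solver.prototxt")

# Weights are copied only for layers whose names match the pretrained model,
# so the renamed last layer is initialized fresh.
solver.net.copy_from("bvlc_reference_caffenet.caffemodel")

solver.solve()
```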