I've created a number of models using Google AutoML and I want to make sure I'm interpreting the output data correctly. This is for a linear regression model predicting website conversion rates on any given day.
First, the model reports a model-level feature importance once training has completed. This seems to tell me which feature was most important in predicting the target value, but not necessarily whether it contributes most to larger changes in that value?
Secondly, we have a set of local feature weights, which I think tell me the contribution each feature made to an individual prediction. So if bounce rate has a weight of -0.002, can we say that the bounce rate for that row decreased the prediction by 0.002? Is there a correct way to aggregate these across rows, or is it just the range?
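For illustration, here is a minimal pandas sketch (column names and numbers are made up) of two common ways I could aggregate the local attributions, rather than taking the range:

# Hypothetical sketch: aggregating per-row local attributions exported from
# the model's explanations; the column names and values here are invented.
import pandas as pd

# One row per prediction, one column per feature attribution.
attributions = pd.DataFrame({
    "bounce_rate": [-0.002, -0.004, 0.001],
    "sessions":    [ 0.010,  0.007, 0.012],
    "avg_time":    [ 0.003, -0.001, 0.002],
})

# Mean absolute attribution: how much each feature moves predictions on average
# (sign is discarded, so the direction of the effect is lost).
global_importance = attributions.abs().mean().sort_values(ascending=False)

# Mean signed attribution: keeps direction, but positive and negative
# contributions can cancel out across rows.
mean_signed = attributions.mean()

print(global_importance)
print(mean_signed)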
I trained one Vertex AI forecasting AutoML model with the target column as String and the other numeric input features as String, then I trained another AutoML model with the target column as Float and the other input features as Integer.
The predictions are different for the two models. The data is the same; only the datatypes/schema changed.
Google documentation says:
When you train a model with a feature with a numeric transformation, Vertex AI applies the following data transformations to the feature, and uses any that provide signal for training:
The value converted to float32.
So the data should be the same for both models even after the transformation.
Why would results be different? Is it possible?
I have followed the steps to build a forecasting model as shown in Build an AutoML Forecasting Model with Vertex AI, and reached the conclusion that Vertex AI compresses a lot of the steps of prediction model generation so that it can be operated easily by users.
I think the most reasonable answer for your observation about strings versus numeric values lies in the way data processing is performed to generate the prediction models. I don't think you will find this in the Vertex AI documentation, as it would mean disclosing how Vertex AI handles its feature engineering and training steps to generate the models, which is proprietary.
Regardless, let's speculate a bit. I think the difference between datatype conversions might arise when the data is converted and passed to the algorithm for processing. Take a linear regression as an example: the slightest variation in data conversion can affect the outcome of your prediction model, which could also be what is happening here.
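To make that concrete, here is a toy scikit-learn sketch (not Vertex AI internals; everything below is an assumption for illustration) of how the same column treated as a float versus as a string can produce different linear models, because the string version ends up encoded as categories:

# Toy sketch, not Vertex AI internals: the same column treated as a float
# versus as a string (and therefore one-hot encoded) gives different models.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
x = rng.integers(1, 10, size=100)            # numeric feature
y = 2.5 * x + rng.normal(0, 1, size=100)     # target

# Treated as a number: a single slope, predictions vary smoothly with x.
numeric_model = LinearRegression().fit(x.reshape(-1, 1).astype(np.float32), y)

# Treated as a string: every distinct value becomes its own one-hot column.
x_onehot = OneHotEncoder().fit_transform(x.astype(str).reshape(-1, 1)).toarray()
categorical_model = LinearRegression().fit(x_onehot, y)

print(numeric_model.coef_)        # one coefficient
print(categorical_model.coef_)    # one coefficient per distinct value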
I'd like to know how the training performance changes over the course of training. Is there any way to access that via the Vertex AI AutoML service?
Unfortunately it is not possible to see the training performance over the course of the training. Vertex AI AutoML only shows whether the training job is running or not.
The only available information is how well the model performed on the test set after training. This can be seen in the "Evaluation" tab in AutoML. You can refer to Vertex AI AutoML Evaluation for further reading.
AutoML provides evaluation metrics that could help you determine the performance of your model. Some of the evaluation metrics are precision, recall, and confidence thresholds. These vary depending on what AutoML product you are using.
For example, if you have an image classification model, the following evaluation metrics are available:
AuPRC: The area under the precision-recall (PR) curve, also referred to as average precision. This value ranges from zero to one, where a higher value indicates a higher-quality model.
Log loss: The cross-entropy between the model predictions and the target values. This ranges from zero to infinity, where a lower value indicates a higher-quality model.
Confidence threshold: A confidence score that determines which predictions to return. A model returns predictions that are at this value or higher. A higher confidence threshold increases precision but lowers recall. Vertex AI returns confidence metrics at different threshold values to show how the threshold affects precision and recall.
Recall: The fraction of predictions with this class that the model correctly predicted. Also called true positive rate.
Precision: The fraction of classification predictions produced by the model that were correct.
Confusion matrix: A confusion matrix shows how often a model correctly predicted a result. For incorrectly predicted results, the matrix shows what the model predicted instead. The confusion matrix helps you understand where your model is "confusing" two results.
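If you want to sanity-check these numbers yourself, most of them can be reproduced offline from exported predictions. A rough scikit-learn sketch (the labels and scores below are made up):

# Sketch: reproducing a few of these metrics offline with scikit-learn,
# given true labels and predicted probabilities for the positive class.
import numpy as np
from sklearn.metrics import (average_precision_score, log_loss,
                             precision_score, recall_score, confusion_matrix)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])

print("AuPRC:   ", average_precision_score(y_true, y_prob))
print("Log loss:", log_loss(y_true, y_prob))

# Apply a confidence threshold to turn scores into hard predictions.
threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))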
I'm using detectron2 with Cascade R-CNN.
I'm predicting 4 different classes, and the training dataset has around 6000 object boxes.
I used ResNet-50 as the backbone and got an accuracy of around 80.
Then I tried a ResNet-101 backbone and it diverged in the first iteration.
1. Does a small dataset with a big model easily diverge at the first iteration?
2. What happens with a small dataset and a big backbone?
3. And in my case, why did augmentation + a ResNet-50 backbone decrease the results?
4. How should I fix this divergence problem? (What is the next step?)
As we can see in the detectron2 model zoo for Cascade R-CNN, their pre-trained models only use ResNet50 as backbone (config 1 and config 2).
So the most likely reason your model diverges is that most of the parameters have to be trained from scratch (which is much more difficult and requires more data), as there is no pre-trained ResNet-101 Cascade R-CNN in the model zoo.
Please also note that using a larger model increases the risk of overfitting.
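If you stay with the ResNet-50 Cascade R-CNN weights and still see divergence, the usual knobs are a lower base learning rate, a longer warmup, and gradient clipping. A rough sketch of the relevant detectron2 config settings (the config path and the exact values are assumptions you would adapt to your setup):

# Sketch of common divergence mitigations in detectron2; adapt the config
# path and the values to your own setup.
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# Start from a model-zoo config that ships pre-trained weights.
cfg.merge_from_file(model_zoo.get_config_file("Misc/cascade_mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("Misc/cascade_mask_rcnn_R_50_FPN_3x.yaml")

cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4   # your 4 classes
cfg.MODEL.MASK_ON = False             # box-only dataset, disable the mask head

# Common fixes for early divergence: smaller base LR, longer warmup, clipping.
cfg.SOLVER.BASE_LR = 0.0025           # much smaller than the zoo default
cfg.SOLVER.WARMUP_ITERS = 2000
cfg.SOLVER.CLIP_GRADIENTS.ENABLED = True
cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE = 1.0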
I am comparing models for the detection of objects for maritime Search and Rescue (SAR) purposes. Of the models I tried, I got the best results for the improved version of YOLOv3 for small object detection and for Faster R-CNN.
For YOLOv3 I got the best mAP@0.5, but for Faster R-CNN all the other metrics (precision, recall, F1 score) were better. Now I am wondering how to read this and which model is really better in this case?
I would like to add that there are only two classes in the dataset: small and large objects. We chose this solution because the objects' distinction between classes is not as important to us as the detection of any human origin object.
However, small objects don't mean small GT bounding boxes. These are objects that actually have a small area - less than 2 square meters (e.g. people, buoys). Large objects are objects with a larger area (boats, ships, canoes, etc.).
Here are the results per category:
And two sample images from the dataset (with YOLOv3 detections):
The mAP for object detection is the average of the AP calculated over all the classes. mAP@0.5 means the mAP calculated at an IoU threshold of 0.5.
The general definition of Average Precision (AP) is the area under the precision-recall curve.
The precision-recall curve is obtained by plotting the model's precision and recall as a function of the model's confidence threshold.
Precision measures how accurate your predictions are, i.e. the percentage of your predictions that are correct. Recall measures how well you find all the positives. The F1 score is the harmonic mean (HM) of precision and recall.
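To make the AP definition concrete, here is a small illustration (made-up detections, one class) of AP as the area under the precision-recall curve; note that this toy version ignores ground-truth objects that were never detected at all, which real detection AP accounts for:

# Toy illustration of AP for one class: each detection has a confidence score
# and a flag saying whether it matched a ground-truth box at IoU >= 0.5.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

matched    = np.array([1, 1, 0, 1, 0, 1, 0])                 # 1 = true positive
confidence = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3])

precision, recall, thresholds = precision_recall_curve(matched, confidence)
ap = average_precision_score(matched, confidence)
print("AP@0.5 for this class:", ap)
# mAP@0.5 is then the mean of the per-class AP values.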
To answer your questions now.
How to read it and which model is really better in this case?
The mAP is a good measure of the sensitivity of the neural network, so a good mAP indicates a model that is stable and consistent across different confidence thresholds. In your case, the Faster R-CNN results indicate that its precision-recall curve is worse than that of YOLOv3, which means that Faster R-CNN either has very bad recall at higher confidence thresholds or very bad precision at lower confidence thresholds compared to YOLOv3 (especially for small objects).
Precision, recall and F1 score are computed for a given confidence threshold. I'm assuming you're running the models with the default confidence threshold (possibly 0.25). So the higher precision, recall and F1 score of Faster R-CNN indicate that, at that confidence threshold, it is better than YOLOv3 on all three metrics.
What metric should be more important?
In general, to analyse which model performs better, I would suggest using a validation set (the data set used to tune hyper-parameters) and a test set (the data set used to assess the performance of a fully-trained model).
Note: FP = False Positive, FN = False Negative
On validation set:
Use mAP to select best performing model (model that is more stable and consistent) out of all the trained weights across iterations/epochs. Use mAP to understand whether model should be trained/tuned further or not.
Check class level AP values to ensure model is stable and good across the classes.
As per your use-case/application, if you're completely tolerant of FNs and highly intolerant of FPs, then use Precision to train/tune the model accordingly.
As per your use-case/application, if you're completely tolerant of FPs and highly intolerant of FNs, then use Recall to train/tune the model accordingly.
On test set:
If you're neutral towards FPs and FNs, then use F1 score to evaluate best performing model.
If FPs are not acceptable to you (without caring much about FNs), pick the model with the higher Precision.
If FNs are not acceptable to you (without caring much about FPs), pick the model with the higher Recall.
Once you decide which metric to use, try out multiple confidence thresholds (for example 0.25, 0.35 and 0.5) for a given model to understand at which confidence threshold the selected metric works in your favour, and also to understand the acceptable trade-off ranges (say you want a Precision of at least 80% with some decent Recall). Once the confidence threshold is decided, use it across the different models to find the best performing model.
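A small sketch of that threshold sweep, assuming you have each detection's confidence plus a flag for whether it matched a ground-truth box (all the numbers below are invented):

# Sketch of the threshold sweep described above: each detection carries a
# confidence score and a flag saying whether it matched a ground-truth box.
import numpy as np

confidence = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.3])
is_tp = np.array([1, 1, 0, 1, 0, 1, 0])   # 1 = matched a GT box at the chosen IoU
num_gt = 6                                # total ground-truth boxes in the set

for t in (0.25, 0.35, 0.5):
    kept = confidence >= t
    tp = int(is_tp[kept].sum())
    fp = int(kept.sum()) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / num_gt
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    print(f"threshold={t}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")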
I am a frequent user of scikit-learn, and I want some insight into the class_weight parameter with SGD.
I was able to trace the code as far as the function call
plain_sgd(coef, intercept, est.loss_function,
penalty_type, alpha, C, est.l1_ratio,
dataset, n_iter, int(est.fit_intercept),
int(est.verbose), int(est.shuffle), est.random_state,
pos_weight, neg_weight,
learning_rate_type, est.eta0,
est.power_t, est.t_, intercept_decay)
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py
After this it goes into sgd_fast, and I am not very good with Cython. Can you give some clarity on these questions?
I have a class imbalance in the dev set, where the positive class has around 15k samples and the negative class around 36k. Will class_weight resolve this problem, or would undersampling be a better idea? I am getting better numbers, but it's hard to explain.
If yes, how does it actually do it? I mean, is it applied to the feature penalization, or is it a weight in the optimization function? How can I explain this to a layman?
class_weight can indeed help increase the ROC AUC or F1-score of a classification model trained on imbalanced data.
You can try class_weight="auto" to select weights that are inversely proportional to class frequencies. You can also pass your own weights as a Python dictionary with class labels as keys and weights as values.
Tuning the weights can be achieved via grid search with cross-validation.
Internally this is done by deriving sample_weight from the class_weight (depending on the class label of each sample). Sample weights are then used to scale the contribution of individual samples to the loss function used to train the linear classification model with Stochastic Gradient Descent.
The feature penalization is controlled independently via the penalty and alpha hyperparameters. sample_weight / class_weight have no impact on it.
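As a rough sketch of the sample_weight mechanism described above (note that recent scikit-learn versions call the inverse-frequency option class_weight="balanced" rather than "auto"), the following two fits should behave essentially the same; the data here is synthetic:

# Sketch: class_weight on SGDClassifier versus passing equivalent per-sample
# weights yourself; synthetic data with roughly the question's imbalance ratio.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
y = np.array([1] * 1500 + [0] * 3600)
X = rng.normal(size=(y.size, 5)) + y[:, None]   # weakly separable classes

# Option 1: let the estimator derive the weights from the class labels.
clf_cw = SGDClassifier(class_weight="balanced", random_state=0).fit(X, y)

# Option 2: derive the same per-sample weights explicitly and pass them to fit().
sw = compute_sample_weight(class_weight="balanced", y=y)
clf_sw = SGDClassifier(random_state=0).fit(X, y, sample_weight=sw)

print(clf_cw.coef_)
print(clf_sw.coef_)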