How can GCP AutoML handle overfitting? - google-cloud-platform

I have created a Vertex AI AutoML image classification model. How can I assess it for overfitting? I assume I should be able to compare training vs validation accuracy but these do not seem to be available.
And if it is overfitting, can I tweak regularization parameters? Is it already doing cross-validation? Is there anything else that can be done (more data, early stopping, dropout), and how would these be applied?

Deploy the model to an endpoint and test it by uploading sample images to the endpoint. If it is overfitting, you can see the stats in the analysis. You can also increase the number of training samples and retrain your model to get a better result.
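If it helps, here is a minimal sketch of that workflow with the Vertex AI Python SDK. The project, region, model ID, endpoint display name, and image path are all placeholders, and the instance format assumes an AutoML image classification model; treat it as an outline rather than a definitive recipe.

```python
# Minimal sketch, assuming a Vertex AI AutoML image classification model.
# Project, region, model ID, and image path below are placeholders.
import base64
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")
endpoint = model.deploy(deployed_model_display_name="image-clf-endpoint")

# Send a sample image and inspect the per-class confidence scores.
with open("sample.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

prediction = endpoint.predict(instances=[{"content": encoded}])
print(prediction.predictions)
```

Comparing predictions on held-out images against the evaluation metrics shown in the console gives a rough sense of how well the model generalizes.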

Related

How can I resolve imbalanced datasets for AutoML classification on GCP?

I am planning to use AutoML for the classification of my tabular data. But there is a moderate imbalance in the target variable.
When running my own model, I would either upsample, downsample or build synthetic samples to resolve the imbalance.
Is there such a possibility on AutoML on GCP? If not, how can one resolve such cases?
AutoML Tabular Data Classification
AutoML Tables is a supervised learning service, which means you train a machine learning model with example data. In general, the more training examples you have, the better your outcome. The amount of example data required also scales with the complexity of the problem you're trying to solve. See the guide on how much data to use.
So with regards to the imbalance in your dataset, the only way to resolve it is to adjust the data yourself (add or remove samples) so that you achieve optimal results; a sketch of one way to do this before import is shown below.
For more information you can refer to AutoML Tables guide.
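As a rough illustration of rebalancing a CSV before importing it into AutoML Tables (the file name and the "target" column here are made-up placeholders), you could upsample the minority class with pandas:

```python
# Illustrative only: rebalance a training CSV before import.
# "training_data.csv" and the "target" column are placeholder names.
import pandas as pd

df = pd.read_csv("training_data.csv")
majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Upsample the minority class with replacement to match the majority class size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

balanced.to_csv("training_data_balanced.csv", index=False)
```

Downsampling the majority class or generating synthetic samples would follow the same pattern: adjust the CSV first, then import the balanced file.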

Google Video Intelligence AutoML inference

I've trained a model using Google's AutoML Video Intelligence and am now trying to make predictions on a video of just 2 seconds using the Node.js client's batch prediction, but the inference time is nowhere near production grade: it takes almost a minute to make a prediction on just 2 seconds of video. Am I missing some setting here, or is that just the way it is right now?
Some findings on this issue:
Try to follow the best practices and see how to improve model performance.
A similar latency issue reported in the AutoML Google Group suggests that if you're putting the base64-encoded bytes directly in 'inputContent', you might want to consider uploading the input video file to Google Cloud Storage and using 'inputUri' instead of 'inputContent'. This will reduce the request payload size and the upload latency (see the sketch after this answer).
This might also be caused by a quota limit; you can check the logs (filtered by job ID) for quota errors.
Finally, you can open an issue at the Public Issue Tracker with a sample video and command for issue reproduction and further investigation.
Good luck!
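To illustrate the inputUri-vs-inputContent point, here is a rough sketch using the base Video Intelligence Python client (the bucket and file names are placeholders, and your AutoML Video request will look somewhat different); the idea is the same either way: referencing the video in Cloud Storage keeps the request payload small compared with embedding base64-encoded bytes.

```python
# Sketch only: reference the video in Cloud Storage instead of embedding bytes.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "input_uri": "gs://my-bucket/clip.mp4",  # placeholder bucket/object
    }
)
result = operation.result(timeout=300)
print(result.annotation_results[0])
```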

Google Cloud Platform

I am building a classification model using AutoML and I have some basic usage questions about the GCP.
1 - Data privacy question; if we save behavior data to train our model in BigQuery, does Google have access to that data? Could Google ever use that data to learn more about behavior of individuals we collected data from?
2 - Since training costs are charged by the hour, I would like to understand the relationship between data and training time. Does the time increase linearly with the size of the training data set? For example, we trained a classification using 1.7MB of data and it took 3 hrs. So, would training a model with 17MB of data take 30 hours?
3 - A batch prediction costs 1.16 USD per hour. However, our data is in a csv and it seems that we cannot upload a csv to do a batch prediction. So, we will try using the API. Therefore I have two questions: A) can we do a batch upload using the API and B) what are the associated costs?
4 - What exactly is an online prediction?
5 - When using the cost calculator (for machine learning), what is a node hour?
1- As mentioned in the Data Usage FAQ, Google does not use any of your content for any purpose except to provide you with the Cloud AutoML service.
2- The time required to train your model depends on the size and complexity of your training data; for a detailed explanation, take a look at the Vision documentation, for example.
3- You need to upload your CSV file to Google Cloud Storage, and then you can reference it through the API or any of the available client libraries; see Natural Language batch prediction, for example (a sketch follows this list). For costs, check the documentation for the product you are using: AutoML pricing depends on which feature you are using: Vision, Natural Language, Translation, or Video Intelligence.
4- After you have created (trained) a model, you can deploy the model and request online (single, low-latency and real-time) predictions. Online predictions accept one row of data and provide a predicted result based on your model for that data. You use online predictions when you need a prediction as input for your business logic flow.
5- You can think of a node as a single virtual machine whose resources are used for computing purposes. Machine types differ depending on the product and the purpose for which they are used. For example, AutoML Vision image classification model training costs $3.15 per node hour, where each node is equivalent to an n1-standard-8 machine with an attached NVIDIA Tesla V100 GPU. A node hour is therefore one hour's use of such a node's resources; training on one node for 4 hours, for instance, consumes 4 node hours and costs 4 × $3.15 = $12.60.
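Regarding point 3, here is a rough sketch of a batch prediction call with the AutoML Python client once the CSV is in Cloud Storage. The project, model ID, and bucket paths are placeholders, and the exact client module depends on which AutoML product you are using.

```python
# Sketch only: batch prediction with a CSV that has been uploaded to GCS.
# The project, model ID, and bucket paths below are placeholders.
from google.cloud import automl_v1beta1 as automl

prediction_client = automl.PredictionServiceClient()
model_full_id = "projects/my-project/locations/us-central1/models/TBL1234567890"

response = prediction_client.batch_predict(
    name=model_full_id,
    input_config={"gcs_source": {"input_uris": ["gs://my-bucket/batch_input.csv"]}},
    output_config={"gcs_destination": {"output_uri_prefix": "gs://my-bucket/results/"}},
)
print("Batch prediction finished:", response.result())  # blocks until the job completes
```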

Is it possible to undeploy a Google Cloud AutoML Vision Image Classification model when it's not being used?

I know how to undeploy/redeploy an Object Detection model - but my overall project uses both Object Detection and Image Classification. What's the best way to save money when we're not using both?
It's easy to remove the deployment of the Object Detection model, and then re-deploy it when we have data to process. Can the same be done for the Image Classification models?
In both Object Detection and Image Classification you pay based on resource usage.
Regarding your question, it's important to take into account that you pay per deployed node, because the model's associated resources remain allocated in order to prevent delays in your predictions. That's why, to avoid incurring charges when you are not using the service, you should undeploy the models. You can do this for both Object Detection and Image Classification.
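For reference, a minimal sketch of undeploying and later redeploying a model with the AutoML Python client (project and model IDs are placeholders); the same calls apply to both Object Detection and Image Classification models:

```python
# Sketch only: project and model IDs are placeholders.
from google.cloud import automl

client = automl.AutoMlClient()
model_full_id = client.model_path("my-project", "us-central1", "ICN1234567890")

# Undeploy to stop paying for the node(s) held by the deployed model.
undeploy_op = client.undeploy_model(name=model_full_id)
print("Undeployed:", undeploy_op.result())

# Redeploy later, before requesting predictions again.
deploy_op = client.deploy_model(name=model_full_id)
print("Deployed:", deploy_op.result())
```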

AWS Sagemaker - using cross validation instead of dedicated validation set?

When I train my model locally I use a 20% test set and then cross-validation. SageMaker seems to need a dedicated validation set (at least in the tutorials I've followed). Currently I have 20% test and 10% validation, leaving 70% to train, so I lose 10% of my training data compared to when I train locally, and there is some performance loss as a result.
I could just take my locally trained models and overwrite the SageMaker models stored in S3, but that seems like a bit of a workaround. Is there a way to use SageMaker without having to have a dedicated validation set?
Thanks
SageMaker seems to allow a single training set, whereas in cross validation you iterate over, for example, 5 different training sets, each one validated on a different hold-out set. So it seems that the SageMaker training service is not well suited for cross validation. Of course, cross validation is usually most useful with small datasets (more precisely, when you want a lower-variance estimate of performance), so in those cases you can set the training infrastructure to local mode (so it doesn't take a lot of time) and then iterate manually to achieve cross validation functionality; a sketch of that loop is below. But it's not something that works out of the box.
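As a rough sketch of that manual approach (assuming Docker is available for SageMaker local mode, and using placeholder role, region, container version, and file names), you could loop over folds yourself:

```python
# Sketch only: manual k-fold cross validation around SageMaker local mode.
# The IAM role, region, container version, and file layout are placeholders.
import os
import pandas as pd
from sklearn.model_selection import KFold
import sagemaker
from sagemaker.estimator import Estimator

df = pd.read_csv("train.csv")
role = "arn:aws:iam::123456789012:role/SageMakerRole"
image_uri = sagemaker.image_uris.retrieve("xgboost", "us-east-1", "1.5-1")

for fold, (train_idx, val_idx) in enumerate(
    KFold(n_splits=5, shuffle=True, random_state=0).split(df)
):
    os.makedirs(f"folds/{fold}/train", exist_ok=True)
    os.makedirs(f"folds/{fold}/validation", exist_ok=True)
    df.iloc[train_idx].to_csv(f"folds/{fold}/train/data.csv", index=False, header=False)
    df.iloc[val_idx].to_csv(f"folds/{fold}/validation/data.csv", index=False, header=False)

    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type="local",  # run the per-fold jobs locally in Docker
    )
    estimator.set_hyperparameters(objective="binary:logistic", num_round=100)
    estimator.fit({
        "train": f"file://folds/{fold}/train",
        "validation": f"file://folds/{fold}/validation",
    })
    # Read each fold's validation metric from the job output and average them.
```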
Sorry, can you please elaborate on which tutorials you are referring to when you say "SageMaker seems like it needs a dedicated validation set (at least in the tutorials I've followed)"?
SageMaker training exposes the ability to separate datasets into "channels" so you can separate your dataset in whichever way you please.
See here for more info: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-trainingdata
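For example, a hedged sketch of passing separate channels to `fit` (the bucket paths, IAM role, region, and container version are placeholders):

```python
# Sketch only: each dictionary key becomes a channel mounted at
# /opt/ml/input/data/<channel name> inside the training container.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder
image_uri = sagemaker.image_uris.retrieve("xgboost", "us-east-1", "1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
estimator.fit({
    "train": TrainingInput("s3://my-bucket/data/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/data/validation/", content_type="text/csv"),
})
```

You can name and populate the channels however you like, so you are not forced into any particular train/validation split by the service itself.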