How can I resolve imbalanced datasets for AutoML classification on GCP? - google-cloud-platform

I am planning to use AutoML for the classification of my tabular data. But there is a moderate imbalance in the target variable.
When running my own model, I would either upsample, downsample or build synthetic samples to resolve the imbalance.
Is there such an option with AutoML on GCP? If not, how can one handle such cases?
Auto ML Tabular Data Classification

AutoML Tables is a supervised learning service. This means that you train a machine learning model with example data. In general, the more training examples you have, the better your outcome. The amount of example data required also scales with the complexity of the problem you're trying to solve. See the guide on how much data to use.
So, with regards to the imbalance in your dataset, the only way to resolve it is to adjust the data yourself (add or remove samples) before importing it, so that you achieve optimal results; see the sketch below for one way to do that.
For more information you can refer to AutoML Tables guide.
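For example, if you choose to upsample the minority class before importing your CSV into AutoML Tables, a minimal sketch with pandas/scikit-learn (the file and column names here are placeholders) could look like this:

```python
# Hypothetical sketch: naive random oversampling of the minority class
# before exporting the CSV that gets imported into AutoML Tables.
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("training_data.csv")        # assumed local export of your table
majority = df[df["label"] == "negative"]     # "label", "negative", "positive" are placeholders
minority = df[df["label"] == "positive"]

minority_upsampled = resample(
    minority,
    replace=True,                # sample with replacement
    n_samples=len(majority),     # match the majority class size
    random_state=42,
)

balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
balanced.to_csv("training_data_balanced.csv", index=False)
```

Downsampling is the mirror-image operation: resample the majority class with `replace=False` and `n_samples=len(minority)`.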

Related

Stepwise regression in Google BigQuery

How can I perform stepwise regression in GCP BigQuery ML? The purpose is to identify which variables are significant and should be taken into consideration when creating statistical models.
Could not find any documentation on GCP.
You can get a global explanation of a BQML model with the ML.GLOBAL_EXPLAIN function, which is documented here.
For each feature, you get an attribution value that explains the influence of that feature on the model's inference/prediction.
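For illustration, a minimal sketch that runs ML.GLOBAL_EXPLAIN through the BigQuery Python client (the project, dataset, and model names are placeholders; depending on the model type you may need to have trained it with ENABLE_GLOBAL_EXPLAIN=TRUE):

```python
# Hypothetical sketch: fetch per-feature global attributions for a BQML model.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

query = """
SELECT *
FROM ML.GLOBAL_EXPLAIN(MODEL `my_project.my_dataset.my_model`)
ORDER BY attribution DESC
"""

for row in client.query(query).result():
    print(row["feature"], row["attribution"])
```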

How can GCP Automl handle overfitting?

I have created a Vertex AI AutoML image classification model. How can I assess it for overfitting? I assume I should be able to compare training vs validation accuracy but these do not seem to be available.
And if it is overfitting, can I tweak regularization parameters? Is it already doing cross-validation? Is there anything else that can be done (more data, early stopping, dropout, i.e. how can these be applied)?
Deploy it to an endpoint and test the results with sample images by uploading them to the endpoint. If it's overfitting, you can see that in the analysis stats. You can increase the number of training samples and retrain your model to get a better result.
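A minimal sketch of that deploy-and-test flow with the Vertex AI Python SDK (project, region, model ID, and the image path are placeholders; note that a deployed endpoint incurs charges until you undeploy it):

```python
# Hypothetical sketch: deploy an AutoML image classification model to an
# endpoint and request a prediction for one sample image.
import base64
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model("projects/my-project/locations/us-central1/models/1234567890")
endpoint = model.deploy()  # AutoML image models use automatic deployment resources

with open("sample.jpg", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

prediction = endpoint.predict(
    instances=[{"content": content}],
    parameters={"confidenceThreshold": 0.5, "maxPredictions": 5},
)
print(prediction.predictions)
```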

Google Cloud Platform

I am building a classification model using AutoML and I have some basic usage questions about the GCP.
1 - Data privacy question; if we save behavior data to train our model in BigQuery, does Google have access to that data? Could Google ever use that data to learn more about behavior of individuals we collected data from?
2 - Since training costs are charged by the hour, I would like to understand the relationship between data and training time. Does the time increase linearly with the size of the training data set? For example, we trained a classification using 1.7MB of data and it took 3 hrs. So, would training a model with 17MB of data take 30 hours?
3 - A batch prediction costs 1.16 USD per hour. However, our data is in a csv and it seems that we cannot upload a csv to do a batch prediction. So, we will try using the API. Therefore I have two questions: A) can we do a batch upload using the API and B) what are the associated costs?
4 - What exactly is an online prediction?
5 - When using the cost calculator (for machine learning), what is a node hour?
1- As is mentioned in the Data Usage FAQ, Google does not use any of your content for any purpose except to provide you with the Cloud AutoML service.
2- The time required to train your model depends on the size and complexity of your training data; for a detailed explanation, take a look at the Vision documentation, for example.
3- You need to upload your CSV file to Google Cloud Storage, and then you can use it with the API or any of the available client libraries; see Natural Language batch prediction, for example (a minimal batch prediction sketch follows this list). For costs, check the documentation for the desired product. AutoML pricing depends on what feature you are using: Vision, Natural Language, Translation, Video Intelligence.
4- After you have created (trained) a model, you can deploy the model and request online (single, low-latency and real-time) predictions. Online predictions accept one row of data and provide a predicted result based on your model for that data. You use online predictions when you need a prediction as input for your business logic flow.
5- You can think of a node as a single virtual machine whose resources are used for computing. Machine types differ depending on the product and the purpose they are used for. For example, in image classification, AutoML Vision Image Classification model training costs $3.15 per node hour, and each node is equivalent to an n1-standard-8 machine with an attached NVIDIA Tesla V100 GPU. A node hour, then, is one hour of use of such a node's resources.
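To illustrate answer 3 above, here is a minimal batch prediction sketch using the AutoML Python client, as in the linked Natural Language example (project, region, model ID, and bucket paths are placeholders):

```python
# Hypothetical sketch: batch prediction against an AutoML model, reading a CSV
# that has already been uploaded to Cloud Storage.
from google.cloud import automl_v1

client = automl_v1.PredictionServiceClient()
model_name = client.model_path("my-project", "us-central1", "TCN1234567890")

input_config = automl_v1.BatchPredictInputConfig(
    gcs_source=automl_v1.GcsSource(input_uris=["gs://my-bucket/batch_input.csv"])
)
output_config = automl_v1.BatchPredictOutputConfig(
    gcs_destination=automl_v1.GcsDestination(output_uri_prefix="gs://my-bucket/batch_output/")
)

operation = client.batch_predict(
    name=model_name,
    input_config=input_config,
    output_config=output_config,
)
print("Waiting for the batch prediction to complete...")
result = operation.result(timeout=3600)  # long-running operation; results land in the output bucket
```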

Is there a way to access and work with data stored in GCP bucket directly?

I have to do a deep learning project at my university, where I need to work with a medical image database. This database is stored in a Google Cloud Platform bucket.
However, the database's size is over 4 TB, so I can't afford to download the data using gsutil. I can't use a Google Colab notebook either, since its disk storage size is 350 GB.
Is there any way I can access the data and use it to teach my network?
I don't think you're on the right track.
When you build your model, you only need a representative subset of your dataset to validate your layers and the expected behavior.
Then, when everything is done and packaged, you run your training job on a dedicated VM (like a Deep Learning VM). This process can be handled automatically by AI Platform. You can also set up a hyperparameter server and parallelize your training.
In the training phase, you often work with batches: you load only a subset of your dataset, you shuffle it, and you perform several training steps on this subset (computing RMSE/cross-entropy, evaluating, running gradient optimization).
Because you consume your full dataset in batches, you don't need to have the 4 TB on your VM at the same time. Your training loop does it for you (download, train, evaluate, delete); see the sketch below.
As I said before, because you work on subsets, you can also parallelize your training across several VMs to reduce your training duration.
I recommend you review your training loop. If you give me the framework name/version you are working with, I could help you with tutorials and examples.
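For instance, assuming TensorFlow and data sharded as TFRecord files (both are assumptions, since no framework was mentioned), the training input pipeline can stream batches straight from the bucket instead of downloading the 4 TB up front:

```python
# Hypothetical sketch: stream shuffled batches directly from a GCS bucket
# with tf.data, so the full 4 TB never has to sit on the VM's disk.
import tensorflow as tf

# TensorFlow reads gs:// paths natively through its GCS filesystem support.
files = tf.io.gfile.glob("gs://my-medical-bucket/tfrecords/train-*.tfrecord")

def parse_example(serialized):
    # The feature spec is a placeholder; adapt it to how the images were encoded.
    features = tf.io.parse_single_example(
        serialized,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)},
    )
    image = tf.io.decode_jpeg(features["image"], channels=3)
    return tf.image.resize(image, [224, 224]) / 255.0, features["label"]

dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .shuffle(buffer_size=2048)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# model.fit(dataset, epochs=...)  # only one batch at a time lives in memory
```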

AWS Sagemaker - using cross validation instead of dedicated validation set?

When I train my model locally I use a 20% test set and then cross-validation. SageMaker seems like it needs a dedicated validation set (at least in the tutorials I've followed). Currently I have 20% test, 10% validation, leaving 70% to train - so I lose 10% of my training data compared to when I train locally, and there is some performance loss as a result of this.
I could just take my locally trained models and overwrite the sagemaker models stored in s3, but that seems like a bit of a work around. Is there a way to use Sagemaker without having to have a dedicated validation set?
Thanks
SageMaker's training service seems to expect a single training set, while in cross-validation you iterate over, for example, 5 different training sets, each one validated on a different hold-out set. So it seems that the SageMaker training service is not well suited for cross-validation. Of course, cross-validation is usually useful with small (to be accurate, low-variance) data, so in those cases you can set the training infrastructure to local (so it doesn't take a lot of time) and then iterate manually to achieve cross-validation functionality. But it's not something out of the box.
Sorry, can you please elaborate which tutorials you are referring to when you say "SageMaker seems like it needs a dedicated validation set (at least in the tutorials I've followed)."
SageMaker training exposes the ability to separate datasets into "channels" so you can separate your dataset in whichever way you please.
See here for more info: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-trainingdata
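For reference, the channels are just the dictionary keys you pass to fit(). A minimal sketch of a manual k-fold loop on top of them, along the lines of the first answer (the role ARN, bucket, built-in XGBoost image, and CSV layout are all assumptions):

```python
# Hypothetical sketch: manual k-fold cross-validation on top of SageMaker
# training jobs, using the "train"/"validation" channels mentioned above.
import pandas as pd
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sklearn.model_selection import KFold

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"   # placeholder role
bucket = session.default_bucket()
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

df = pd.read_csv("full_training_data.csv")  # for built-in XGBoost CSV, label goes in the first column

for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=True, random_state=0).split(df)):
    df.iloc[train_idx].to_csv("train.csv", index=False, header=False)
    df.iloc[val_idx].to_csv("validation.csv", index=False, header=False)

    train_uri = session.upload_data("train.csv", bucket=bucket, key_prefix=f"cv/fold{fold}")
    val_uri = session.upload_data("validation.csv", bucket=bucket, key_prefix=f"cv/fold{fold}")

    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type="ml.m5.large",        # or "local" to run with SageMaker local mode
        output_path=f"s3://{bucket}/cv/fold{fold}/output",
        sagemaker_session=session,
        hyperparameters={"objective": "binary:logistic", "num_round": 100},
    )
    # Each channel becomes a directory inside the training container.
    estimator.fit({
        "train": TrainingInput(train_uri, content_type="text/csv"),
        "validation": TrainingInput(val_uri, content_type="text/csv"),
    })
```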