AWS Sagemaker - using cross validation instead of dedicated validation set? - amazon-web-services

When I train my model locally I use a 20% test set and then cross validation. SageMaker seems to need a dedicated validation set (at least in the tutorials I've followed). Currently I have 20% test and 10% validation, leaving 70% to train, so I lose 10% of my training data compared to when I train locally, and there is some performance loss as a result.
I could just take my locally trained models and overwrite the SageMaker models stored in S3, but that seems like a bit of a workaround. Is there a way to use SageMaker without having a dedicated validation set?
Thanks

SageMaker seems to allow a single training set per job, whereas in cross validation you iterate over, for example, 5 different training sets, each one validated on a different hold-out set. So it seems that the SageMaker training service is not well suited to cross validation. Of course, cross validation is usually most useful with small (or, to be accurate, low-variance) data, so in those cases you can set the training infrastructure to local (so it doesn't take a lot of time) and then iterate manually to achieve cross validation functionality. But it's not something available out of the box.
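The manual iteration mentioned above can be sketched in plain Python. This is only an illustration of k-fold splitting, not a SageMaker feature: the names k_fold_indices, cross_validate and fit_and_score are hypothetical, and the data is assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


def cross_validate(samples, k, fit_and_score):
    """Call fit_and_score(train, val) once per fold and return the mean score.

    fit_and_score is a hypothetical callback that trains on `train`
    and returns a numeric score measured on `val`.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(samples), k):
        train = [samples[i] for i in train_idx]
        val = [samples[i] for i in val_idx]
        scores.append(fit_and_score(train, val))
    return sum(scores) / len(scores)
```

Each fold could equally be run in local mode or submitted as its own training job; the splitting logic stays the same.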

Sorry, can you please elaborate on which tutorials you are referring to when you say "SageMaker seems like it needs a dedicated validation set (at least in the tutorials I've followed)"?
SageMaker training exposes the ability to separate datasets into "channels" so you can separate your dataset in whichever way you please.
See here for more info: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-trainingdata
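As a rough sketch of what a channel looks like, assuming you bring your own container or script: the helper below builds one entry of the InputDataConfig list passed to CreateTrainingJob, and the second helper returns the directory where the linked documentation says the container finds that channel. The helper names and the bucket URI are made up for illustration.

```python
def s3_channel(name, s3_uri):
    """Build one InputDataConfig entry for a channel backed by an S3 prefix."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def channel_dir(name, base="/opt/ml/input/data"):
    """Directory inside the training container where a channel's files appear."""
    return f"{base}/{name}"


# A single "train" channel and no "validation" channel; the training script
# is then free to do its own cross validation on the files in this directory.
# The bucket name is a placeholder.
input_data_config = [s3_channel("train", "s3://my-example-bucket/data/train/")]
```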

Related

Best way to run 1000s of training jobs on sagemaker

I have thousands of training jobs that I want to run on SageMaker. Basically, I have a list of hyperparameters and I want to train the model for all of those hyperparameters in parallel (not standard hyperparameter tuning, where we just want to optimize the hyperparameters; here we want to train for all of them). I have searched the docs quite extensively, and it surprises me that I couldn't find any info about this, even though it seems like pretty basic functionality.
For example, let's say I have 10,000 training jobs, and my quota is 20 instances, what is the best way to run these jobs utilizing all my available instances? In particular,
Is there a "queue manager" functionality that takes the list of hyperparameters and runs the training jobs in batches of 20 until they are all done (even better if it could keep track of failed/completed jobs)?
Is it best practice to run a single training job per instance? If that's the case, do I need to ask for a much higher quota on the number of instances?
If this functionality does not exist in sagemaker, is it worth using EC2 instead since it's a bit cheaper?
Your question is very broad and the best way forward would depend on other details of your use-case, so we will have to make some assumptions.
[Queue manager]
SageMaker does not have a queue manager. If at the end you decide you need a queue manager, I would suggest looking towards AWS Batch.
[Single vs multiple training jobs]
Since you need to run tens of thousands of jobs, I assume you are training fairly lightweight models, so to save time you would be better off reusing instances for multiple training jobs. (Otherwise, with a limit of 20 instances, 10,000 jobs need 500 rounds of training; with a start-up time of about 3 minutes per round, depending on instance type, that is 25 hours of waiting alone. Depending on the complexity of each individual model, these 25 hours might be significant or totally acceptable.)
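The estimate above is simple arithmetic, spelled out here so that you can plug in your own numbers. The function name is hypothetical and the 3-minute start-up time is the rough figure assumed in the answer.

```python
import math


def startup_overhead(n_jobs, n_instances, startup_minutes):
    """Return (rounds, hours spent waiting for instances to start)."""
    # Each round runs one job per instance, so the last round may be partial.
    rounds = math.ceil(n_jobs / n_instances)
    return rounds, rounds * startup_minutes / 60


rounds, hours = startup_overhead(n_jobs=10_000, n_instances=20, startup_minutes=3)
print(rounds, hours)  # 500 25.0
```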
[Instance limit increase]
You can always ask for a limit increase, but a request to go from a limit of 20 to 10k at once is unlikely to be accepted by the AWS support team, unless you are part of an organisation with a track record of usage on AWS, in which case it might be fine.
[One possible option] (Assuming multiple lightweight models)
You could create a single training job whose instance count equals the number of instances available to you.
Inside the training job, your code can run a for loop and perform all the individual trainings you need.
In this case, each instance needs to know which instance it is so that you can split the hyperparameter sets among them. SageMaker writes this information to the file /opt/ml/input/config/resourceconfig.json, so using that you can easily have each instance run a subset of the required trainings.
Another thing to think about is whether you need to save the generated models (which you probably do). You can save everything in the output model directory, the standard SageMaker approach, but this would bundle all models into a single model.tar.gz file.
If you don't want this and prefer to have each model saved individually, I'd suggest using the checkpoints directory, which syncs anything written there to your S3 location.
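A minimal sketch of the per-instance split described above. It relies on the "current_host" and "hosts" keys of resourceconfig.json; the function names and the round-robin assignment are illustrative choices rather than anything SageMaker prescribes.

```python
import json


def load_resource_config(path="/opt/ml/input/config/resourceconfig.json"):
    """Read the file SageMaker writes inside every training container."""
    with open(path) as f:
        return json.load(f)


def my_share(items, resource_config):
    """Return the items this instance should train, assigned round-robin.

    Every instance sorts the host list the same way, so the shares are
    disjoint and together cover all items.
    """
    hosts = sorted(resource_config["hosts"])
    rank = hosts.index(resource_config["current_host"])
    return items[rank::len(hosts)]
```

Inside the training script, something like `for params in my_share(all_params, load_resource_config()): ...` would then loop over this instance's share only, where all_params is your own list of hyperparameter sets.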

Training multiple model in AWS Sagemaker

Can I train multiple models in AWS SageMaker by evaluating the models in the train.py script, and how do I get back multiple metrics from multiple models?
Any links, docs or videos would be useful.
Yes, what you write in a sagemaker training script (assuming you use something that lets you pass custom code like your own container or a framework container) is flexible, and does not need to be just one model or even ML. You can definitely write multiple model trainings in a single container, and pull all related metrics using SageMaker metric capture via regex, see an example regex here with the Sklearn random forest.
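To illustrate the regex-based metric capture mentioned above, here is a sketch in which the training script prints one line per model and a matching list of metric definitions (dicts with "Name" and "Regex" keys) is generated. The log line format, the helper names and the model names are assumptions; model names are assumed to contain only letters, digits and underscores.

```python
import re


def log_metric(model_name, metric, value):
    """Print one metric line to the training log and return it."""
    line = f"model={model_name} {metric}={value:.4f}"
    print(line)
    return line


def metric_definitions(model_names, metric="accuracy"):
    """Build one {"Name", "Regex"} entry per model for the job's metric capture."""
    return [
        {
            "Name": f"{name}:{metric}",
            # The single capture group is the numeric value to record.
            "Regex": f"model={name} {metric}=([0-9.]+)",
        }
        for name in model_names
    ]


definitions = metric_definitions(["rf", "gbm"])
line = log_metric("gbm", "accuracy", 0.91234)
# Check locally that the regex picks the value out of the log line.
print(re.search(definitions[1]["Regex"], line).group(1))
```

Such a list is what you would hand to the metric definitions of the training job, so each model's metric shows up under its own name.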
That being said, it is often a better idea to separate things and have one model per SageMaker job, for the following reasons, among others:
It allows you to separate model metadata and metrics and compare them easily with the SageMaker metadata service.
It allows you to specialize hardware to each model and get better economics. Each model has its own sweet spot when it comes to CPU, GPU and RAM.
It allows you to use the exact same container for single training but also for Bayesian hyperparameter search, a method that can be both faster and cheaper than regular grid search.

Is there a way to access and work with data stored in GCP bucket directly?

I have to do a deep learning project at my university, where I need to work with a medical image database. This database is stored in a Google Cloud Platform bucket.
However, the database's size is over 4 TB, so I can't afford to download the data using gsutil. I can't use a Google Colab notebook either, since its disk storage size is 350 GB.
Is there any way I can access the data and use it to teach my network?
I think you're not on the right track.
When you build your model, you only need to have a representative subset of your dataset to validate your layers and the expected behavior.
Then, when all is done and packaged, you run your training job on a dedicated VM (like a Deep Learning VM). This process can be handled automatically by AI Platform. You can also set up hyperparameter tuning and parallelize your training.
In the training phase, you usually work with batches: you load only a subset of your dataset, shuffle it, and perform several training steps on this subset (computing the loss, such as RMSE or cross-entropy, evaluating, and running gradient optimization).
Because you use a subset of your full dataset in each batch, you don't need to have the whole 4 TB on your VM at the same time. Your training loop does it for you (download, train, evaluate, delete).
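The download, train, delete cycle can be sketched as below. This is framework-agnostic on purpose: fetch and train_step are hypothetical callbacks standing in for "download one object from the bucket" and "run one optimization step", and the function names are made up for illustration.

```python
import random


def batches(names, batch_size):
    """Yield consecutive slices of at most batch_size object names."""
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]


def train_one_epoch(object_names, batch_size, fetch, train_step, seed=0):
    """Stream the dataset through training, one batch in memory at a time."""
    names = list(object_names)
    random.Random(seed).shuffle(names)  # shuffle names, not the data itself
    steps = 0
    for batch_names in batches(names, batch_size):
        batch = [fetch(name) for name in batch_names]  # download this batch only
        train_step(batch)                              # forward, loss, gradient update
        del batch                                      # release it before the next download
        steps += 1
    return steps
```

Only batch_size objects are held at once, so the full dataset never needs to fit on the VM's disk or in memory.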
As I said before, because you use a subset, you can also parallelize your training across several VMs to reduce your training duration.
I recommend that you review your training loop. If you give me the name and version of the framework you work with, I could help you with tutorials and examples.

aws sagemaker for detecting text in an image

I am aware that it is better to use aws Rekognition for this. However, it does not seem to work well when I tried it out with the images I have (which are sort of like small containers with labels on them). The text comes out misspelled and fragmented.
I am new to ML and SageMaker. From what I have seen, the use cases seem to be for prediction and image classification. I could not find one on training a model for detecting text in an image. Is it possible to do it with SageMaker? I would appreciate it if someone pointed me in the right direction.
The different services will all provide different levels of abstraction for Optical Character Recognition (OCR) depending on what parts of the pipeline you are most comfortable with working with, and what you prefer to have abstracted.
Here are a few options:
Rekognition will provide out of the box OCR with the DetectText feature. However, it seems you will need to perform some sort of pre-processing on your images in your current case in order to get better results. This can be done through any method of your choice (Lambda, EC2, etc).
SageMaker is a tool that will enable you to easily train and deploy your own models (of any type). You have two primary options with SageMaker:
Do-it-yourself option: If you're looking to go the route of labeling your own data, gathering a sizable training set, and training your own OCR model, this is possible by training and deploying your own model via SageMaker.
Existing OCR algorithm: There are many algorithms out there that all have different potential tradeoffs for OCR. One example would be Tesseract. Using this, you can more closely couple your pre-processing step to the text detection.
Amazon Textract (In preview) is a purpose-built dedicated OCR service that may offer better performance depending on what your images look like and the settings you choose.
I would personally recommend looking into pre-processing for OCR to see if it improves Rekognition accuracy before moving onto the other options. Even if it doesn't improve Rekognition's accuracy, it will still be valuable for most of the other options!

How to make my datalab machine learning run faster

I have some data: 3.2 million entries in a CSV file. I'm trying to use a CNN estimator in TensorFlow to train the model, but it's very slow. Every time I run the script it gets stuck, and the web page (localhost) just refuses to respond anymore. Any recommendations? (I've tried with 22 CPUs and I can't increase it any further.)
Can I just run it and use a thread, like the command line python xxx.py & to keep the process going? And then go back to check after some time?
Google offers serverless machine learning with TensorFlow for precisely this reason. It is called Cloud ML Engine. Your workflow would basically look like this:
Develop the program to train your neural network on a small dataset that can fit in memory (iron out the bugs, make sure it works the way you want)
Upload your full data set to the cloud (Google Cloud Storage or BigQuery or &c.) (documentation reference: training steps)
Submit a package containing your training program to Cloud ML Engine (this will point to the location of your full data set in the cloud) (documentation reference: packaging the trainer)
Start a training job in the cloud; this is serverless, so it will take care of scaling to as many machines as necessary, without you having to deal with setting up a cluster, &c. (documentation reference: submitting training jobs).
You can use this workflow to train neural networks on massive data sets - particularly useful for image recognition.
If this is a little too much information, or if this is part of a workflow that you'll be doing a lot and you want to get a stronger handle on it, Coursera offers a course on Serverless Machine Learning with Tensorflow. (I have taken it, and was really impressed with the quality of the Google Cloud offerings on Coursera.)
I am sorry for answering even though I am completely ignorant of what Datalab is, but have you tried batching?
I don't know whether it is possible in this scenario, but you could feed in maybe only 10,000 entries at a time and repeat this in as many batches as it takes until all entries have been processed.
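A minimal sketch of that batching idea using only the standard library, assuming the CSV has a header row. The function name and the file name in the usage note are hypothetical, and what you do with each chunk (for example, one training step) is up to your own code.

```python
import csv
from itertools import islice


def read_in_chunks(csv_file, chunk_size=10_000):
    """Yield (header, rows) with at most chunk_size rows per chunk."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:  # empty file: nothing to yield
        return
    while True:
        # islice pulls only the next chunk_size rows from the file.
        rows = list(islice(reader, chunk_size))
        if not rows:
            break
        yield header, rows
```

Usage would look like `with open("data.csv", newline="") as f: for header, rows in read_in_chunks(f): ...`, so that only 10,000 rows are in memory at any time.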