Tuning a SageMaker Pipeline

Is it possible to do hyper-parameter optimization on SageMaker for a flow (i.e., a pipeline) of a processing job followed by a training job?
In SageMaker Pipelines, I see that I can use the tuner step with any training step. However, I can't find any helpful resource on integrating the processing job into the optimization.
Any ideas on how to do this task without merging two steps into one step?
A relatively old question asked about optimizing two models jointly; here, I am asking about a Processing job and a Training job.
I really appreciate any help you can provide.

There is no way to run a HyperparameterTuner on a Processor; it expects an estimator as input. However, you can put your processing logic, with its hyperparameters, into an appropriate Estimator (sklearn, for example) and output the fitted preprocessing object as an artifact from /opt/ml/model/model.joblib to S3, i.e., as a model artifact. When tuning is done, load it into the Processor via model_dir, and you have a Processor with tuned hyperparameters.
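A hedged sketch of that workaround, not an official pattern: preprocess.py, the metric name and regex, the hyperparameter range, and the S3 paths are all placeholders.

```python
# Rough sketch: wrap the preprocessing logic in an SKLearn estimator so
# HyperparameterTuner can tune it. "preprocess.py" is a hypothetical script
# that fits the preprocessor, prints a metric, and saves /opt/ml/model/model.joblib.
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

preprocess_estimator = SKLearn(
    entry_point="preprocess.py",      # hypothetical preprocessing "training" script
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,                        # your SageMaker execution role
)

tuner = HyperparameterTuner(
    estimator=preprocess_estimator,
    objective_metric_name="preprocess:score",
    metric_definitions=[{"Name": "preprocess:score",
                         "Regex": "preprocess-score: ([0-9\\.]+)"}],
    hyperparameter_ranges={"n_components": IntegerParameter(2, 50)},
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/raw-data/"})

# The best job's model.tar.gz (containing model.joblib) can then be passed to a
# Processing job as an input so it runs with the tuned preprocessing parameters.
best_artifact = tuner.best_estimator().model_data
```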

Related

how to increase performance in aws comprehend on custom classification

I trained a custom classifier with just two tags in a CSV file.
I fed my custom classification model 1,000 texts for each tag,
but when I run a job with my custom classification model, the job takes ~5 minutes (running) to analyze one new text. I searched for this issue on AWS but couldn't find any answer...
How can I speed up / optimize my job for analyzing new text with the model?
Thank you in advance
Prior to Nov 2019, Comprehend only supported asynchronous inference for Custom classification. Asynchronous inference is optimized for bulk processing.
Comprehend has since launched real-time inference for Custom classification to satisfy the real-time needs of our customers.
https://docs.aws.amazon.com/comprehend/latest/dg/custom-sync.html
Note that Custom endpoints are charged by time units even when you're not actively using them. You can also look at the pricing document for details - https://aws.amazon.com/comprehend/pricing/
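For completeness, a hedged sketch of using real-time inference with boto3; the endpoint name, region, and model ARN below are placeholders, and the endpoint is billed for as long as it exists.

```python
# Rough sketch with boto3; names, region, and ARN are placeholders.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# One-time setup: create a real-time endpoint for the trained custom classifier.
endpoint = comprehend.create_endpoint(
    EndpointName="my-classifier-endpoint",
    ModelArn="arn:aws:comprehend:us-east-1:123456789012:document-classifier/my-model",
    DesiredInferenceUnits=1,   # capacity unit; billed while the endpoint exists
)

# Once the endpoint is IN_SERVICE, classify a single text with low latency
# instead of starting an asynchronous analysis job.
result = comprehend.classify_document(
    Text="Example text to classify",
    EndpointArn=endpoint["EndpointArn"],
)
print(result["Classes"])
```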

AWS Sagemaker - using cross validation instead of dedicated validation set?

When I train my model locally I use a 20% test set and then cross-validation. SageMaker seems to need a dedicated validation set (at least in the tutorials I've followed). Currently I have 20% test and 10% validation, leaving 70% to train - so I lose 10% of my training data compared to when I train locally, and there is some performance loss as a result.
I could just take my locally trained models and overwrite the sagemaker models stored in s3, but that seems like a bit of a work around. Is there a way to use Sagemaker without having to have a dedicated validation set?
Thanks
SageMaker training takes a single training set, whereas in cross-validation you iterate over, for example, 5 different training sets, each validated on a different hold-out set, so the SageMaker training service is not well suited for cross-validation. Of course, cross-validation is usually most useful with small (to be accurate, low-variance) data, so in those cases you can set the training infrastructure to local mode (so it doesn't take a lot of time) and then iterate manually to get cross-validation functionality. But it's not something you get out of the box.
Sorry, can you please elaborate which tutorials you are referring to when you say "SageMaker seems like it needs a dedicated validation set (at least in the tutorials I've followed)."
SageMaker training exposes the ability to separate datasets into "channels" so you can separate your dataset in whichever way you please.
See here for more info: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-trainingdata
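As a hedged illustration of the channels mechanism (the estimator, S3 paths, and channel names below are placeholders), you can pass only the channels you actually want, e.g. skip a validation channel entirely and do cross-validation inside your own training script:

```python
# Minimal sketch, assuming `estimator` is any SageMaker Estimator; channel
# names are arbitrary strings, so no "validation" channel is required.
from sagemaker.inputs import TrainingInput

estimator.fit({
    # Single training channel with the full 80% split; cross-validation can be
    # implemented inside the training script instead of reserving extra data.
    "train": TrainingInput("s3://my-bucket/train/"),
    "test": TrainingInput("s3://my-bucket/test/"),
})
```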

Is it possible to parallelize preprocessings with tensorflow-transform on my machine?

I am trying to preprocess larger amounts of data (one TFRecord file, ~1 GB) using tensorflow-transform v0.11.0 and Beam, only locally.
My code is largely inspired by https://github.com/tensorflow/transform/blob/master/examples/census_example.py
I have a Beam pipeline that works on smaller datasets (<100 MB), but the processing time increases dramatically as I add more data. Being new to tf-transform and Apache Beam, I have a hard time finding causes of and solutions to the problem... and I would like to avoid using Google Dataflow.
My pipeline runs locally using the Beam DirectRunner, if I understood correctly, but it uses only one core. Using multiple cores could be one way to improve my preprocessing time, but I do not know if that is possible with the DirectRunner. Is there a way to make a tensorflow-transform pipeline run on multiple cores on my machine?
I looked in the options of the Beam pipeline and of the DirectRunner, and I can't find any indication that a runner can be given access to multiple cores or that multiple DirectRunners can be created for one pipeline.
Thank you very much for any help I can get!
To add to Anton's comment,
You can utilize Apache Flink to run the pipeline in parallel. More details are summarized in Tensorflow transform on beams with flink runner
You will also have to set the parallelism according to the total number of cores and start that many Flink TaskManagers. My recommendation would be to set the parallelism to (total number of cores / 2).
I don't believe that's supported. The Direct Runner's main purpose is to make sure the pipeline implements the Beam model correctly. It is not optimized for production use and will probably actually introduce inefficiencies: https://beam.apache.org/documentation/runners/direct/
As a workaround you can manually start multiple direct runner pipelines to process different portions of data.
Better option would be to use an actual parallel runner to run these kinds of jobs, e.g. you can spin up a Flink cluster: https://beam.apache.org/documentation/runners/flink/
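A hedged sketch of what pointing the Beam Python pipeline at a local Flink cluster might look like; the master address and parallelism value are placeholders, and exact option names can vary between Beam versions.

```python
# Rough sketch: run the existing tf.Transform/Beam pipeline on a local Flink
# cluster instead of the DirectRunner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",  # JobManager of the running Flink cluster
    "--parallelism=4",                # e.g. total cores / 2, as suggested above
])

with beam.Pipeline(options=options) as pipeline:
    # ... build the same tf.Transform preprocessing pipeline here ...
    pass
```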
@Ankur @Anton Thanks for your answers. I agree that this approach is not production-friendly... We will try two other solutions:
tensorflow-transform on Dataflow
removing tensorflow-transform altogether and using Presto to get vocabulary files for categorical inputs, compute means and standard deviations to scale numerical inputs, etc., over the whole dataset

Cloud ML: Varying training time taken for the same data

I am using Google Cloud ML for training jobs. I observe peculiar behavior in which the time taken for the training job to complete varies for the same data. I analyzed the CPU and memory utilization in the Cloud ML console and see very similar utilization in both cases (7 min and 14 min).
Can anyone let me know what could cause the service to take inconsistent amounts of time to complete the job?
I have the same parameters and data in both the cases and also verified that the time spent in the PREPARING phase is pretty much the same in both cases.
Also, would it matter that I schedule multiple independent training jobs simultaneously on the same project? If so, I would like to know the rationale behind it.
Any help would be greatly appreciated.
The easiest way is to add more logging to inspect where the time was spent. You can also inspect training progress using TensorBoard. There's no VM sharing between multiple jobs, so it's unlikely caused by simultaneous jobs.
Also, the running time should be measured from the point when the job enters the RUNNING state. Job startup latency varies depending on whether it's a cold or warm start (i.e., we keep the VMs from a previous job running for a while).
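As a trivial illustration of the logging suggestion (the phase names are placeholders), timing each phase of the trainer makes it easy to see which part varies between runs:

```python
# Minimal sketch: log how long each phase of the training script takes.
import logging
import time

logging.basicConfig(level=logging.INFO)

start = time.time()
# ... read and preprocess the input data ...
logging.info("data loading took %.1f s", time.time() - start)

start = time.time()
# ... run the training loop ...
logging.info("training took %.1f s", time.time() - start)
```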

How to make my datalab machine learning run faster

I have some data: 3.2 million entries in a CSV file. I'm trying to use a CNN estimator in TensorFlow to train the model, but it's very slow. Every time I run the script, it gets stuck, as in the web page (localhost) just refuses to respond anymore. Any recommendations? (I've tried with 22 CPUs and I can't increase that any further.)
Can I just run it in the background, e.g. python xxx.py & on the command line, to keep the process going, and then come back to check after some time?
Google offers serverless machine learning with TensorFlow for precisely this reason. It is called Cloud ML Engine. Your workflow would basically look like this:
Develop the program to train your neural network on a small dataset that can fit in memory (iron out the bugs, make sure it works the way you want)
Upload your full data set to the cloud (Google Cloud Storage or BigQuery or &c.) (documentation reference: training steps)
Submit a package containing your training program to ML Cloud (this will point to the location of your full data set in the cloud) (documentation reference: packaging the trainer)
Start a training job in the cloud; this is serverless, so it will take care of scaling to as many machines as necessary, without you having to deal with setting up a cluster, &c. (documentation reference: submitting training jobs).
You can use this workflow to train neural networks on massive data sets - particularly useful for image recognition.
If this is a little too much information, or if this is part of a workflow that you'll be doing a lot and you want to get a stronger handle on it, Coursera offers a course on Serverless Machine Learning with Tensorflow. (I have taken it, and was really impressed with the quality of the Google Cloud offerings on Coursera.)
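As a hedged illustration of the submission step, a training job can be created through the Cloud ML Engine v1 API (the gcloud CLI works as well); the project, bucket, package, and module names below are placeholders.

```python
# Rough sketch of projects.jobs.create on the Cloud ML Engine v1 API;
# all identifiers below are placeholders.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")

job_spec = {
    "jobId": "my_training_job_001",
    "trainingInput": {
        "packageUris": ["gs://my-bucket/packages/trainer-0.1.tar.gz"],
        "pythonModule": "trainer.task",
        "region": "us-central1",
        "scaleTier": "STANDARD_1",   # controls how many machines the job gets
        "runtimeVersion": "1.8",
    },
}

request = ml.projects().jobs().create(
    parent="projects/my-project", body=job_spec)
response = request.execute()
print(response["state"])
```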
I'm sorry for answering even though I'm completely ignorant about what Datalab is, but have you tried batching?
I'm not sure whether it's possible in this scenario, but maybe feed in only 10,000 entries at a time and do this in as many batches as needed until all the entries have been processed? A rough sketch is below.
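One way to batch, assuming a TensorFlow Estimator and a CSV with a label column (the column name and batch size are placeholders), is to stream the file with tf.data so only one batch sits in memory at a time:

```python
# Rough sketch: stream the 3.2M-row CSV in batches instead of loading it all
# at once. "label" and the batch size are placeholders.
import tensorflow as tf

def make_input_fn(csv_path, batch_size=10000):
    def input_fn():
        return tf.data.experimental.make_csv_dataset(
            csv_path,
            batch_size=batch_size,   # rows held in memory per training step
            label_name="label",      # hypothetical label column
            num_epochs=1,
            shuffle=True,
        )
    return input_fn

# estimator.train(input_fn=make_input_fn("data.csv"))
```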