How to prepare a Docker image for a SageMaker training job

Let's say I have a Docker image that contains the training code for a machine learning model. How can I adapt it for a SageMaker Training Job so I can run the image there?

There are several things to keep in mind when adapting a Docker image for a SageMaker Training Job:
Training code
The training code should live in /opt/ml/code/ inside the image, and the main script should be /opt/ml/code/train. The script must be executable (chmod 777 /opt/ml/code/train does the trick, though chmod +x is sufficient). Not critical but useful: if your script imports other modules from that directory, add it to the path with export PATH="/opt/ml/code:${PATH}". By default SageMaker starts training with docker run image train, but alternatively you can configure a custom entrypoint so that you can keep the training code at its original path.
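For illustration - not part of the original answer - a minimal sketch of what /opt/ml/code/train could look like, assuming a Python-based image with python3 on the PATH (the input/output paths are the standard SageMaker locations; everything else is illustrative):
#!/usr/bin/env python3
# Saved as /opt/ml/code/train and made executable in the Dockerfile.
import json

def main():
    # Read hyperparameters (see the next section), load input channels from
    # /opt/ml/input/data/, train, and write artifacts to /opt/ml/model/.
    with open("/opt/ml/input/config/hyperparameters.json") as f:
        hyperparameters = json.load(f)
    print(f"Starting training with hyperparameters: {hyperparameters}")
    # ... actual training logic goes here ...

if __name__ == "__main__":
    main()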
Hyperparameters
Hyperparameters provided in the HyperParameters field of the training job are saved as a JSON file at /opt/ml/input/config/hyperparameters.json, so your training code has to read them from there (a minimal reading sketch follows the list below). Keep in mind that only string values are supported and the following limits apply:
Map Entries: Minimum number of 0 items. Maximum number of 100 items.
Key Length Constraints: Maximum length of 256.
Value Length Constraints: Maximum length of 2500.
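A minimal sketch of reading the hyperparameters back, assuming the training code runs in Python; the keys epochs and lr and their defaults are illustrative, not part of the original answer:
import json

# All values arrive as strings, so cast them to the types your code expects.
with open("/opt/ml/input/config/hyperparameters.json") as f:
    hyperparameters = json.load(f)

epochs = int(hyperparameters.get("epochs", "10"))          # illustrative key/default
learning_rate = float(hyperparameters.get("lr", "0.001"))  # illustrative key/default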
Output
To save the model, put all model files into the /opt/ml/model/ directory - SageMaker compresses everything in that directory into a tar archive and uploads it to the S3 location specified in the training job.
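A small sketch of the idea; the file name and the joblib call are illustrative (use whatever save mechanism your framework provides), and model stands for your trained object:
import os
import joblib  # illustrative; any framework's save call works the same way

MODEL_DIR = "/opt/ml/model"
os.makedirs(MODEL_DIR, exist_ok=True)
# Everything written here ends up in the tar archive in the S3 output location.
joblib.dump(model, os.path.join(MODEL_DIR, "model.joblib"))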
[Optional] Metrics
Metrics are gathered by running regular expressions over the logs the container produces. They are defined in the metric definitions of the training job, where you provide a regular expression for each log pattern you already emit (e.g. the loss after each epoch with val_loss: (.*)). The minimal setup for a metric-friendly log is: log f"Value = {value}" in the training code, use the regex Value = (.*), and declare the metric definition {"Name": "value", "Regex": "Value = (.*)"}.
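For illustration, a hedged sketch of the two halves - the log line in the training code and the matching metric definition - assuming the job is created with SageMaker Python SDK v2 (the image URI, role, and instance settings are placeholders, not from the original answer):
# In the training code: emit a log line the regex can match.
value = 0.42  # illustrative metric value
print(f"Value = {value}")

# In the job definition:
import sagemaker

estimator = sagemaker.estimator.Estimator(
    image_uri="<your-training-image>",   # placeholder
    role="<your-execution-role>",        # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    metric_definitions=[{"Name": "value", "Regex": "Value = (.*)"}],
)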
Important documentation:
Container structure - https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-toolkits.html
How the container is started - https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html
Output - https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html

Related

GCP Vertex AI Training: Auto-packaged Custom Training Job Yields Huge Docker Image

I am trying to run a Custom Training Job in Google Cloud Platform's Vertex AI Training service.
The job is based on a tutorial from Google that fine-tunes a pre-trained BERT model (from HuggingFace).
When I use the gcloud CLI tool to auto-package my training code into a Docker image and deploy it to the Vertex AI Training service like so:
$BASE_GPU_IMAGE="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest"
$BUCKET_NAME = "my-bucket"
gcloud ai custom-jobs create `
--region=us-central1 `
--display-name=fine_tune_bert `
--args="--job_dir=$BUCKET_NAME,--num-epochs=2,--model-name=finetuned-bert-classifier" `
--worker-pool-spec="machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,executor-image-uri=$BASE_GPU_IMAGE,local-package-path=.,python-module=trainer.task"
... I end up with a Docker image that is roughly 18GB (!) and takes a very long time to upload to the GCP registry.
Granted, the base image is around 6.5 GB, but where do the additional >10 GB come from, and is there a way for me to avoid this "image bloat"?
Please note that my job loads the training data using the datasets Python package at run time and AFAIK does not include it in the auto-packaged docker image.
The image size shown in the UI is the virtual size of the image. It is the compressed total image size that will be downloaded over the network. Once the image is pulled, it will be extracted and the resulting size will be bigger. In this case, the PyTorch image's virtual size is 6.8 GB while the actual size is 17.9 GB.
Also, when a docker push command is executed, the progress bars show the uncompressed size. The actual amount of data that’s pushed will be compressed before sending, so the uploaded size will not be reflected by the progress bar.
To cut down the size of the Docker image, custom containers can be used: only the necessary components are installed, which results in a smaller image. More information on custom containers is available here.

Used Dataflow's DLP to read from GCS and write to BigQuery - Only 50% data written to BigQuery

I recently started a Dataflow job to load data from GCS, run it through DLP's identification template, and write the masked data to BigQuery. I could not find a Google-provided template for batch processing, so I used the streaming one (ref: link).
I see that only 50% of the rows are written to the destination BigQuery table, and there has been no activity on the pipeline for a day even though it is still in the running state.
Yes, the DLP Dataflow template is a streaming pipeline, but with some easy changes you can also use it as a batch pipeline. Here is the template source code. As you can see, it uses the FileIO transform and polls/watches for new files every 30 seconds. If you take out the window transform and the continuous-polling syntax, you should be able to execute it as a batch job.
As for the pipeline not processing all the data, can you confirm whether you are running a large file with the default settings (e.g. workerMachineType, numWorkers, maxNumWorkers)? The current pipeline code uses line-based offsetting, which requires a highmem machine type and a large number of workers if the input file is large. For example, for a 10 GB, 80M-line file you may need 5 highmem workers.
One thing you can try is to trigger the pipeline with more resources, e.g. --workerMachineType=n1-highmem-8, numWorkers=10, maxNumWorkers=10, and see if that helps.
Alternatively, there is a V2 solution that uses byte-based offsetting with the state and timer APIs for optimized batching and resource utilization that you can try out.

What is the best way to feed image data (tfrecords) from GCS to your model?

I set myself the goal of solving the MNIST Skin Cancer dataset using only Google Cloud,
using GCS and Kubeflow on Google Kubernetes Engine.
I converted the data from JPEG to TFRecord with the following script:
https://github.com/tensorflow/tpu/blob/master/tools/datasets/jpeg_to_tf_record.py
I have seen a lot of examples of how to feed a CSV file to a model, but no examples with image data.
Would it be smart to copy all the TFRecords to the Google Cloud Shell so I can feed the data to my model that way?
Or are there better methods available?
Thanks in advance.
Since you are using Kubeflow, I would suggest using Kubeflow Pipelines.
For the preprocessing you could use an image that is built on top of the standard pipeline Dataflow image gcr.io/ml-pipeline/ml-pipeline-dataflow-tft:latest, into which you simply copy your Dataflow code and run it:
FROM gcr.io/ml-pipeline/ml-pipeline-dataflow-tft:latest
RUN mkdir /{folder}
COPY run_dataflow_pipeline.py /{folder}
ENTRYPOINT ["python", "/{folder}/run_dataflow_pipeline.py"]
See this boilerplate for the dataflow code that does exactly this. The idea is that you write the TF records to Google Cloud Storage (GCS).
Subsequently you could use Google Cloud's ML Engine for the actual training. In this case you can also start from the google/cloud-sdk:latest image and copy over the required files, typically together with a bash script that runs the gcloud commands to start the training job.
FROM google/cloud-sdk:latest
RUN mkdir -p /{src}
WORKDIR /{src}
COPY train.sh ./
ENTRYPOINT ["bash", "./train.sh"]
An elegant way to pass the storage location of your TFRecords to your model is to use tf.data:
# Construct a TFRecordDataset from TFRecord files stored in GCS
import os

import tensorflow as tf
from google.cloud import storage

# {BUCKET_NAME} is a placeholder for your bucket name
bucket = storage.Client().bucket('{BUCKET_NAME}')
train_records = [os.path.join('gs://{BUCKET_NAME}/', f.name)
                 for f in bucket.list_blobs(prefix='data/TFR/train')]
validation_records = [os.path.join('gs://{BUCKET_NAME}/', f.name)
                      for f in bucket.list_blobs(prefix='data/TFR/validation')]

# `decode` is your own parsing function for a serialized tf.Example
# (e.g. tf.io.parse_single_example plus image decoding)
ds_train = tf.data.TFRecordDataset(train_records, num_parallel_reads=4).map(decode)
ds_val = tf.data.TFRecordDataset(validation_records, num_parallel_reads=4).map(decode)

# Potential additional steps for performance:
# https://www.tensorflow.org/guide/performance/datasets

# Train the model (`model` is your compiled tf.keras model)
model.fit(ds_train,
          validation_data=ds_val,
          # ... other fit arguments (epochs, callbacks, ...) ...
          verbose=2)
Check out this blog post for an actual implementation of a similar (more complex) Kubeflow pipeline.

How to pass bigger .csv files to Amazon SageMaker for predictions using batch transform jobs

I created a custom model and deployed it on SageMaker. I am invoking the endpoint using batch transform jobs. It works if the input file is small, i.e., the number of rows in the CSV file is small. If I upload a file with around 200,000 rows, I get this error in the CloudWatch logs.
2018-11-21 09:11:52.666476: W external/org_tensorflow/tensorflow/core/framework/allocator.cc:113]
Allocation of 2878368000 exceeds 10% of system memory.
2018-11-21 09:11:53.166493: W external/org_tensorflow/tensorflow/core/framework/allocator.cc:113]
Allocation of 2878368000 exceeds 10% of system memory.
[2018-11-21 09:12:02,544] ERROR in serving: <_Rendezvous of RPC that
terminated with:
#011status = StatusCode.DEADLINE_EXCEEDED
#011details = "Deadline Exceeded"
#011debug_error_string = "
{
"created": "#1542791522.543282048",
"description": "Error received from peer",
"file": "src/core/lib/surface/call.cc",
"file_line": 1017,
"grpc_message": "Deadline Exceeded",
"grpc_status": 4
}
"
Any ideas what might be going wrong? This is the Transformer I am using to create the transform job:
transformer = sagemaker.transformer.Transformer(
    base_transform_job_name='Batch-Transform',
    model_name='sagemaker-tensorflow-2018-11-21-07-58-15-887',
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path='s3://2-n2m-sagemaker-json-output/out_files/'
)
input_location = 's3://1-n2m-n2g-csv-input/smal_sagemaker_sample.csv'
transformer.transform(input_location, content_type='text/csv', split_type='Line')
The .csv file contains 2 columns, the customer's first and last name, which I then preprocess in SageMaker itself using input_fn().
The error looks to be coming from the gRPC client closing the connection before the server is able to respond. (There is an existing feature request for the SageMaker TensorFlow container at https://github.com/aws/sagemaker-tensorflow-container/issues/46 to make this timeout configurable.)
You could try out a few things with the SageMaker Transformer to limit the size of each individual request so that it fits within the timeout (see the sketch after this list):
Set max_payload to a smaller value, say 2-3 MB (the default is 6 MB).
If your instance metrics indicate it has compute/memory resources to spare, try max_concurrent_transforms > 1 to make use of multiple workers.
Split up your CSV file into multiple input files. With a bigger dataset, you could also increase the instance count to fan out processing.
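For illustration, a hedged sketch of the questioner's Transformer with those limits applied (the specific values are illustrative; max_payload is expressed in MB):
import sagemaker

transformer = sagemaker.transformer.Transformer(
    base_transform_job_name='Batch-Transform',
    model_name='sagemaker-tensorflow-2018-11-21-07-58-15-887',
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path='s3://2-n2m-sagemaker-json-output/out_files/',
    max_payload=2,                # MB per request (the default is 6)
    max_concurrent_transforms=2,  # parallel requests per worker, if resources allow
)
input_location = 's3://1-n2m-n2g-csv-input/smal_sagemaker_sample.csv'
transformer.transform(input_location, content_type='text/csv', split_type='Line')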
A change has since been merged that allows users to configure the timeout through an environment variable, SAGEMAKER_TFS_GRPC_REQUEST_TIMEOUT (a sketch of setting it follows the links below).
https://github.com/aws/sagemaker-tensorflow-container/pull/135
https://github.com/aws/sagemaker-tensorflow-container/blob/master/src/tf_container/proxy_client.py#L30
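For illustration, a hedged sketch of one way to set that environment variable, assuming the model is created through the SageMaker Python SDK; the image, model data path, role, and timeout value are placeholders, not from the linked PR:
import sagemaker

model = sagemaker.model.Model(
    image_uri="<your-serving-image>",           # placeholder
    model_data="s3://<bucket>/model.tar.gz",    # placeholder
    role="<your-execution-role>",               # placeholder
    env={"SAGEMAKER_TFS_GRPC_REQUEST_TIMEOUT": "600"},  # illustrative value
)
transformer = model.transformer(instance_count=1, instance_type="ml.m4.xlarge")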

Where are models saved by default?

I've submitted a training job to the cloud using the RESTful API and see in the console logs that it completed successfully. In order to deploy the model and use it for predictions I have saved the final model using tf.train.Saver().save() (according to the how-to guide).
When running locally, I can find the graph files (export-* and export-*.meta) in the working directory. When running on the cloud however, I don't know where they end up. The API doesn't seem to have a parameter for specifying this, it's not in the bucket with the trainer app, and I can't find any temporary buckets on the cloud storage created by the job.
When you set up your Cloud ML environment you set up a bucket for this purpose. Have you looked in there?
https://cloud.google.com/ml/docs/how-tos/getting-set-up
Edit (for future record): As Robert mentioned in the comments, you'll want to pass the output location to the job as an argument. A couple of things to be mindful of:
Use a unique output location per job, so one job doesn't clobber the outputs of another.
The recommendation is to specify the parent output path, and use it to contain the exported model in a subpath called 'model', as well as organizing other outputs like checkpoints and summaries within that path. That makes it easier to manage all the outputs.
While not required, I'll also suggest staging the training code in a packages subpath of the output, which helps correlate the source with the outputs it produces.
Finally(!), also keep in mind that when you use hyperparameter tuning, you'll need to append the trial id to the output path so that the outputs produced by individual runs don't collide (see the sketch after this list).
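For illustration, a hedged sketch of appending the trial id, assuming it is exposed through the TF_CONFIG environment variable as on Cloud ML Engine; the helper name and bucket path are hypothetical:
import json
import os

def trial_output_dir(base_output_path):
    # During hyperparameter tuning each trial gets its own id in TF_CONFIG;
    # outside of tuning the 'trial' key is absent and the base path is used as-is.
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    trial_id = str(tf_config.get("task", {}).get("trial", ""))
    return os.path.join(base_output_path, trial_id) if trial_id else base_output_path

output_dir = trial_output_dir("gs://my-bucket/jobs/my_job/")  # hypothetical path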