Specify checkpoint path in custom docker image in SageMaker

I am training a model on SageMaker using a custom docker image.
I need to specify the local path (the one in the container) used to store checkpoints, so that SageMaker can copy its output to S3.
According to the documentation here (https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html), I can do that when I initialize the Estimator:
# The local path where the model will save its checkpoints in the training container
checkpoint_local_path="/opt/ml/checkpoints"
estimator = Estimator(
    ...
    image_uri="<ecr_path>/<algorithm-name>:<tag>",  # the custom training image in ECR
    output_path=bucket,
    base_job_name=base_job_name,
    # Parameters required to enable checkpointing
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path
)
I'd prefer to specify checkpoint_local_path as part of the docker build instead. Is there a way to do that when building the image, perhaps using an environment variable? This would also be more consistent with what AWS recommends: "We recommend specifying the local paths as '/opt/ml/checkpoints' to be consistent with the default SageMaker checkpoint settings."

Unless you dislike the /opt/ml/checkpoints name, you don't need to specify anything in your docker image, apart from writing to /opt/ml/checkpoints (and reading from it if you're doing transfer learning or want to pick up from previously saved checkpoints).
Anything you write to /opt/ml/checkpoints in your container will be saved in S3 at the location you specify in checkpoint_s3_uri='s3://...'
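To illustrate, here is a minimal sketch of the training-script side under that assumption; the CHECKPOINT_DIR constant and the helper functions are illustrative conventions, not anything SageMaker provides:

import os

# Local convention inside the container; SageMaker syncs whatever is written
# here to checkpoint_s3_uri, provided checkpoint_local_path points at the same directory.
CHECKPOINT_DIR = "/opt/ml/checkpoints"

def save_checkpoint(state_bytes, step):
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, "checkpoint-{}.bin".format(step))
    with open(path, "wb") as f:
        f.write(state_bytes)
    return path

def latest_checkpoint():
    # Pick up a previously saved checkpoint, e.g. when resuming after a spot interruption.
    if not os.path.isdir(CHECKPOINT_DIR):
        return None
    files = sorted(os.listdir(CHECKPOINT_DIR))
    return os.path.join(CHECKPOINT_DIR, files[-1]) if files else None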

Related

After training in AI Platform, where can I find model.bst or other model file?

I trained an XGBoost model using AI Platform as described here.
Now I have the choice in the Console to download the model, as follows (but not Deploy it, since "Only models trained with built-in algorithms can be deployed from this page"). So, I click to download.
However, in the bucket the only file I see is a tar, as follows.
That tar (directory tree follows) holds only some training code, and not a model.bst, model.pkl, or model.joblib, or other such model file.
Where do I find model.bst or the like, which I can deploy?
EDIT:
Following the answer below, we see that the "Download model" button is misleading, as it sends us to the job directory rather than the output directory (which is set arbitrarily in the code); the model is at census_data_20210527_215945/model.bst
bucket = storage.Client().bucket(BUCKET_ID)
blob = bucket.blob('{}/{}'.format(
    datetime.datetime.now().strftime('census_%Y%m%d_%H%M%S'),
    model))
blob.upload_from_filename(model)
Only built-in algorithms automatically store the model in Google Cloud Storage.
In your case, you have a custom training application.
You have to take care of saving the model on your own.
Referring to your example, this is implemented as listed here.
The model is uploaded to Google Cloud Storage using the cloud storage client.
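As a rough illustration of that pattern, here is a sketch of training an XGBoost model locally and uploading it to Cloud Storage with the storage client; the bucket name, dummy training data, and output prefix are placeholders rather than values from the question:

import datetime
import numpy as np
import xgboost as xgb
from google.cloud import storage

BUCKET_ID = "my-training-bucket"  # placeholder bucket name
model_filename = "model.bst"

# Train a small booster on dummy data and save it locally first.
dtrain = xgb.DMatrix(np.random.rand(20, 4), label=np.random.rand(20))
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=10)
booster.save_model(model_filename)

# Then upload it under a timestamped prefix, mirroring the snippet above;
# the model ends up at gs://<bucket>/census_<timestamp>/model.bst.
bucket = storage.Client().bucket(BUCKET_ID)
blob_name = '{}/{}'.format(
    datetime.datetime.now().strftime('census_%Y%m%d_%H%M%S'),
    model_filename)
bucket.blob(blob_name).upload_from_filename(model_filename)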

Modifying image in Active Storage cloud

I'm using Rails 5.2 and GCS as cloud service.
I'd like to give users the opportunity to crop and rotate their images.
User has many Images, Image has one :image_file attached
In development I use the following method:
class Image
  ...
  def rotate(degree)
    image = MiniMagick::Image.new(ActiveStorage::Blob.service.send(:path_for, self.image_file.key))
    image.rotate "#{degree}"
    image.write(ActiveStorage::Blob.service.send(:path_for, self.image_file.key))
    self.image_file.blob.analyze
  end
  ...
end
But I can't figure out how to get at the image files in the cloud.
I've managed to download the file to local storage and perform all the operations needed.
Now all that remains is to replace the file in the cloud (delete the current one and create a new one with the same name), without changing anything in the database records if possible, but I can't figure out how to do this with Active Storage.
At the very least I need to get the file's name in the cloud so I can fall back to the bare google-cloud-ruby gem.
To list files stored in a Cloud Storage bucket using Ruby on Rails, see the code example defined here. You can also upload files to a Cloud Storage bucket and delete files from it using Ruby on Rails.
Also, since you are allowing your customers to modify their files in Cloud Storage buckets, you may consider using versioning. This will incur additional cost but will provide reliability for your customers.
Here is the link to Ruby on Google Cloud Platform documentation which might be helpful to you.

How to make parameters available to SageMaker Tensorflow Endpoint

I'm looking to make some hyperparameters available to the serving endpoint in SageMaker. The training instance is given access to input parameters using hyperparameters in:
estimator = TensorFlow(entry_point='autocat.py',
                       role=role,
                       output_path=params['output_path'],
                       code_location=params['code_location'],
                       train_instance_count=1,
                       train_instance_type='ml.c4.xlarge',
                       training_steps=10000,
                       evaluation_steps=None,
                       hyperparameters=params)
However, when the endpoint is deployed, there is no way to pass in parameters that are used to control the data processing in the input_fn(serialized_input, content_type) function.
What would be the best way to pass parameters to the serving instance? Is the source_dir parameter defined in the sagemaker.tensorflow.TensorFlow class copied to the serving instance? If so, I could use a config.yml or similar.
Ah, I have had a similar problem where I needed to download something from S3 to use in the input_fn for inference. In my case it was a dictionary.
Three options:
Use your config.yml approach, and download and import the S3 file from within your entry point file before any function declarations. This makes it available to input_fn (a sketch of this option follows below).
Keep using the hyperparameter approach, download and import the vectorizer in serving_input_fn and make it available via a global variable so that input_fn has access to it.
Download the file from s3 before training and include it in the source_dir directly.
Option 3 would only work if you didn't need to make changes to the vectorizer separately after initial training.
Whatever you do, don't download the file directly in input_fn. I made that mistake and the performance is terrible, as each invocation of the endpoint results in the S3 file being downloaded again.
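For illustration, here is a rough sketch of option 1: the entry point module downloads a config file from S3 once at import time, so input_fn can use it without any per-request S3 calls. The bucket, key, and config contents are placeholders, not values from the question:

import json
import boto3

# Runs once, when the serving container imports the entry point module,
# rather than on every request.
s3 = boto3.client("s3")
s3.download_file("my-config-bucket", "serving/config.json", "/tmp/config.json")  # placeholder names

with open("/tmp/config.json") as f:
    CONFIG = json.load(f)

def input_fn(serialized_input, content_type):
    # CONFIG is already in memory here; no S3 access per invocation.
    data = json.loads(serialized_input)
    return {name: data.get(name) for name in CONFIG.get("feature_names", [])}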
Hyperparameters are used in the training phase to allow you to tune your model (hyperparameter optimization, HPO). Once you have a trained model, these hyperparameters are not needed for inference.
When you want to pass features to the serving instances, you usually do that in the BODY of each request to the invoke-endpoint API call (for example, see here: https://docs.aws.amazon.com/sagemaker/latest/dg/tf-example1-invoke.html) or in the call to the predict wrapper in the SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow). You can see such examples in the sample notebooks (https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/tensorflow_iris_byom/tensorflow_BYOM_iris.ipynb).
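For example, here is a minimal sketch of passing features in the request body via boto3; the endpoint name and payload shape are placeholders, not values from the question:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Features travel in the request body of each invocation, not as hyperparameters.
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}  # placeholder feature vector
response = runtime.invoke_endpoint(
    EndpointName="my-tf-endpoint",   # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))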
Yes, one option is to add your configuration file to source_dir and load the file in the input_fn.
Another option is to use serving_input_fn(hyperparameters). That function transforms the TensorFlow model into a TensorFlow Serving model. For example:
def serving_input_fn(hyperparameters):
    # gets the input shape from the hyperparameters
    shape = hyperparameters.get('input_shape', [1, 7])
    tensor = tf.placeholder(tf.float32, shape=shape)
    # returns the ServingInputReceiver object.
    return build_raw_serving_input_receiver_fn({INPUT_TENSOR_NAME: tensor})()

Where are models saved by default?

I've submitted a training job to the cloud using the RESTful API and see in the console logs that it completed successfully. In order to deploy the model and use it for predictions I have saved the final model using tf.train.Saver().save() (according to the how-to guide).
When running locally, I can find the graph files (export-* and export-*.meta) in the working directory. When running on the cloud however, I don't know where they end up. The API doesn't seem to have a parameter for specifying this, it's not in the bucket with the trainer app, and I can't find any temporary buckets on the cloud storage created by the job.
When you set up your Cloud ML environment you set up a bucket for this purpose. Have you looked in there?
https://cloud.google.com/ml/docs/how-tos/getting-set-up
Edit (for future reference): As Robert mentioned in the comments, you'll want to pass the output location to the job as an argument. A couple of things to be mindful of:
Use a unique output location per job, so one job doesn't clobber the outputs of another.
The recommendation is to specify the parent output path, and use it to contain the exported model in a subpath called 'model', as well as organizing other outputs like checkpoints and summaries within that path. That makes it easier to manage all the outputs.
While not required, I'll also suggest staging the training code in a packages subpath of the output, which helps correlate the source with the outputs it produces.
Finally, keep in mind that when you use hyperparameter tuning, you'll need to append the trial id to the output path for the outputs produced by individual runs.
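For illustration, here is a sketch of deriving that layout in the trainer, assuming the parent output path is passed in as an argument and, for hyperparameter tuning, that the trial id can be read from the TF_CONFIG environment variable (that key is an assumption about the tuning environment; verify it for your setup):

import json
import os

def output_paths(job_dir):
    # During hyperparameter tuning each trial gets its own id; append it so
    # individual runs don't clobber each other's outputs.
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    trial = tf_config.get("task", {}).get("trial", "")
    if trial:
        job_dir = os.path.join(job_dir, str(trial))
    return {
        "model": os.path.join(job_dir, "model"),              # exported model
        "checkpoints": os.path.join(job_dir, "checkpoints"),
        "summaries": os.path.join(job_dir, "summaries"),
        "packages": os.path.join(job_dir, "packages"),        # staged training code
    }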

Saving a file in AWS filesystem

Hi, I am trying out OpenCV in AWS Lambda. I want to save an SVM model to a text file so that I can load it again. Is it possible to save it in the tmp directory and load it from there whenever I need it, or will I have to use S3?
I am using python and trying to do something like this:
# saving the model
svm.save("/tmp/svm.dat")
# Loading the model
svm = cv2.ml.SVM_load("/tmp/svm.dat")
It's not possible, as the Lambda execution environment is distributed and the same function might therefore run on several different instances.
The alternative is to save your svm.dat to S3 and then download it every time your Lambda function starts.
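For illustration, a minimal sketch of that approach; the bucket and key names are placeholders. Downloading at module level (outside the handler) means warm invocations of the same container reuse the already-downloaded file instead of hitting S3 every time:

import os
import boto3
import cv2

BUCKET = "my-model-bucket"   # placeholder bucket name
KEY = "models/svm.dat"       # placeholder object key
LOCAL_PATH = "/tmp/svm.dat"

# Runs once per container start, outside the handler, so warm invocations
# skip the download.
if not os.path.exists(LOCAL_PATH):
    boto3.client("s3").download_file(BUCKET, KEY, LOCAL_PATH)

svm = cv2.ml.SVM_load(LOCAL_PATH)

def handler(event, context):
    # Use the already-loaded model for prediction here.
    return {"statusCode": 200}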