AWS Sagemaker processing Job automatically created? - amazon-web-services

I haven't used SageMaker for a while, and today I started a training job (with the same old settings I have always used before), but this time I noticed that a processing job was automatically created and is running while my training job runs (I presume for debugging purposes).
I'm sure this is the first time this has happened. Is this a new feature introduced by SageMaker? I didn't find anything related in the documentation, but it's important to know because I don't want extra costs.
This is the image used by the processing job, with an instance type of ml.m5.2xlarge, which I didn't set anywhere:
929884845733.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest

I can answer my own question: it seems to be a new feature, as highlighted here. You can turn it off as suggested in the doc:
To disable both monitoring and profiling, include the disable_profiler parameter to your estimator and set it to True.
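For reference, a minimal sketch of what that could look like with the SageMaker Python SDK; the image, role and S3 paths below are placeholders:

# Sketch: turning off Debugger monitoring/profiling so that no
# sagemaker-debugger-rules processing job is launched alongside training.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-training-image>",        # placeholder
    role="<your-sagemaker-execution-role>",   # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    disable_profiler=True,                    # no profiling processing job is started
)
estimator.fit("s3://<your-bucket>/train")     # placeholder input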

Related

How to use SageMaker Experiments trackers and trial components?

I'm completely confused about how SageMaker Experiments works. I used the SDK to create an Experiment and a Trial. Now I want to track job parameters, metadata and metrics.
Should I create Trial Components manually with the SDK, or let the SageMaker Estimator's fit call create them for me?
After creating my experiment and trial, I use the code below:
job.fit(inputs,
        experiment_config={
            "ExperimentName": reg_experiment.experiment_name,
            "TrialName": trial1.trial_name,
            "TrialComponentDisplayName": "training-with-RF1"},
        wait=False)
When I look in Studio, I see an automatically created Trial component named "training-with-RF1".
I see here and here that we can (can = must? should? could?...) also create Trials manually, for example with
from smexperiments import tracker, trial

my_trial = trial.Trial.create('AutoML')
my_tracker = tracker.Tracker.create()
my_tracker.log_parameter('learning_rate', 0.01)
my_trial.add_trial_component(my_tracker)
Or here with
Trial.create(
    trial_name=trial_name,
    experiment_name=mnist_experiment.experiment_name,
    sagemaker_boto_client=sm)
When I create Trials like that manually, they appear as separate, empty Trials, distinct from the Trials created by the SageMaker jobs; see below.
I'm confused because the AWS blog post says we have to create Trials manually; however, SageMaker training jobs seem to be creating those Trials on our behalf...
I'm completely confused by this service; can someone please help?
The best way to do this is to create an Experiment, a Trial and then pass the experiment config to the Training Job. The training job will automatically create a Trial Component and add it to the Trial.
Depending on the type of training job you are using, some metrics will be tracked automatically in the Trial Component. You can set this up through the metric_definitions regex in the Estimator.
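For instance, a sketch of what metric_definitions could look like on a generic Estimator; the metric names and regexes are only illustrations and must match whatever your training script actually prints to stdout/stderr:

# Sketch only: image, role and regexes are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-training-image>",       # placeholder
    role="<your-sagemaker-execution-role>",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    metric_definitions=[
        {"Name": "train:loss", "Regex": "loss: ([0-9\\.]+)"},
        {"Name": "validation:accuracy", "Regex": "val_accuracy: ([0-9\\.]+)"},
    ],
)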
If you are running the training job in script mode, you can install sagemaker-experiments in the container running the job (or from the Python script using subprocess.call) and import the Tracker object. You can use the Tracker to log metrics from the training script to the Trial Component.
There are some examples here - https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-experiments
This is the documentation for sagemaker-experiments sdk - https://sagemaker-experiments.readthedocs.io/en/latest/tracker.html
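As a minimal sketch of the script-mode approach, assuming sagemaker-experiments is installed in the training container and the job was launched with an experiment_config; the parameter and metric values are illustrative:

# Inside the training script. Tracker.load() picks up the Trial Component
# that the training job created from its experiment_config.
from smexperiments.tracker import Tracker

with Tracker.load() as tracker:
    tracker.log_parameter("learning_rate", 0.01)   # illustrative value
    tracker.log_metric("val_accuracy", 0.93)       # would come from your evaluation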

Setting Instance Type in Sagemaker Experiments

I'm pretty new to SageMaker and had a quick question.
When using Experiments in SageMaker Studio, it spins up an ml.m5.4xlarge instance automatically.
Is there any way of setting a smaller instance size, maybe one that fits under the free tier?
I'm just playing around with Experiments and will need a few runs, so I want to keep the cost down to start with.
TIA
When you are using SageMaker Experiments, there is an associated training job or processing job with that Experiment. It is the training/processing job that spins up the instance, and that is where you can set the instance type. Experiments itself is available at no extra cost when using SageMaker Studio.
I work for AWS & my opinions are my own.
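As a rough sketch, the instance size is chosen on the Estimator for the training job that backs the experiment run, not on Experiments itself; all names and paths below are placeholders:

# Sketch: pick a smaller instance type on the training job behind the experiment.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<your-training-image>",       # placeholder
    role="<your-sagemaker-execution-role>",  # placeholder
    instance_count=1,
    instance_type="ml.m5.large",             # smaller/cheaper instance goes here
)
estimator.fit(
    "s3://<your-bucket>/train",              # placeholder input
    experiment_config={
        "ExperimentName": "my-experiment",   # placeholder
        "TrialName": "my-trial",             # placeholder
    },
)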

Reloading from checkpoint during AWS Sagemaker Training

SageMaker is a great tool to train your models, and we save some money by using AWS spot instances. However, training jobs sometimes get stopped in the middle. We are using some mechanisms to continue from the latest checkpoint after a restart. See also the docs.
Still, how do you efficiently test such a mechanism? Can you trigger it yourself? Otherwise you have to wait until the spot instance actually is restarted.
Also, are you expected to use the linked checkpoint_s3_uri argument or the model_dir for this? E.g. the TensorFlow estimator docs seem to suggest using model_dir for checkpoints.
Since you can't manually terminate a SageMaker instance, run an Amazon SageMaker managed spot training job for a small number of epochs; Amazon SageMaker will have backed up your checkpoint files to S3. Check that the checkpoints are there. Now run a second training job, but this time provide the first job's checkpoint location to checkpoint_s3_uri. The reference is here; it also answers your second question.
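A rough sketch of that two-run test with the generic Estimator; the image, role and bucket are placeholders:

# Sketch of managed spot training with checkpointing.
from sagemaker.estimator import Estimator

checkpoint_uri = "s3://<your-bucket>/checkpoints/my-job"   # placeholder

estimator = Estimator(
    image_uri="<your-training-image>",        # placeholder
    role="<your-sagemaker-execution-role>",   # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,
    max_run=3600,                      # max training time in seconds
    max_wait=7200,                     # must be >= max_run for spot training
    checkpoint_s3_uri=checkpoint_uri,  # SageMaker syncs /opt/ml/checkpoints here
)

# Run 1: train for a few epochs, then verify checkpoint files exist under
# checkpoint_uri in S3. Run 2: start a new job with the same checkpoint_s3_uri
# and have the training script resume from /opt/ml/checkpoints if files exist.
estimator.fit("s3://<your-bucket>/train")     # placeholder input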

How to automate the Updating/Editing of Amazon Data Pipeline

I want to use the AWS Data Pipeline service and have created some pipelines using the manual JSON-based mechanism, which uses the AWS CLI to create, put and activate the pipeline.
My question is: how can I automate the editing or updating of the pipeline if something changes in the pipeline definition? Things that I can imagine changing could be the schedule time, the addition or removal of Activities or Preconditions, references to DataNodes, resource definitions, etc.
Once the pipeline is created, we cannot edit quite a few things as mentioned here in the official doc: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-manage-pipeline-modify-console.html#dp-edit-pipeline-limits
This makes me believe that if I want to automate the updating of the pipeline, I would have to delete and re-create/activate a new pipeline. If so, the next question is: how can I create an automated process which identifies the previous version's ID, deletes it and creates a new one? Essentially I'm trying to build a release-management flow where the configuration JSON file is released and deployed automatically.
Most commands like activate, delete, list-runs, put-pipeline-definition etc. take the pipeline-id, which is not known until a new pipeline is created. I am unable to find anything which remains constant across updates or recreation (the unique-id and name parameters of the create-pipeline command are consistent, but I can't use them for the above-mentioned tasks; I need the pipeline-id for those).
Of course I can try writing shell scripts which grep and search the output, but is there any other, better way? Is there some other info that I am missing?
Thanks a lot.
You cannot fully edit schedules or change references, so deleting and re-creating pipelines seems to be the best approach for your scenario.
You'll need the pipeline-id to delete a pipeline. Is it not possible to keep a record of it somewhere? You could have a file with the last-used id stored locally or in S3, for instance.
Some other ways I can think of are:
- If you have only one pipeline in the account, you can list-pipelines and use the only result.
- If you have the pipeline name, you can list-pipelines and find the id.
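For what it's worth, here is a rough sketch of how the delete-and-recreate flow could be scripted around the AWS CLI from Python; the pipeline name and definition file are placeholders, it assumes the pipeline name is unique in the account, and error handling is omitted:

# Sketch of a delete-and-recreate release flow built on the AWS Data Pipeline CLI.
import json
import subprocess
import time

PIPELINE_NAME = "my-pipeline"                 # placeholder
DEFINITION_FILE = "pipeline-definition.json"  # placeholder

def aws(*args):
    # Run an AWS Data Pipeline CLI command and return its parsed JSON output, if any.
    out = subprocess.check_output(["aws", "datapipeline", *args])
    return json.loads(out) if out.strip() else {}

# 1. Find the previous version's id by name (assumes the name is unique).
existing = aws("list-pipelines")["pipelineIdList"]
old_ids = [p["id"] for p in existing if p["name"] == PIPELINE_NAME]

# 2. Delete the old pipeline(s).
for pid in old_ids:
    subprocess.check_call(["aws", "datapipeline", "delete-pipeline", "--pipeline-id", pid])

# 3. Create, define and activate the new version, using a fresh idempotency token per release.
unique_token = PIPELINE_NAME + "-" + time.strftime("%Y%m%d%H%M%S")
new_id = aws("create-pipeline", "--name", PIPELINE_NAME,
             "--unique-id", unique_token)["pipelineId"]
aws("put-pipeline-definition", "--pipeline-id", new_id,
    "--pipeline-definition", "file://" + DEFINITION_FILE)
subprocess.check_call(["aws", "datapipeline", "activate-pipeline", "--pipeline-id", new_id])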

What is the difference between deploy and redeploy in SAS DI Studio?

I'm guessing that it might just retain the metadata ID (redeploy) as opposed to generating a new one (deploy); is that the only difference, though?
That is the only difference, but it is very important. You should always redeploy jobs that a flow depends on. If you deploy a job that was already added to a flow, the flow will be damaged.
I disagree with the previous answer by barjey that the flow is damaged. It is not damaged; it's just that a new metadata ID is created, in the form of a new .sas file for the same job (like JOB_NAME_00000.sas), and deployed.
This adds a lot of confusion, and many versions of the same job float around, which is incorrect. That's the reason a job is always re-deployed: so that the previous version of the code is overwritten and the new changes are reflected in the flow.
You redeploy a job to incorporate environmental changes; redeploying automatically identifies the environment to which you have deployed the job. These changes are then reflected at the back end, where the job is actually saved (the job that is scheduled in a flow).