How to automatically stop SageMaker notebook instances when they are idle? - amazon-web-services

I have been looking for a script that automatically stops SageMaker notebook instances that someone forgot to stop or that are sitting idle. The few scripts I found don't work very well (e.g. link: it only checks whether an .ipynb file is live, and I'm not using .ipynb files; others rely on the last-updated timestamp, which never changes until you stop or start the instance).
Is there a resource or script you can recommend?

You can use the following script to find idle instances. You can modify the script to stop the instance if idle for more than 5 minutes or have a cron job to stop the instance.
import boto3
from datetime import datetime, timezone

idle_threshold_seconds = 5 * 60  # 5 minutes

sm_client = boto3.client('sagemaker')
response = sm_client.list_notebook_instances()
now = datetime.now(timezone.utc)
for item in response['NotebookInstances']:
    # LastModifiedTime is a timezone-aware datetime; measure how long ago it was
    idle_seconds = (now - item['LastModifiedTime']).total_seconds()
    print(idle_seconds / 60)
    if idle_seconds > idle_threshold_seconds:
        print('Notebook {0} has been idle for more than {1} minutes'.format(item['NotebookInstanceName'], idle_threshold_seconds / 60))
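To go from reporting to actually stopping instances, a minimal sketch along the same lines could look like the following. It keeps the answer's LastModifiedTime heuristic (which, as the question points out, may not reflect real activity) and only looks at instances that are InService. The function names find_idle and stop_idle_notebooks are made up for illustration, and only the first page of list_notebook_instances results is handled.

```python
from datetime import datetime, timezone


def find_idle(instances, now, threshold_seconds):
    """Return names of instances whose LastModifiedTime is older than the threshold."""
    idle = []
    for item in instances:
        age_seconds = (now - item['LastModifiedTime']).total_seconds()
        if age_seconds > threshold_seconds:
            idle.append(item['NotebookInstanceName'])
    return idle


def stop_idle_notebooks(threshold_seconds=5 * 60):
    """Stop every running notebook instance that looks idle (needs AWS credentials)."""
    import boto3  # imported here so find_idle can be used without boto3 installed

    sm_client = boto3.client('sagemaker')
    # Only instances that are currently running can be stopped
    response = sm_client.list_notebook_instances(StatusEquals='InService')
    now = datetime.now(timezone.utc)
    for name in find_idle(response['NotebookInstances'], now, threshold_seconds):
        sm_client.stop_notebook_instance(NotebookInstanceName=name)
        print('Stopping notebook {0}'.format(name))
```

Calling stop_idle_notebooks() from a cron job or another scheduler would cover the cron variant mentioned in the answer.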

Related

GCP - Initiate a shutdown of an instance a certain time after it started (for example, 3 hours after start)

I have instances in GCP.
I can schedule a time to start and stop using the scheduler.
But I don't want a specific time of day; I want a specific amount of time after the instance was started.
For example: stop the instance after it has been up and running for 8 hours.
You can add the contents of a startup script directly to a VM when you create the VM.
You can also pass a Linux startup script directly to an existing VM:
In the Cloud Console, go to the VM instances page and click the instance you want to pass the startup script to.
Click Edit.
Under Automation, specify the following:
#! /bin/bash
shutdown -P +60
-P instructs the system to shut down and then power off.
The time argument specifies when to perform the shutdown. Here +60 means 60 minutes after the startup script runs, so use +480 for the 8 hours in the question.
The time can be formatted in different ways:
First, it can be an absolute time in the format hh:mm, where hh is the hour (1 or 2 digits, from 0 to 23) and mm is the minute of the hour (in two digits).
Second, it can be in the format +m, where m is the number of minutes to wait.
Also, the word now is the same as specifying +0; it shuts the system down immediately.
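Since the +m form takes minutes, turning "N hours after start" into a startup script is a small conversion. The helper below is only an illustrative sketch; build_startup_script is a made-up name, and it simply produces the same two-line script shown above.

```python
def build_startup_script(hours_up):
    """Return a startup script that powers the VM off hours_up hours after it runs."""
    minutes = int(round(hours_up * 60))
    if minutes < 0:
        raise ValueError('hours_up must not be negative')
    # shutdown's +m form counts minutes from the moment the command runs
    return '#! /bin/bash\nshutdown -P +{0}\n'.format(minutes)


print(build_startup_script(8))
```

Startup scripts run on each boot, so the countdown should restart every time the instance is started.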

MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow env in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want and dump the filtered results to S3 bucket B. It needs to read every minute since the data is coming in every minute. Every run processes about 200MB of json data.
My initial setting used env class mw1.small with 10 worker machines. If I run the task only once in this setting, each run takes about 8 minutes to finish. But when I start the schedule to run every minute, most runs cannot finish, they start to take much longer (around 18 minutes), and they display the error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried expanding the env class to mw1.large with 15 workers. More jobs were able to complete before the error showed up, but it still could not keep up with data arriving every minute. The Negsignal.SIGKILL error would still show up before even reaching the maximum number of worker machines.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.
I found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This makes each worker machine run one task at a time, which prevents multiple tasks from sharing a worker and so saves memory and reduces runtime.
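If you prefer to apply the same two options from a script rather than the console, a sketch with boto3 might look like this. The helper names are made up and the environment name is whatever yours is called; it is also worth checking how update_environment treats configuration options that are already set on the environment before running it.

```python
def single_task_worker_options():
    """The two Airflow configuration options from the answer, as strings."""
    return {
        'celery.sync_parallelism': '1',
        'celery.worker_autoscale': '1,1',
    }


def apply_options(environment_name):
    """Push the options to an existing MWAA environment (needs AWS credentials)."""
    import boto3  # imported here so the options helper works without boto3

    mwaa = boto3.client('mwaa')
    return mwaa.update_environment(
        Name=environment_name,
        AirflowConfigurationOptions=single_task_worker_options(),
    )
```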

Running Batch python processes on Google Cloud

I have a couple of Python scripts that I would like to schedule to run once a month on Google Cloud. The scripts basically trigger DLP jobs and extract Data Catalog information to a file in GCS. These batch workloads would hardly run for 30 minutes, so I don't need services like GKE or Composer, which are very resource intensive.
I would like to know the best options available in GCP for these batch workloads. Looking at some blog posts, I found the article below, which uses Cloud Scheduler -> Pub/Sub -> Cloud Functions -> create VM (using a startup script).
https://medium.com/google-cloud/running-a-serverless-batch-workload-on-gcp-with-cloud-scheduler-cloud-functions-and-compute-86c2bd573f25
I have the following question about this design:
1) How long does the Cloud Function run as it starts the VM? I know Cloud Functions have a timeout of 9 minutes. What happens if the VM takes longer than 9 minutes to process the startup script?
Any other design ideas are much appreciated.
Thanks
I'm the author of that medium post.
1) How long does the Cloud Function run as it starts the VM?
You can change the Cloud Function code so that it does not wait for the response. It uses Node.js, so you simply don't wait for the Promise.
Also, in that solution the Cloud Function's only job is to trigger the VM creation.
.createVM(vmName, vmConfig)
  .then(data => {
    // Operation pending.
    const vm = data[0];
    const operation = data[1];
    console.log(`VM being created: ${vm.id}`);
    console.log(`Operation info: ${operation.id}`);
    // At this point the VM is still in the pending state. You can finish
    // your logic here and not wait for VM creation to finish.
    // You can even skip this step if you don't need the VM ID logged for
    // debugging purposes.
    // Returning operation.promise() makes the chain wait for the operation.
    return operation.promise();
  })
  .then(() => {
    const message = 'VM created with success, Cloud Function finished execution.';
    console.log(message);
  });
Using that same code, in the worst case (if it takes more than 9 minutes), the Cloud Function will time out but the VM creation will continue.
The design that I suggest uses Cloud Scheduler + Pub/Sub + Compute Engine.
This design in a few words:
- your Compute Engine instance runs a utility that listens to a Cloud Pub/Sub topic
- this utility executes when it receives a new event from the topic and runs a cron job on the instance
- Cloud Scheduler pushes messages to the Pub/Sub topic at a time that you specify in your job.
By using Pub/Sub to decouple the task-scheduling logic from the logic
running the commands on Compute Engine, you can update your cron
scripts as needed, without updating the Cloud Scheduler configuration.
You can also change your task schedule without updating the utility
service on your Compute Engine instances
You can find a full explanation of this design and sample code by following this and this.
Let me know if anything is not obvious.
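As a rough idea of what the listening utility on the Compute Engine instance could look like, here is a minimal sketch using the google-cloud-pubsub client. It is not the sample code linked above: the TASKS mapping, the script path, and the function names are all hypothetical, and it assumes the Cloud Scheduler job publishes the task name as the message body.

```python
import subprocess

# Hypothetical mapping from message payload to the command to run on the VM.
TASKS = {
    'monthly_dlp': ['python3', '/opt/jobs/trigger_dlp.py'],
}


def handle_message(message, run=subprocess.run):
    """Run the task named in the message body, then acknowledge the message."""
    task_name = message.data.decode('utf-8').strip()
    command = TASKS.get(task_name)
    if command is None:
        print('Ignoring unknown task: {0}'.format(task_name))
    else:
        run(command, check=False)
    message.ack()
    return command


def listen(project_id, subscription_id):
    """Block and dispatch incoming messages (needs GCP credentials)."""
    from google.cloud import pubsub_v1  # imported here so handle_message is testable alone

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(project_id, subscription_id)
    future = subscriber.subscribe(path, callback=handle_message)
    future.result()  # keeps the process alive while messages are handled
```

Looking commands up in a fixed mapping, rather than executing the payload directly, keeps a stray message from running arbitrary commands on the instance.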

Airflow: need advice when running a lot of instances per task

This is my 1st post on Stack and it is about Airflow. I need to implement a DAG which will :
1/ Download files from an API
2/ Upload them into Google Cloud Storage
3/ Insert them into BigQuery
The thing is that step 1 involves calling about 170 accounts. If any error is raised during the download, I want my DAG to automatically retry from the failed step. Therefore I implemented a loop around my tasks, such as:
dag = DAG('my_dag', default_args=DEFAULT_ARGS)
for account in accounts:
    t1 = PythonOperator(task_id='download_file_' + account['id'],
                        python_callable=download_files(account),
                        dag=my_dag)
    t2 = FileToGoogleCloudStorageOperator(task_id='upload_file_' + account['id'],
                                          google_cloud_storage_conn_id = 'gcs_my_conn',
                                          src = 'file_' + account['id'] + '.json',
                                          bucket = 'my_bucket',
                                          dag=my_dag)
    t3 = GoogleCloudStorageToBigQueryOperator(task_id='insert_bq',
                                              bucket = 'my_bucket',
                                              google_cloud_storage_conn_id = 'gcs_my_conn',
                                              bigquery_conn_id = 'bq_my_conn',
                                              src = 'file_' + account['id'],
                                              destination_project_dataset_table = 'my-project:my-dataset.my-table',
                                              source_format = 'NEWLINE_DELIMITED_JSON',
                                              dag=my_dag)
    t2.set_upstream(t1)
    t3.set_upstream(t2)
So at the UI level, I have about 170 instances of each task displayed. When I run the DAG manually, Airflow does nothing as far as I can see. The DAG doesn't initialize or queue any task instances. I guess this is due to the number of instances involved, but I don't know how to work around it.
How should I manage so many task instances ?
Thanks,
Alex
How are you running airflow currently? Are you sure the airflow scheduler is running?
You can also run airflow list_dags to ensure the dag can be compiled. If you are running airflow using Celery you should take care that your dag shows up using list_dags on all nodes running airflow.
Alex, it would be easier to post here. I saw you have DEFAULT_ARGS with retries, which applies at the DAG level; you can also set retries at the task level. The setting lives in BaseOperator, and since every operator inherits from BaseOperator you can use it on any task. You can find more detail here: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/python_operator.py and https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L1864. If you check BaseOperator in models, it has retries and retry_delay, so you can do something like this:
from datetime import timedelta

t1 = PythonOperator(task_id='download_file_' + account['id'],
                    python_callable=download_files(account),
                    # retry this task up to 3 times, waiting 5 minutes between tries
                    retries=3,
                    retry_delay=timedelta(seconds=300),
                    dag=my_dag)
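One more thing that may be worth checking in the question's loop: python_callable expects something callable, whereas download_files(account) calls the function while the DAG file is being parsed. Unless download_files is a factory that returns a function, that would run all 170 downloads at parse time instead of inside the tasks. The Airflow-free sketch below only illustrates the difference between calling now and deferring the call; the sample accounts and the body of download_files are made up.

```python
from functools import partial

calls = []  # records every time download_files actually runs


def download_files(account):
    """Stand-in for the real download; returns the file name it would write."""
    calls.append(account['id'])
    return 'file_' + account['id'] + '.json'


accounts = [{'id': 'a1'}, {'id': 'a2'}]  # made-up sample accounts

# Deferred: building this dict does not download anything yet
deferred = {
    'download_file_' + account['id']: partial(download_files, account)
    for account in accounts
}
calls_after_definition = list(calls)

# Later, an executor calls each deferred function, one per task
results = [run() for run in deferred.values()]
print(calls_after_definition, results)
```

With PythonOperator the deferred form would be python_callable=download_files together with op_args=[account]. Note also that task_id='insert_bq' is the same for every account in the question's loop, while task ids need to be unique within a DAG.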

Amazon Elastic Map Reduce - Keep Server alive?

I am testing jobs in EMR and each and every test takes a lot of time to start up. Is there a way to keep the server/master node alive in Amazon EMR? I know this can be done with the API. But, I wanted to know if this can be done in the aws console?
You cannot do this from the AWS console. To quote the developer guide
The Amazon Elastic MapReduce tab in the AWS Management Console does not support adding steps to a job flow.
You can only do this via the CLI and API, by creating a job flow, then adding steps to it.
$ ./elastic-mapreduce --create --alive --stream
You can't do this with the web console - but through the API and programming tools, you will be able to add multiple steps to a long-running job, which is what I do. That way you can fire off jobs one after the other on the same long-running cluster, without having to re-create a new one each time.
If you are familiar with Python, I highly recommend the Boto library. The other AWS API tools let you do this as well.
If you follow the Boto EMR tutorial, you'll find some examples:
Just to give you an idea, this is what I do (with streaming jobs):
import sys
import time
import boto
from boto.emr.step import StreamingStep

# Connect to EMR
conn = boto.connect_emr()
# Start long-running job, don't forget keep_alive setting
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         keep_alive=True)
# Create your streaming job
step = StreamingStep(...)
# Add the step to the job
conn.add_jobflow_steps(jobid, [step])
# Wait till it's complete
while True:
    state = conn.describe_jobflow(jobid).steps[-1].state
    if state == "COMPLETED":
        break
    if state in ("FAILED", "TERMINATED", "CANCELLED"):
        sys.stderr.write("EMR job failed! Message = %s!\n" % state)
        sys.exit(1)
    time.sleep(60)
# Create your next job here and add it to the EMR cluster
step = StreamingStep(...)
conn.add_jobflow_steps(jobid, [step])
# Repeat :)
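The snippet above uses the older boto library. With boto3 the same pattern would, as far as I can tell, be a cluster started via run_job_flow with Instances={'KeepJobFlowAliveWhenNoSteps': True}, followed by add_job_flow_steps and polling describe_step. The sketch below covers only the add-and-wait part; the function names are made up, and the step definition is left to the caller.

```python
import time

# Step states treated as failures in this sketch
FAILED_STATES = {'FAILED', 'CANCELLED', 'INTERRUPTED'}


def wait_for_step(emr, cluster_id, step_id, poll_seconds=60, sleep=time.sleep):
    """Poll a step until it completes; raise if it fails or is cancelled."""
    while True:
        response = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response['Step']['Status']['State']
        if state == 'COMPLETED':
            return state
        if state in FAILED_STATES:
            raise RuntimeError('EMR step {0} ended in state {1}'.format(step_id, state))
        sleep(poll_seconds)


def add_step_and_wait(cluster_id, step):
    """Add one step to a long-running cluster and wait for it (needs AWS credentials)."""
    import boto3  # imported here so wait_for_step can be exercised with a fake client

    emr = boto3.client('emr')
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return wait_for_step(emr, cluster_id, response['StepIds'][0])
```

Calling add_step_and_wait repeatedly with the same cluster id runs jobs one after the other on the same cluster, which is the point of keeping it alive.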
To keep the machine alive, start an interactive Pig session. The machine then won't shut down, and you can execute your map/reduce logic from the command line using:
cat infile.txt | yourMapper | sort | yourReducer > outfile.txt