Automatically "stop" Sagemaker notebook instance after inactivity? - amazon-web-services

I have a Sagemaker Jupyter notebook instance that I keep leaving online overnight by mistake, unnecessarily costing money...
Is there any way to automatically stop the Sagemaker notebook instance when there is no activity for, say, 1 hour? Or would I have to make a custom script?

You can use Lifecycle configurations to set up an automatic job that will stop your instance after inactivity.
There's a GitHub repository with samples that you can use. In the repository, there's an auto-stop-idle script which will shut down your instance once it has been idle for more than 1 hour.
What you need to do is create a Lifecycle configuration using the script and associate the configuration with the instance. You can do this when you create or edit a Notebook instance, or from the CLI as sketched below.
If you think 1 hour is too long, you can tweak the script. This line has the value.
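If you prefer to script the setup rather than click through the console, a minimal sketch with the AWS CLI might look like this (the configuration and instance names are placeholders, and on-start.sh stands for the auto-stop-idle script from the samples repository):
#!/bin/bash
# Sketch: register the auto-stop-idle script as a lifecycle configuration
# and attach it to an existing notebook instance (placeholder names).
# The on-start script must be passed base64-encoded.
aws sagemaker create-notebook-instance-lifecycle-config \
  --notebook-instance-lifecycle-config-name auto-stop-idle \
  --on-start Content="$(base64 -w0 on-start.sh)"
# Attach the configuration; the notebook instance must be stopped for the update.
aws sagemaker update-notebook-instance \
  --notebook-instance-name my-notebook \
  --lifecycle-config-name auto-stop-idle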

You could also use CloudWatch + Lambda to monitor SageMaker and stop the instance when its utilization drops below a minimum. Here is a list of what's available in CloudWatch for SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html.
For example, you could set a CloudWatch alarm that triggers when CPU utilization stays below ~5% for 30 minutes, and have that alarm invoke a Lambda which shuts down the notebook.
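As a rough sketch of that idea (the metric namespace, dimension, and alarm-action ARN below are placeholders to adapt from the CloudWatch documentation above, not values confirmed by it):
# Sketch only: alarm when average CPU stays below 5% for 30 minutes.
# Namespace and dimension are placeholders; check the linked doc for the real ones.
aws cloudwatch put-metric-alarm \
  --alarm-name my-notebook-idle \
  --namespace "<namespace from the CloudWatch doc>" \
  --metric-name CPUUtilization \
  --dimensions Name=NotebookInstanceName,Value=my-notebook \
  --statistic Average --period 1800 --evaluation-periods 1 \
  --threshold 5 --comparison-operator LessThanThreshold \
  --alarm-actions "<ARN of the SNS topic that triggers your Lambda>"
# The Lambda itself only needs to call the StopNotebookInstance API;
# the CLI equivalent (handy for testing the idea by hand) is:
aws sagemaker stop-notebook-instance --notebook-instance-name my-notebook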

After we burned quite a lot of money by forgetting to turn off these machines, I decided to create a script. It's based on AWS' script, but it provides an explanation of why the machine was or was not killed. It's pretty lightweight because it does not use any additional infrastructure like Lambda.
Here is the script and the guide on installing it! It's just a simple lifecycle configuration!

Unfortunately, automatically stopping the Notebook Instance when there is no activity is not possible in SageMaker today. To avoid leaving them running overnight, you can write a cron job that checks for running Notebook Instances at night and stops them if needed.
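A minimal sketch of such a cron job using the AWS CLI (assuming the credentials available to cron are allowed to list and stop notebook instances; the schedule and script path are just examples):
#!/bin/bash
# Run nightly, e.g. via: 0 22 * * * /path/to/stop-notebooks.sh
set -euo pipefail
# Find every notebook instance that is currently running and stop it.
for name in $(aws sagemaker list-notebook-instances \
                --status-equals InService \
                --query 'NotebookInstances[].NotebookInstanceName' \
                --output text); do
  echo "Stopping notebook instance: $name"
  aws sagemaker stop-notebook-instance --notebook-instance-name "$name"
done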

SageMaker Studio Notebook Kernels can be terminated by attaching the following lifecycle configuration script to the domain.
#!/bin/bash
# This script installs the idle notebook auto-checker server extension to SageMaker Studio
# The original extension has a lab extension part where users can set the idle timeout via a Jupyter Lab widget.
# In this version the script installs the server side of the extension only. The idle timeout
# can be set via a command-line script which will also be created by this script and placed into the
# user's home folder
#
# Installing the server side extension does not require Internet connection (as all the dependencies are stored in the
# install tarball) and can be done via VPCOnly mode.
set -eux
# timeout in minutes
export TIMEOUT_IN_MINS=120
# Should already be running in user home directory, but just to check:
cd /home/sagemaker-user
# By working in a directory starting with ".", we won't clutter up users' Jupyter file tree views
mkdir -p .auto-shutdown
# Create the command-line script for setting the idle timeout
cat > .auto-shutdown/set-time-interval.sh << EOF
#!/opt/conda/bin/python
import json
import requests
TIMEOUT=${TIMEOUT_IN_MINS}
session = requests.Session()
# Getting the xsrf token first from Jupyter Server
response = session.get("http://localhost:8888/jupyter/default/tree")
# calls the idle_checker extension's interface to set the timeout value
response = session.post("http://localhost:8888/jupyter/default/sagemaker-studio-autoshutdown/idle_checker",
                        json={"idle_time": TIMEOUT, "keep_terminals": False},
                        params={"_xsrf": response.headers['Set-Cookie'].split(";")[0].split("=")[1]})
if response.status_code == 200:
    print("Succeeded, idle timeout set to {} minutes".format(TIMEOUT))
else:
    print("Error!")
    print(response.status_code)
EOF
chmod +x .auto-shutdown/set-time-interval.sh
# "wget" is not part of the base Jupyter Server image, you need to install it first if needed to download the tarball
sudo yum install -y wget
# You can download the tarball from GitHub or alternatively, if you're using VPCOnly mode, you can host on S3
wget -O .auto-shutdown/extension.tar.gz https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension/raw/main/sagemaker_studio_autoshutdown-0.1.5.tar.gz
# Or instead, could serve the tarball from an S3 bucket in which case "wget" would not be needed:
# aws s3 --endpoint-url [S3 Interface Endpoint] cp s3://[tarball location] .auto-shutdown/extension.tar.gz
# Installs the extension
cd .auto-shutdown
tar xzf extension.tar.gz
cd sagemaker_studio_autoshutdown-0.1.5
# Activate studio environment just for installing extension
export AWS_SAGEMAKER_JUPYTERSERVER_IMAGE="${AWS_SAGEMAKER_JUPYTERSERVER_IMAGE:-'jupyter-server'}"
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
eval "$(conda shell.bash hook)"
conda activate studio
fi;
pip install --no-dependencies --no-build-isolation -e .
jupyter serverextension enable --py sagemaker_studio_autoshutdown
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
conda deactivate
fi;
# Restarts the jupyter server
nohup supervisorctl -c /etc/supervisor/conf.d/supervisord.conf restart jupyterlabserver
# Waiting for 30 seconds to make sure the Jupyter Server is up and running
sleep 30
# Calling the script to set the idle timeout and activate the extension
/home/sagemaker-user/.auto-shutdown/set-time-interval.sh
Resources
https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html
https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples/blob/main/scripts/install-autoshutdown-server-extension/on-jupyter-server-start.sh

Related

AWS EC2 User Data not working (Tried Installing and starting httpd via User Data)

The following is my EC2 User Data:
#!/bin/bash
sudo yum update -y
sudo yum install -y httpd
sudo systemctl start httpd
sudo systemctl enable httpd
In the Security Group, SSH (port 22) and HTTP (port 80) are open.
Yet when I try accessing http://public_ip_of_instance, the Apache HTTP page doesn't load.
Also, Apache is not installed on the instance when I check with sudo systemctl status httpd.
I then tried it manually on the EC2 server and it worked. Then I removed it through yum remove, as I wanted to see whether User Data works.
I stopped the instance and started it again, but the User Data script doesn't seem to work: I am unable to access the HTTP page through the browser, and httpd is not installed on the instance.
Where is the actual issue? I remember this same thing working on another instance some months back.
Your user data is correct. Whatever is happening with your website is not due to the user data code that you provided.
There could be many reasons it does not work. The public IP of the instance may have changed, as always happens when you stop/start an instance, or the instance may have pre-existing software that clashes with httpd.
Here's some general advice on running UserData once or on every boot.
Short answer: as John mentioned in the comments, EC2 instances only run the UserData (aka bootstrap) script once, on initialization.
The user data Bash/PowerShell script is Infrastructure-as-Code: you deploy the script and it installs and configures the machine.
This causes confusion for everyone starting out with AWS, but when you think about it, it doesn't make sense to re-run the UserData script on every boot once the machine has already been configured.
What people often do instead is make "Golden Images" (aka Amazon Machine Images, AMIs) of pre-configured EC2s, typically for machines that take a long time to install/configure. The beauty of this is that you can set up Auto Scaling groups to use the images, which avoids any long installation during a scale-up event.
Pro tip: when developing a UserData script, run through and test it manually on the EC2 instance. Trust me, it's far quicker than troubleshooting unattended EC2 UserData errors.
Long answer: you can run the UserData on each boot of the machine using a MIME multi-part file. A MIME multi-part file allows your script to override how frequently user data is run by the cloud-init package.
https://aws.amazon.com/premiumsupport/knowledge-center/execute-user-data-ec2/
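For reference, a sketch of such a MIME multi-part user data document, modelled on the linked knowledge-center article (the shell part is just an example payload); the cloud-config section tells cloud-init to run the scripts-user module on every boot:
Content-Type: multipart/mixed; boundary="//"
MIME-Version: 1.0

--//
Content-Type: text/cloud-config; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cloud-config.txt"

#cloud-config
cloud_final_modules:
- [scripts-user, always]

--//
Content-Type: text/x-shellscript; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="userdata.txt"

#!/bin/bash
# Example payload: make sure httpd is installed and running after every boot
yum install -y httpd
systemctl enable --now httpd
--//--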
For all those who run into this problem, first of all check the log with the command:
sudo cat /var/log/cloud-init-output.log
If you notice connection errors to the various repositories, the reason is that you don't have an internet connection yet. If, once inside your EC2, you manage to run the update and install commands manually, then the reason they fail in the UserData is that your EC2 takes a few seconds to get an internet connection and executes the commands before it has one. To solve this problem, just add this command after #!/bin/bash:
#!/bin/bash
until ping -c1 8.8.8.8 &>/dev/null; do :; done
sudo yum update -y
...
This will prevent your EC2 from executing commands before an internet connection is established.

Google Cloud Platform: cloudshell - is there any way to "keep" gcloud init configs?

Does anyone know of a way to persist configurations done using "gcloud init" commands inside cloudshell, so they don't vanish each time you disconnect?
I figured out how to persist Python pip installs using the --user flag, for example: pip install --user pandas
But, when I create a new configuration using gcloud init, use it for a bit, close cloudshell (or cloudshell times out on me), then reconnect later, the configurations are gone.
Not a big deal, I bounce between projects/etc so it's nice to have the configs saved so I can simply run
gcloud config configurations activate config-name
Thanks...Rich Murnane
Google Cloud Shell only persists data in your $HOME directory. Commands like gcloud init modify the environment variables and store configuration files in /tmp which is deleted when the VM is restarted. The VM is terminated after being idle for 20 minutes or 60 minutes depending on which document you read.
Google Cloud Shell is a Docker container. You can modify the Docker image to customize it to fit your needs. This method will allow you to install packages, tools, etc. that are not located in your $HOME directory.
You can also store your files and configuration scripts on Google Cloud Storage. Modify .bashrc to download your cloud files and run your configuration script.
Either method will allow you to create a persistent environment.
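A minimal sketch of the .bashrc approach, assuming you keep a personal setup script in a bucket of your own (the bucket, file, configuration, and project names are placeholders):
# Added at the end of ~/.bashrc in Cloud Shell: pull a setup script from
# Cloud Storage and run it at the start of each session.
gsutil cp gs://my-config-bucket/cloudshell-setup.sh "$HOME/.cloudshell-setup.sh" 2>/dev/null \
  && bash "$HOME/.cloudshell-setup.sh"
# cloudshell-setup.sh itself might just recreate the gcloud configuration, e.g.:
#   gcloud config configurations create my-config 2>/dev/null || true
#   gcloud config configurations activate my-config
#   gcloud config set project my-project-id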
This StackOverflow answer covers in detail what gcloud init does and how to basically emulate the same thing via script or command line.
gcloud init details
This isn't exactly what I wanted, but since my account (userid) isn't changing, I'm simply going to do the command
gcloud config set project second-project-name
Good enough, thanks...Rich

Running updates on EC2s that roll back on failure of status check

I’m setting up a patch process for EC2 servers running a web application.
I need to build an automated process that installs system updates but reverts back to the last working EC2 instance if the web application fails a status check.
I’ve been trying to do this using an Automation Document in EC2 Systems Manager that performs the following steps:
Stop EC2 instance
Create AMI from instance
Launch new instance from newly created AMI
Run updates
Run status check on web application
If check fails, stop new instance and restart original instance
The Automation Document runs the first 5 steps successfully, but I can't identify how to trigger step 6. Can I do this within the Automation Document? What output would I be able to call from step 5? If it uses aws:runCommand, should the runCommand trigger a new Automation Document or another AWS tool?
I tried the following to solve this, which more or less worked:
Included an aws:runCommand action in the automation document
This ran the DocumentName "AWS-RunShellScript" with the following parameters:
Downloaded the script from s3:
sudo aws s3 cp s3://path/to/s3/script.sh /tmp/script.sh
Set the file to executable:
chmod +x /tmp/script.sh
Executed the script using variables set in, or generated by, the automation document:
bash /tmp/script.sh -o {{VAR1}} -n {{VAR2}} -i {{VAR3}} -l {{VAR4}} -w {{VAR5}}
The script included the following getopts command to set the inputted variables:
while getopts o:n:i:l:w: option
do
  case "${option}" in
    o) VAR1=${OPTARG};;
    n) VAR2=${OPTARG};;
    i) VAR3=${OPTARG};;
    l) VAR4=${OPTARG};;
    w) VAR5=${OPTARG};;
  esac
done
The bash script used the variables to run the status check, and roll back to last working instance if it failed.
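The check-and-rollback logic boiled down to a health probe against the web application followed by EC2 stop/start calls. A simplified sketch (the health-check URL, instance IDs, and argument handling are placeholders, not the exact script):
#!/bin/bash
# Simplified sketch of the status check and rollback (placeholder values).
NEW_INSTANCE_ID="$1"        # instance launched from the newly created AMI
ORIGINAL_INSTANCE_ID="$2"   # last known-good instance
HEALTH_URL="$3"             # e.g. http://<new-instance-dns>/health

# Treat the update as good only if the application answers with HTTP 200.
STATUS=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$HEALTH_URL" || echo 000)

if [ "$STATUS" != "200" ]; then
  echo "Status check failed ($STATUS); rolling back."
  aws ec2 stop-instances --instance-ids "$NEW_INSTANCE_ID"
  aws ec2 start-instances --instance-ids "$ORIGINAL_INSTANCE_ID"
else
  echo "Status check passed; keeping the updated instance."
fi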

Google cloud compute startup script ignored with no logging

I have a standard Debian 8.9 instance on google cloud compute (GCE) where my startup script is ignored.
In the custom metadata field, for startup-script, I am trying to run an Rscript (which is used for batch execution of R files), followed by a system shutdown, with the following:
#! /bin/bash
sudo /usr/bin/Rscript /home/myuser/launch_script.R
sudo shutdown -h now
Starting the instance is immediately followed by a shutdown and the Rscript is ignored. Removing the last line to shutdown causes the GCE instance to start, but the Rscript to be ignored. Running just "sudo /usr/bin/Rscript /home/myuser/launch_script.R" from the terminal results in the script being run. It has a chmod of 755, so I don't think this is a permissions issue.
In addition to this problem, I have read elsewhere that logging should happen in /var/log/, but there is nothing there. Instead, I have a bunch of log files (that only contain the start-up script and nothing else) in the root of my instance.
I got in touch with Google cloud support, who gave the following response:
The script definition is kept under /var/run/google.startup.script.
If the script does not run initially, you can force it manually with sudo google_metadata_script_runner --script-type startup (on Debian), or sudo /usr/share/google/run-startup-scripts (on Ubuntu and older images).
I'm posting this information here, because it is not in their documentation (as of August 2017). I'm not sure how helpful it is, since the google.startup.script didn't exist in my case (using the latest Debian image on GCE), but I did run the other commands.
However, I think my main issues were:
I was using autossh to connect to a remote database, and the startup-script was running before autossh. Building a 40-second delay into the script and running the script as a user (not sudo-type root) seems to have solved this problem for now (see the sketch after this list). Autossh was being run as the main user, which I think gets loaded before lower-privilege user-defined scripts get loaded.
I was using some gcloud commands from the user account which had its own authentication issues. Running gcloud auth login as the user and ensuring correct permissions on my private key solved this.
Always remember to check the messages and syslog files in /var/log for troubleshooting. This allowed me to see the order of things being loaded at system-boot.
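A sketch of that first workaround as a startup-script (the delay, user name, and paths mirror the ones described above and are placeholders for your own):
#! /bin/bash
# Give autossh / networking time to come up before the batch job starts.
sleep 40
# Run the R script as the regular user instead of root.
sudo -u myuser /usr/bin/Rscript /home/myuser/launch_script.R
sudo shutdown -h now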

AWS EMR jupyter password

I'm using EMR and wanted to use Jupyter (IPython), so I added the following bootstrap action to the cluster:
s3://elasticmapreduce.bootstrapactions/ipython-notebook/install-ipython-notebook
I performed the port tunnelling to access Jupyter from my local host and it works fine, but it is asking for a login password. I tried empty, I tried hadoop, but no luck. Does anybody know what the Jupyter password is?
I ran into this problem as well when I used the same bootstrap action. I tried adding in Args=[--password, jupyter], which I also could not get working. That was from this AWS forum:
Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",Args=[--r,--julia,--toree,--torch,--ruby,--ds-packages,--ml-packages,--python-packages,'ggplot nilearn',--port,8880,--password,jupyter,--jupyterhub,--jupyterhub-port,8001,--cached-install,--notebook-dir,s3://<your-s3-bucket>/notebooks/,--copy-samples]
What I did instead was to follow these instructions for installing Anaconda directly on the EMR instance using the CLI. If you follow the first part you should be able to get it up and running. To summarize here:
ssh into your master EMR instance using the .pem file you saved
once there, you'll want to install Anaconda using superuser privileges: sudo wget http://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh. Then bash Anaconda3-4.1.1-Linux-x86_64.sh
Make sure you're using the anaconda version of python: which python
If you're not, specify your source: source .bashrc
Now make a jupyter config file: jupyter notebook --generate-config
cd into the jupyter folder: cd ~/.jupyter/
update the config file: vi jupyter_notebook_config.py
In the config file add the following lines:
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 6789 <---pick whichever port you want
exit out of the config editor and run jupyter via: jupyter notebook
this should run a notebook with no active kernels (for now). But it will give you the token you're looking for: http://localhost:6789/?token=xxxxxx
Leave this running, and open a new terminal window. Now you'll want to tunnel to the EMR instance per this AWS blog post (make the port the same as the one you specified in the config file): ssh -o ServerAliveInterval=10 -i <<credentials.pem>> -N -L 6789:<<master-public-dns-name>>:6789 hadoop@<<master-public-dns-name>>
Opening localhost:6789 in the browser should prompt you with the jupyter page to enter your password or token. Enter the token that was generated in the above step and you should be good to go.
Hope this helps! There might be a less convoluted way, but this is what ended up working for me.
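If you would rather log in with a fixed password than a token, newer Jupyter Notebook versions let you set one yourself; whether this works depends on the notebook version your install ended up with, so treat it as an optional extra step:
# Prompts for a password and stores its hash in ~/.jupyter/jupyter_notebook_config.json
# (available in notebook 5.0+); restart "jupyter notebook" afterwards.
jupyter notebook password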