AWS Sagemaker - Install External Library and Make it Persist

I have a sagemaker instance up and running and I have a few libraries that I frequently use with it but each time I restart the instance they get wiped and I have to reinstall them. Is it possible to install my libraries to one of the anaconda environments and have the change remain?

The supported way to do this for Sagemaker notebook instances is with Lifecycle Configurations.
You can create an on-start lifecycle script that installs the required packages into the respective conda environments each time your notebook instance starts.
Please see the following blog post for more details:
https://aws.amazon.com/blogs/machine-learning/customize-your-amazon-sagemaker-notebook-instances-with-lifecycle-configurations-and-the-option-to-disable-internet-access/
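For illustration, a minimal on-start script could look roughly like the sketch below. The environment name (python3) and the package name are placeholders for whatever you actually use, and note that long installs may need to be backgrounded (e.g. with nohup) to stay within the lifecycle script's time limit.

#!/bin/bash
set -e

# Run as the ec2-user that owns the conda environments on the notebook instance
sudo -u ec2-user -i <<'EOF'
# Activate the target conda environment (placeholder name: python3)
source /home/ec2-user/anaconda3/bin/activate python3
# Install the packages you want available after every restart (placeholder package)
pip install --upgrade my-favorite-package
source /home/ec2-user/anaconda3/bin/deactivate
EOF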

When creating your model, you can specify a requirements.txt file via an environment variable.
For example:
env = {
    'SAGEMAKER_REQUIREMENTS': 'requirements.txt',  # path relative to `source_dir` below
}
sagemaker_model = TensorFlowModel(
    model_data='s3://mybucket/modelTarFile',
    role=role,
    entry_point='entry.py',
    code_location='s3://mybucket/runtime-code/',
    source_dir='src',
    env=env,
    name='model_name',
    sagemaker_session=sagemaker_session,
)
This ensures that the requirements file is installed after the Docker container is created, before any of your code runs in it.
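For reference, a plausible layout matching the settings above (file names other than entry.py and requirements.txt are illustrative) would be:

src/
    entry.py            # the entry point referenced above
    requirements.txt    # standard pip requirements file, one package per line, e.g. pandas==1.3.5

The requirements.txt is a normal pip requirements file; each listed package gets pip-installed inside the serving container before your entry point is loaded.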

Related

Airflow BashOperator - Use a different role than its pod role

I've tried to run the following commands as part of a bash script that runs in a BashOperator:
aws s3 ls s3://bucket
aws s3 cp ... ...
The script runs, but the aws cli commands return errors showing that the aws cli doesn't run with the needed permissions (those defined in the airflow-worker-node role).
Investigating the error:
I upgraded awscli in the Docker image running the pod to version 2.4.9 (I understood that older versions of awscli don't support S3 access based on permissions granted by an AWS role).
I investigated the pod running my bash script via the BashOperator:
Using k9s and the d (describe) command:
I saw that ARN_ROLE is defined correctly.
Using k9s and the s (shell) command:
I saw that the pod's environment variables are correct.
aws cli worked with the needed permissions and could access S3 as needed.
aws sts get-caller-identity reported the right role (airflow-worker-node).
Running the same commands as part of the bash script executed by the BashOperator gave different results:
Running env showed a limited set of environment variables.
aws cli returned a permission-related error.
aws sts get-caller-identity reported the EKS node role (eks-worker-node).
How can I grant the aws cli in my BashOperator bash script the needed permissions?
Reviewing the BashOperator source code, I've noticed the following code:
https://github.com/apache/airflow/blob/main/airflow/operators/bash.py
def get_env(self, context):
    """Builds the set of environment variables to be exposed for the bash command"""
    system_env = os.environ.copy()
    env = self.env
    if env is None:
        env = system_env
    else:
        if self.append_env:
            system_env.update(env)
            env = system_env
And the following documentation:
:param env: If env is not None, it must be a dict that defines the
    environment variables for the new process; these are used instead
    of inheriting the current process environment, which is the default
    behavior. (templated)
:type env: dict
:param append_env: If False (default), uses the environment variables passed in the env param
    and does not inherit the current process environment. If True, inherits the environment variables
    from the current process, and then the environment variables passed by the user will either update
    the existing inherited environment variables or be appended to them.
:type append_env: bool
If the BashOperator's env parameter is None, it copies the environment variables of the parent process.
In my case, I provided some env variables, so it didn't copy the parent process's env variables into the child process, which caused the child process (the BashOperator process) to use the default ARN role of eks-worker-node.
The simple solution is to set the following flag in BashOperator(): append_env=True, which will append all existing env variables to the env variables I added manually.
However, I found out that in the version I'm running (2.0.1) it isn't supported (it is supported in later versions).
As a temporary workaround I've added **os.environ to the BashOperator env parameter:
return BashOperator(
    task_id="copy_data_from_mcd_s3",
    env={
        "dag_input": "{{ dag_run.conf }}",
        ......
        **os.environ,
    },
    # append_env=True,  # should be supported in 2.2.0
    bash_command="utils/my_script.sh",
    dag=dag,
    retries=1,
)
This solved the problem.

Launch and install applications on EC2 instance using aws sdk

I don't know if this is possible in the first place.
The requirement is to launch an EC2 instance using the AWS SDK (I know this is possible) based on some application logic.
Then I want to install some application on the newly launched instance, let's say Docker.
Is this possible using the SDK? Or is my idea itself wrong and there is a better solution to the scenario?
Can I run a command on a running instance using the SDK?
Yes, you can install anything on an EC2 instance when it is launched by providing a script/commands in the user data section. This is also possible from the AWS SDK: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_UserData.html
You can pass a command like yum install docker in the user data, e.g.:
UserData='yum install docker'
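To make that concrete, here is a minimal boto3 sketch of launching an instance with user data that installs Docker; the AMI ID, instance type, and key pair name are placeholders, and the script assumes an Amazon Linux 2 AMI:

import boto3

# User data runs once on first boot; here it installs and starts Docker (Amazon Linux 2 assumed)
user_data = """#!/bin/bash
yum install -y docker
systemctl enable --now docker
"""

ec2 = boto3.client('ec2')
response = ec2.run_instances(
    ImageId='ami-xxxxxxxx',      # placeholder AMI ID
    InstanceType='t3.micro',     # placeholder instance type
    KeyName='my-key-pair',       # placeholder key pair name
    MinCount=1,
    MaxCount=1,
    UserData=user_data,          # boto3 base64-encodes this for you
)
print(response['Instances'][0]['InstanceId'])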
Installing applications using commands on a running instance is possible using boto3 and SSM (Systems Manager):
import boto3

ssm_client = boto3.client('ssm')
response = ssm_client.send_command(
    InstanceIds=['i-xxxxxxx'],
    DocumentName="AWS-RunShellScript",
    Parameters={'commands': ['echo "abc" > 1.txt']},
)
command_id = response['Command']['CommandId']
output = ssm_client.get_command_invocation(
    CommandId=command_id,
    InstanceId='i-xxxxxxxx',
)
print(output)
The SSM client runs commands on one or more managed instances.
You can also wrap this in a function:
def execute_commands_on_linux_instances(ssm_client, commands, instance_ids):
    response = ssm_client.send_command(
        DocumentName="AWS-RunShellScript",  # one of AWS' preconfigured documents
        Parameters={'commands': commands},
        InstanceIds=instance_ids,
    )
    return response

ssm_client = boto3.client('ssm')
commands = ['ifconfig']
instance_ids = ['i-xxxxxxxx']
execute_commands_on_linux_instances(ssm_client, commands, instance_ids)

Requirements for launching Google Cloud AI Platform Notebooks with custom docker image

On AI Platform Notebooks, the UI lets you select a custom image to launch. If you do so, you're greeted with an info box saying that the container "must follow certain technical requirements".
I assume this means they have a required entrypoint, exposed port, jupyterlab launch command, or something, but I can't find any documentation of what the requirements actually are.
I've been trying to reverse engineer it without much luck. I nmaped a standard instance and saw that it had port 8080 open, but setting my image's CMD to run Jupyter Lab on 0.0.0.0:8080 did not do the trick. When I click "Open JupyterLab" in the UI, I get a 504.
Does anyone have a link to the relevant docs, or experience with doing this in the past?
There are two ways you can create custom containers:
Building a Derivative Container
If you only need to install additional packages, you should create a Dockerfile derived from one of the standard images (for example, FROM gcr.io/deeplearning-platform-release/tf-gpu.1-13:latest), then add RUN commands to install packages using conda/pip/jupyter.
The conda base environment has already been added to the path, so no need to conda init/conda activate unless you need to setup another environment. Additional scripts/dynamic environment variables that need to be run prior to bringing up the environment can be added to /env.sh, which is sourced as part of the entrypoint.
For example, let’s say that you have a custom built TensorFlow wheel that you’d like to use in place of the built-in TensorFlow binary. If you need no additional dependencies, your Dockerfile will be similar to:
Dockerfile.example
FROM gcr.io/deeplearning-platform-release/tf-gpu:latest
RUN pip uninstall -y tensorflow-gpu && \
    pip install /path/to/local/tensorflow.whl
Then you’ll need to build and push it somewhere accessible to your GCE service account.
PROJECT="my-gcp-project"
docker build . -f Dockerfile.example -t "gcr.io/${PROJECT}/tf-custom:latest"
gcloud auth configure-docker
docker push "gcr.io/${PROJECT}/tf-custom:latest"
Building Container From Scratch
The main requirement is that the container must expose a service on port 8080.
The sidecar proxy agent that executes on the VM will ferry requests to this port only.
If using Jupyter, you should also make sure your jupyter_notebook_config.py is configured as such:
c.NotebookApp.token = ''
c.NotebookApp.password = ''
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8080
c.NotebookApp.allow_origin_pat = (
'(^https://8080-dot-[0-9]+-dot-devshell\.appspot\.com$)|'
'(^https://colab\.research\.google\.com$)|'
'((https?://)?[0-9a-z]+-dot-datalab-vm[\-0-9a-z]*.googleusercontent.com)')
c.NotebookApp.allow_remote_access = True
c.NotebookApp.disable_check_xsrf = False
c.NotebookApp.notebook_dir = '/home'
This disables notebook token-based auth (auth is instead handled through oauth login on the proxy), and allows cross origin requests from three sources: Cloud Shell web preview, colab (see this blog post), and the Cloud Notebooks service proxy. Only the third is required for the notebook service; the first two support alternate access patterns.
To complete Zain's answer, below you can find a minimal example using the official Jupyter image, inspired by this repo: https://github.com/doitintl/AI-Platform-Notebook-Using-Custom-Container
Dockerfile
FROM jupyter/base-notebook:python-3.9.5
EXPOSE 8080
ENTRYPOINT ["jupyter", "lab", "--ip", "0.0.0.0", "--allow-root", "--config", "/etc/jupyter/jupyter_notebook_config.py"]
COPY jupyter_notebook_config.py /etc/jupyter/
jupyter_notebook_config.py
(almost the same as Zain's, but with an extra pattern enabling the communication with the kernel; the communication didn't work without it)
c.NotebookApp.ip = '*'
c.NotebookApp.token = ''
c.NotebookApp.password = ''
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8080
c.NotebookApp.allow_origin_pat = '(^https://8080-dot-[0-9]+-dot-devshell\.appspot\.com$)|(^https://colab\.research\.google\.com$)|((https?://)?[0-9a-z]+-dot-datalab-vm[\-0-9a-z]*.googleusercontent.com)|((https?://)?[0-9a-z]+-dot-[\-0-9a-z]*.notebooks.googleusercontent.com)|((https?://)?[0-9a-z\-]+\.[0-9a-z\-]+\.cloudshell\.dev)|((https?://)ssh\.cloud\.google\.com/devshell)'
c.NotebookApp.allow_remote_access = True
c.NotebookApp.disable_check_xsrf = False
c.NotebookApp.notebook_dir = '/home'
c.Session.debug = True
And finally, keep this page in mind while troubleshooting: https://cloud.google.com/notebooks/docs/troubleshooting

Automatically "stop" Sagemaker notebook instance after inactivity?

I have a Sagemaker Jupyter notebook instance that I keep leaving online overnight by mistake, unnecessarily costing money...
Is there any way to automatically stop the Sagemaker notebook instance when there is no activity for say, 1 hour? Or would I have to make a custom script?
You can use Lifecycle configurations to set up an automatic job that will stop your instance after inactivity.
There's a GitHub repository which has samples that you can use. In the repository, there's an auto-stop-idle script which will shut down your instance once it's idle for more than 1 hour.
What you need to do is create a Lifecycle configuration using the script and associate the configuration with the instance. You can do this when you edit or create a notebook instance.
If you think 1 hour is too long, you can tweak the idle-time value set in the script.
You could also use CloudWatch + Lambda to monitor Sagemaker and stop when your utilization hits a minimum. Here is a list of what's available in CW for SM: https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html.
For example, you could set a CW alarm to trigger when CPU utilization falls below ~5% for 30 minutes and have that trigger a Lambda which would shut down the notebook.
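As a rough sketch of the Lambda side of that idea (the notebook instance name is assumed to be passed in via an environment variable you configure on the function; the CloudWatch alarm wiring is not shown):

import os
import boto3

sagemaker = boto3.client('sagemaker')

def lambda_handler(event, context):
    # NOTEBOOK_INSTANCE_NAME is an assumed environment variable set on the Lambda
    name = os.environ['NOTEBOOK_INSTANCE_NAME']
    status = sagemaker.describe_notebook_instance(
        NotebookInstanceName=name)['NotebookInstanceStatus']
    # Only try to stop instances that are actually running
    if status == 'InService':
        sagemaker.stop_notebook_instance(NotebookInstanceName=name)
    return {'instance': name, 'previous_status': status}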
After we burned quite a lot of money by forgetting to turn off these machines, I decided to create a script. It's based on AWS's script, but provides an explanation of why the machine was or was not killed. It's pretty lightweight because it does not use any additional infrastructure like Lambda.
Here is the script and the guide on installing it! It's just a simple lifecycle configuration!
Unfortunately, automatically stopping the Notebook Instance when there is no activity is not possible in SageMaker today. To avoid leaving them overnight, you can write a cron job to check if there's any running Notebook Instance at night and stop them if needed.
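A minimal sketch of such a cron job using boto3 (it stops every in-service notebook instance it can see; in practice you would probably filter by name or tags):

import boto3

# Run this from cron at night, e.g.:  0 22 * * *  python stop_notebooks.py
sagemaker = boto3.client('sagemaker')

paginator = sagemaker.get_paginator('list_notebook_instances')
for page in paginator.paginate(StatusEquals='InService'):
    for instance in page['NotebookInstances']:
        name = instance['NotebookInstanceName']
        print(f'Stopping notebook instance: {name}')
        sagemaker.stop_notebook_instance(NotebookInstanceName=name)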
SageMaker Studio Notebook Kernels can be terminated by attaching the following lifecycle configuration script to the domain.
#!/bin/bash
# This script installs the idle notebook auto-checker server extension to SageMaker Studio
# The original extension has a lab extension part where users can set the idle timeout via a Jupyter Lab widget.
# In this version the script installs the server side of the extension only. The idle timeout
# can be set via a command-line script which will be also created by this create and places into the
# user's home folder
#
# Installing the server side extension does not require Internet connection (as all the dependencies are stored in the
# install tarball) and can be done via VPCOnly mode.
set -eux
# timeout in minutes
export TIMEOUT_IN_MINS=120
# Should already be running in user home directory, but just to check:
cd /home/sagemaker-user
# By working in a directory starting with ".", we won't clutter up users' Jupyter file tree views
mkdir -p .auto-shutdown
# Create the command-line script for setting the idle timeout
cat > .auto-shutdown/set-time-interval.sh << EOF
#!/opt/conda/bin/python
import json
import requests
TIMEOUT=${TIMEOUT_IN_MINS}
session = requests.Session()
# Getting the xsrf token first from Jupyter Server
response = session.get("http://localhost:8888/jupyter/default/tree")
# calls the idle_checker extension's interface to set the timeout value
response = session.post("http://localhost:8888/jupyter/default/sagemaker-studio-autoshutdown/idle_checker",
                        json={"idle_time": TIMEOUT, "keep_terminals": False},
                        params={"_xsrf": response.headers['Set-Cookie'].split(";")[0].split("=")[1]})
if response.status_code == 200:
    print("Succeeded, idle timeout set to {} minutes".format(TIMEOUT))
else:
    print("Error!")
    print(response.status_code)
EOF
chmod +x .auto-shutdown/set-time-interval.sh
# "wget" is not part of the base Jupyter Server image, you need to install it first if needed to download the tarball
sudo yum install -y wget
# You can download the tarball from GitHub or alternatively, if you're using VPCOnly mode, you can host on S3
wget -O .auto-shutdown/extension.tar.gz https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension/raw/main/sagemaker_studio_autoshutdown-0.1.5.tar.gz
# Or instead, could serve the tarball from an S3 bucket in which case "wget" would not be needed:
# aws s3 --endpoint-url [S3 Interface Endpoint] cp s3://[tarball location] .auto-shutdown/extension.tar.gz
# Installs the extension
cd .auto-shutdown
tar xzf extension.tar.gz
cd sagemaker_studio_autoshutdown-0.1.5
# Activate studio environment just for installing extension
export AWS_SAGEMAKER_JUPYTERSERVER_IMAGE="${AWS_SAGEMAKER_JUPYTERSERVER_IMAGE:-'jupyter-server'}"
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
eval "$(conda shell.bash hook)"
conda activate studio
fi;
pip install --no-dependencies --no-build-isolation -e .
jupyter serverextension enable --py sagemaker_studio_autoshutdown
if [ "$AWS_SAGEMAKER_JUPYTERSERVER_IMAGE" = "jupyter-server-3" ] ; then
conda deactivate
fi;
# Restarts the jupyter server
nohup supervisorctl -c /etc/supervisor/conf.d/supervisord.conf restart jupyterlabserver
# Waiting for 30 seconds to make sure the Jupyter Server is up and running
sleep 30
# Calling the script to set the idle-timeout and active the extension
/home/sagemaker-user/.auto-shutdown/set-time-interval.sh
Resources:
https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html
https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples/blob/main/scripts/install-autoshutdown-server-extension/on-jupyter-server-start.sh

AWS EMR (4.x-5.x) classpath for custom jar step

When adding a custom jar step for an EMR cluster - how do you set the classpath to a dependent jar (required library)?
Let's say I have my jar file - myjar.jar but I need an external jar to run it - dependency.jar. Where do you configure this when creating the cluster? I am not using the command line, using the Advanced Options interface.
Thought I would post this after spending a number of hours poking around and reading outdated documentation.
The 2.x/3.x documentation that talks about setting HADOOP_CLASSPATH does not work; it is explicitly noted as not applying to 4.x and above anyway. Somewhere you need to specify a --libjars option, but specifying it in the arguments list does not work either.
For example:
Step Name: MyCustomStep
Jar Location: s3://somebucket/myjar.jar
Arguments:
myclassname
option1
option2
--libjars dependentlib.jar
Copy your required jars to /usr/lib/hadoop-mapreduce/ in a bootstrap action. No other changes are necessary. Additional info below:
This command below works for me to copy a specific JDBC driver version:
sudo aws s3 cp s3://<your bucket>/mysql-connector-java-5.1.23-bin.jar /usr/lib/hadoop-mapreduce/
I have other dependencies, so I have a bootstrap action for each jar I need copied; of course, you could put all the copies in a single bash script. Below is the .NET code I use to get a bootstrap action to run the copy script. I am using .NET SDK versions 3.3.* and launching the job with release label emr-5.2.0.
public static BootstrapActionConfig CopyEmrJarDependency(string jarName)
{
    return new BootstrapActionConfig()
    {
        Name = $"Copy jars for EMR dependency: {jarName}",
        ScriptBootstrapAction = new ScriptBootstrapActionConfig()
        {
            Path = $"s3n://{Config.AwsS3CodeBucketName}/EMR/Scripts/copy-thirdPartyJar.sh",
            Args = new List<string>()
            {
                $"s3://{Config.AwsS3CodeBucketName}/EMR/Java/lib/{jarName}",
                "/usr/lib/hadoop-mapreduce/"
            }
        }
    };
}
Note that the ScriptBootstrapActionConfig Path property uses the protocol "s3n://", but the protocol for the aws cp command should be "s3://"
My script copy-thirdPartyJar.sh contains the following:
#!/bin/bash
# $1 = location of jar
# $2 = attempted magic directory for java classpath
sudo aws s3 cp $1 $2
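If you are using boto3 rather than the .NET SDK, the same bootstrap action can be expressed roughly as below; the bucket and jar names are placeholders mirroring the helper above, and on recent EMR release labels an s3:// path for the bootstrap script works.

import boto3

def copy_emr_jar_dependency(code_bucket, jar_name):
    # Mirrors the .NET helper above: a bootstrap action that copies one jar
    # into /usr/lib/hadoop-mapreduce/ on every node.
    return {
        'Name': f'Copy jars for EMR dependency: {jar_name}',
        'ScriptBootstrapAction': {
            'Path': f's3://{code_bucket}/EMR/Scripts/copy-thirdPartyJar.sh',
            'Args': [
                f's3://{code_bucket}/EMR/Java/lib/{jar_name}',
                '/usr/lib/hadoop-mapreduce/',
            ],
        },
    }

emr = boto3.client('emr')
# Pass the action via the BootstrapActions parameter when creating the cluster, e.g.:
# emr.run_job_flow(..., BootstrapActions=[copy_emr_jar_dependency('my-code-bucket',
#                                         'mysql-connector-java-5.1.23-bin.jar')], ...)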