Add a startup and shutdown configuration to AWS SageMaker notebook instances

I have 2 separate lifecycle configurations, one which will handle installation of various packages during the startup and one which will frequently check if the instance has been idle for a long period of time. Is there a way to combine both of these configurations into one and pass that as a lifecycle configuration associated with notebooks that I create?

You can write a single script for both and add it as a lifecycle configuration (LCC) to your notebooks. For example, you can add your pip install or conda install statements to this LCC script - https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/auto-stop-idle/on-start.sh - and set it up as the LCC for your notebook.
You can also test it by simply running the combined script from the terminal, to make sure it runs without any errors, before associating it with the notebook.
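For illustration, here is a minimal sketch of what the combined on-start.sh could look like, based on the linked auto-stop-idle sample; the package names, conda environment, idle timeout, and the raw GitHub URL for autostop.py are assumptions you would adapt to your own setup:
#!/bin/bash
set -e

# Part 1: package installation (example packages only) in one of the built-in conda envs
sudo -u ec2-user -i <<'EOF'
source /home/ec2-user/anaconda3/bin/activate python3
pip install --upgrade pandas scikit-learn
source /home/ec2-user/anaconda3/bin/deactivate
EOF

# Part 2: idle-shutdown check adapted from the auto-stop-idle sample
# (stops the instance after IDLE_TIME seconds without notebook activity)
IDLE_TIME=3600
wget -O /home/ec2-user/autostop.py https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/master/scripts/auto-stop-idle/autostop.py
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python3 /home/ec2-user/autostop.py --time $IDLE_TIME --ignore-connections") | crontab -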

Related

Run a python script on an EC2 instance without uploading the python file to the instance

I currently have an EC2 instance running with directories of tenants and files in those directories. I have to run a Python script that goes into those directories, finds out how much memory each tenant is using and how many files each tenant has, and sends that information to a Grafana dashboard. I have completed the Python script, and it will go into each tenant, calculate what I need, and send that information to the Grafana dashboard if I manually upload the script to the instance and run it from the command line.
The goal is to automate this process so the script will run every 15 minutes without ever having to upload the python script to the instance. There is a lot of red tape and I am unable to make the script a part of the AMI when the instance is launched, and I haven't found any examples of people trying to do this before.
First of all, is what I am trying to do even possible? Ideally I'd like to run the script from a Lambda, because that would make it very easy to schedule every 15 minutes and my dependencies for the Python script would be easy to put into place. Suggestions have been brought up about using CodeDeploy, but I don't know enough about it to know how that would help.
I have created a Python script that works and will run on an EC2 instance if it is uploaded and run from the command line, but I haven't been able to run the script "remotely" as I would like to.
If you are doing specific tasks within the EC2 instance's file system or memory, running the script as a Lambda is not going to work, since the Lambda runs in its own environment rather than on your instance. Instead, look into using SSM (AWS Systems Manager) Run Command to run a script remotely on an EC2 instance.
https://aws.amazon.com/getting-started/hands-on/remotely-run-commands-ec2-instance-systems-manager/
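As a rough sketch of the SSM approach (the instance ID, bucket, and script name are placeholders), you could keep the script in S3, pull it onto the instance, and run it with the AWS-RunShellScript document; an EventBridge rule or an SSM State Manager association can then trigger this every 15 minutes:
# run the script on the instance via SSM Run Command; this assumes the SSM agent is running
# and the instance profile allows Systems Manager plus S3 read access
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --instance-ids "i-0123456789abcdef0" \
  --parameters 'commands=["aws s3 cp s3://my-bucket/tenant_report.py /tmp/tenant_report.py","python3 /tmp/tenant_report.py"]'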

Make custom kernels available to SageMaker Studio notebooks

On starting the SageMaker Studio server, I can only see a set of predefined kernels when I select a kernel for any notebook.
I create conda environments and persist them between sessions by pointing .condarc to a custom miniconda directory stored on EFS.
I want all notebooks to have access to environments stored in the custom miniconda directory. I can do that on the system terminal but can't seem to find a way to make the kernels available to notebooks.
I am aware of Lifecycle Configurations, but those seem to work only with notebook instances rather than SageMaker Studio.
Desired outcomes
Ideally, custom kernels would be persistently available to notebooks, but if that isn't feasible or requires a custom Docker image, I am happy to run a script manually every time I start the server.
What I have tried so far:
I ran the following, which is a tweaked version of the start.sh script meant for a Lifecycle Configuration.
#!/bin/bash
set -e
sudo -u sagemaker-user -i <<'EOF'
unset SUDO_UID
WORKING_DIR=/home/sagemaker-user/.SageMaker/custom-miniconda/
source "$WORKING_DIR/miniconda/bin/activate"
for env in $WORKING_DIR/miniconda/envs/*; do
    BASENAME=$(basename "$env")
    source activate "$BASENAME"
    python -m ipykernel install --user --name "$BASENAME" --display-name "$BASENAME"
done
EOF
That didn't work and I couldn't access the kernels from the notebooks.
If you need a persistent custom kernel in SageMaker Studio, you can create an ECR repository and build a Docker image with your custom environment configuration. This image can then be attached to SageMaker Studio notebooks. Reference link!
SageMaker Studio now also supports lifecycle configurations. Reference link!
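As a sketch of the lifecycle-configuration route (the script name, LCC name, and app type are assumptions), a kernel-registration script similar to the one in the question can be registered as a Studio lifecycle configuration and then attached to the domain or user profile:
# base64-encode the kernel-registration script and create a Studio lifecycle config
LCC_CONTENT=$(base64 -w0 register-conda-kernels.sh)
aws sagemaker create-studio-lifecycle-config \
  --studio-lifecycle-config-name register-conda-kernels \
  --studio-lifecycle-config-app-type KernelGateway \
  --studio-lifecycle-config-content "$LCC_CONTENT"
The configuration only takes effect once it is attached (for example via update-domain or update-user-profile) and a new app is started.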

Can you load standard zeppelin interpreter settings from S3?

Our company is building up a suite of common internal Spark functions and jobs, and I'd like to make sure that our data scientists have access to all of these when they prototype in Zeppelin.
Ideally, I'd like a way for them to start up a Zeppelin notebook on AWS EMR and have the dependency jar we build automatically loaded onto it, without them having to type in the Maven information manually every time (private repo location/credentials, package info, etc.).
Right now we have the dependency jar loaded on S3, and with some work we could get a private maven repository to host it on.
I see that ZEPPELIN_INTERPRETER_DIR saves off interpreter settings, but I don't think it can load them from a common default location (like S3, or something).
Is there a way to tell Zeppelin on an EMR cluster to load its interpreter settings from a common location? I can't be the first person to want this.
Other thoughts I've had but have not tried yet:
Have a script that uses the AWS command-line tools to start an EMR cluster with all the necessary settings pre-made for you. (It could also upload the .jar dependency if we can't get Maven to work.)
Use an infrastructure-as-code framework to start up the clusters with the required settings.
I don't believe it's possible to tell EMR to load settings from a common location. The first thought you included is the way to go, in my opinion: you would aws emr create ..., and that create would include a shell script step that replaces /etc/zeppelin/conf.dist/interpreter.json by downloading the interpreter.json of interest from S3, and then hard-restarts Zeppelin (sudo stop zeppelin; sudo start zeppelin).
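A minimal sketch of such a step script, where the bucket name is a placeholder and the path and restart commands follow the answer above (on newer EMR releases Zeppelin is managed by systemd, so the restart command may differ):
#!/bin/bash
# pull the shared interpreter settings from S3 and hard-restart Zeppelin
sudo aws s3 cp s3://my-bucket/zeppelin/interpreter.json /etc/zeppelin/conf.dist/interpreter.json
sudo stop zeppelin
sudo start zeppelin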

Spark standalone mode on AWS EMR

I'm able to run Spark on AWS EMR without much trouble following the documentation, but from what I see it always uses YARN instead of the standalone manager. Is there any way to easily use standalone mode instead of YARN? I don't really feel like hacking the bootstrap scripts to turn off YARN and deploy the Spark master/workers myself.
I'm running into a weird YARN-related bug and I was hoping it won't happen with the standalone manager.
As far as I know, there is no way to run in standalone mode on EMR unless you go back to the old ami-versions instead of using the emr-release-label. The old ami-versions will, however, cause other problems with newer versions of Spark, so I wouldn't go that way.
What you can do is launch ordinary EC2 instances with Spark instead of using EMR. If you have a local Spark installation, go to the ec2 folder and use spark-ec2 to launch the cluster, like this:
./spark-ec2 --copy-aws-credentials --key-pair=MY_KEY --identity-file=MY_PEM_FILE.pem --region=MY_PREFERED_REGION --instance-type=INSTANCE_TYPE --slaves=NUMBER_OF_SLAVES --hadoop-major-version=2 --ganglia launch NAME_OF_JOB
I suspect you have jar files that are needed, so they have to be copied onto the cluster (copy them to the master first, then ssh to the master and copy them onto the slaves from there; ./spark-ec2/copy-dir on the master will copy a directory onto all slaves). Then restart Spark:
./spark/sbin/stop-master.sh
./spark/sbin/stop-slaves.sh
./spark/sbin/start-master.sh
./spark/sbin/start-slaves.sh
and you are ready to launch Spark in standalone mode:
./spark/bin/spark-submit --deploy-mode client ...
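For example, pointing spark-submit at the standalone master explicitly (the master hostname, class name, and jar are placeholders; 7077 is the default standalone master port):
./spark/bin/spark-submit \
  --master spark://MASTER_HOSTNAME:7077 \
  --deploy-mode client \
  --class com.example.YourJob \
  your-job.jar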

WebHCat on Amazon's EMR?

Is it possible or advisable to run WebHCat on an Amazon Elastic MapReduce cluster?
I'm new to this technology and I was wondering if it was possible to use WebHCat as a REST interface to run Hive queries. The cluster in question is running Hive.
I wasn't able to get it working, but WebHCat is actually installed by default on Amazon's EMR instances.
To get it running you have to do the following:
chmod u+x /home/hadoop/hive/hcatalog/bin/hcat
chmod u+x /home/hadoop/hive/hcatalog/sbin/webhcat_server.sh
export TEMPLETON_HOME=/home/hadoop/.versions/hive-0.11.0/hcatalog/
export HCAT_PREFIX=/home/hadoop/.versions/hive-0.11.0/hcatalog/
/home/hadoop/hive/hcatalog/sbin/webhcat_server.sh start
You can then confirm that it's running on port 50111 using curl:
curl -i http://localhost:50111/templeton/v1/status
To hit port 50111 from other machines, you have to open the port up in the EC2 EMR security group.
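For example, something along these lines opens the port to a given range (the security group ID and CIDR are placeholders; scope the CIDR to your own network rather than opening it to the world):
# allow inbound access to WebHCat's port on the cluster's master security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 50111 \
  --cidr 203.0.113.0/24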
You then have to configure the users you are going to "proxy" when you run queries in HCatalog. I didn't actually save this configuration, but it is outlined in the WebHCat documentation. I wish they had some concrete examples there, but basically I ended up configuring the local 'hadoop' user as the one that runs the queries. Not the most secure thing to do, I'm sure, but I was just trying to get it up and running.
Attempting a query then gave me this error:
{"error":"Server IPC version 9 cannot communicate with client version 4"}
The workaround was to switch from the latest EMR image (3.0.4 with Hadoop 2.2.0) to a Hadoop 1.0 image (2.4.2 with Hadoop 1.0.3).
I then hit another issue where it couldn't find the Hive jar properly. After struggling with the configuration some more, I decided I had dumped enough time into trying to get this to work and chose to communicate with Hive directly (using RBHive for Ruby and JDBC for the JVM).
To answer my own question: it is possible to run WebHCat on EMR, but it's not documented at all (Googling led me nowhere, which is why I created this question in the first place; it's currently the first hit when you search "WebHCat EMR"), and the WebHCat documentation leaves a lot to be desired. Getting it to work seems like a pain, though my hope is that by writing up the initial steps, someone will come along, take it the rest of the way, and post a complete answer.
I did not test it, but it should be doable.
EMR allows you to customise the bootstrap actions, i.e. the scripts that run when the nodes are started. You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster.
See more details at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html.
I would create a shell script to install WebHCat and test that script on a regular EC2 instance first (outside the context of EMR, just as a test to ensure the script is OK).
You can use EC2's user-data property to test your script, typically:
#!/bin/bash
curl http://path_to_your_install_script.sh | sh
Then - once you know the script is working - make it available to the cluster on an S3 bucket and follow these instructions to include your script as a custom bootstrap action of your cluster.
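A rough sketch of wiring the install script in as a bootstrap action at cluster creation time (the bucket, script name, key name, release label, and instance settings are all placeholders):
aws emr create-cluster \
  --name "webhcat-cluster" \
  --release-label emr-4.7.0 \
  --applications Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=MY_KEY \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/install-webhcat.sh,Name=InstallWebHCat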
--Seb