AWS Glue interactive session in SageMaker notebooks via lifecycle configurations - amazon-web-services

I am trying to work with a Glue interactive session in a SageMaker notebook by configuring the Glue conda PySpark kernel via an AWS lifecycle configuration. This worked earlier when I created the notebook instance. Now the instance is running with the configuration attached, but I can no longer see the conda Glue PySpark kernel in the kernel list. Could anybody help with the create script and start script needed to run the notebook with Glue PySpark?
I am configuring using this AWS doc: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-sagemaker.html#is-sagemaker-existing
I also took help from the AWS sample lifecycle scripts on GitHub: https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/install-conda-package-single-environment/on-start.sh
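For reference, a minimal sketch of what such an on-start script can look like, assuming the kernel ships inside the aws-glue-sessions pip package (the environment name and kernelspec path below are assumptions; the linked AWS doc has the authoritative script):
#!/bin/bash
set -ex
sudo -u ec2-user -i <<'EOF'
# Install the Glue interactive sessions package into the notebook's Jupyter environment.
# Environment name and kernel path are assumptions; adjust to the AWS doc if they differ.
source /home/ec2-user/anaconda3/bin/activate JupyterSystemEnv
pip install aws-glue-sessions
# Register the PySpark kernel shipped inside the package so it shows up in the kernel list.
SITE_PACKAGES=$(pip show aws-glue-sessions | grep Location | awk '{print $2}')
jupyter kernelspec install "$SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_pyspark" --user
EOF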

Related

Does Terraform provide a resource to create EMR notebooks?

I have a transient EMR cluster up and ready, and I want to run a simple PySpark script in an EMR notebook.
Is there any way to create and modify the EMR notebook through Terraform?
Thanks in advance.
As far as I know, AWS says "You create an EMR notebook using the Amazon EMR console. Creating notebooks using the AWS CLI or the Amazon EMR API is not supported." [AWS Documentation on creating EMR Notebook][1]
You can create a notebook via the console; the notebook will be stored in S3 as an .ipynb file. By giving its relative path, you can execute the notebook on the cluster. Refer to boto3 for more info [Boto3 Documentation][2]
[1]: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-create.html
[2]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.start_notebook_execution
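As a sketch, the same execution can also be started from the AWS CLI; the editor ID, cluster ID, and service role below are placeholders:
# Run an EMR notebook (already created in the console and stored in S3) against an existing cluster.
# All IDs and the role name are placeholders.
aws emr start-notebook-execution \
    --editor-id e-XXXXXXXXXXXXXXXXXXXXXXXXX \
    --relative-path my_pyspark_script.ipynb \
    --notebook-execution-name demo-run \
    --execution-engine '{"Id":"j-XXXXXXXXXXXXX"}' \
    --service-role EMR_Notebooks_DefaultRole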
Yes, you can create and modify an EMR cluster from Terraform and choose which tools will be installed, but this seems like the "hard way". Easier would be a SageMaker notebook or the new AWS Glue DataBrew tool.

How can I schedule a .ipynb notebook in SageMaker using AWS Lambda and a lifecycle configuration?

I want to schedule my .ipynb file with AWS Lambda. I am following the steps of this publication: https://towardsdatascience.com/automating-aws-sagemaker-notebooks-2dec62bc2c84. Starting and stopping the notebook instance works very well, but my .ipynb file is not executing. I wrote the lifecycle configuration the same way as in the publication mentioned above.
I only changed these lines to match my notebook instance:
NOTEBOOK_FILE="/home/ec2-user/SageMaker/Test Notebook.ipynb"
source /home/ec2-user/anaconda3/bin/activate "$ENVIRONMENT"
source /home/ec2-user/anaconda3/bin/deactivate
CloudWatch is working very well for the notebook instance, but the .ipynb file is not executed.
Can someone help me with this problem?
Check out this AWS sample of how to run a notebook in SageMaker.
This document shows how to install and run the sagemaker-run-notebook library, which lets you run and schedule Jupyter notebook executions as SageMaker Processing Jobs.
This library provides three interfaces to the notebook execution functionality:
A command line interface (CLI)
A Python library
A JupyterLab extension that can be enabled for JupyterLab running locally, in SageMaker Studio, or on a SageMaker notebook instance
https://github.com/aws-samples/sagemaker-run-notebook
Also, check out this example of scheduling Jupyter notebooks on SageMaker. You can write code in a Jupyter notebook and run it on an Amazon SageMaker ephemeral instance with the click of a button, either immediately or on a schedule. With the tools provided here, you can do this from anywhere: at a shell prompt, in JupyterLab on Amazon SageMaker, in another JupyterLab environment you have, or automated in a program you’ve written.
https://aws.amazon.com/blogs/machine-learning/scheduling-jupyter-notebooks-on-sagemaker-ephemeral-instances/
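If that library fits your case, its CLI usage looks roughly like the lines below; the exact command names and flags are based on the sample's README, so treat them as assumptions and verify against the repository:
# Run a notebook once as a SageMaker Processing Job (notebook name and parameter are placeholders).
run-notebook run mynotebook.ipynb -p input_date=2021-01-01
# Schedule the same notebook to run nightly (cron expression is an example).
run-notebook schedule --at "cron(15 1 * * ? *)" --name nightly mynotebook.ipynb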

Set Spark version for SageMaker on Glue Dev Endpoint

To create my Glue scripts, I use development endpoints with SageMaker notebooks that run the PySpark (Sparkmagic) kernel.
The latest version of Glue (version 1.0) supports Spark 2.4. However, my SageMaker notebook uses Spark version 2.2.1.
The function I want to test only exists as of Spark 2.3.
Is there a way to resolve this mismatch between the dev endpoint and the Glue job? Can I somehow set the Spark version of the notebook?
I couldn't find anything in the documentation.
When you create a SageMaker notebook for the Glue dev endpoint, it launches a SageMaker notebook instance with a specific lifecycle configuration. This LC provides the configurations to create a connection between the SageMaker notebook and the development endpoint. Upon running cells from the PySpark kernel, the code is sent to the Livy server running in the development endpoint via REST APIs.
Thus, the PySpark version that you see and on which the SageMaker notebook runs depends on the development endpoint and is not configurable from the SageMaker point of view.
Since Glue is a managed service, root access is restricted on the development endpoint, so you cannot update the Spark version to a later one. Support for Spark 2.4 (Glue version 1.0) was introduced only recently in Glue, and it seems it has not yet been released for dev endpoints.

How to run a Glue script from a Glue Dev Endpoint

I have a Glue script (test.py) written, say, in an editor. I connected to a Glue dev endpoint and copied the script to the endpoint, or I can store it in an S3 bucket. A Glue dev endpoint is basically an EMR cluster; now, how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I am more interested in whether I can run it from the Glue dev endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
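For example (the job name and bucket path below are placeholders):
[glue@ip-w-x-y-z ~]$ gluepython myscript.py --JOB_NAME my_test_job --TempDir s3://my-bucket/glue-temp/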
For development/testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue dev endpoint URL, so that you have access to the Data Catalog, crawlers, etc., and also the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well.
Please refer here, and to setting up Zeppelin on Windows, for any help with setting up the local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you have set up the Zeppelin notebook, you can copy the script (test.py) into it and run it from Zeppelin.
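The SSH piece of that setup is typically a local port forward to the dev endpoint, along these lines (the key path and endpoint DNS are placeholders; port 9007 and the forwarded address follow the Glue tutorial, so verify against the SSH command shown for your endpoint in the console):
# Forward local port 9007 to the dev endpoint so the local Zeppelin interpreter can reach it.
ssh -i /path/to/private-key.pem -NTL 9007:169.254.76.1:9007 glue@<dev-endpoint-public-dns>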
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Do you have any specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives more flexibility, and you can use any third-party Python libraries and run directly on an EMR Spark cluster.
Regards

How to connect Jenkins and AWS DynamoDB

I want to store values from Jenkins environment variables in AWS DynamoDB. Could anyone help me with how to connect Jenkins and DynamoDB, either using manual configuration or using a Jenkins shell command?
Thank you in advance.
You can use the AWS CLI for this, launching the command from your Jenkins pipeline code.
More info here:
Using the AWS CLI with DynamoDB
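For example, a shell step along these lines writes a couple of standard Jenkins variables to a table (the table name and attribute names are placeholders):
# Write Jenkins build metadata to DynamoDB; JOB_NAME and BUILD_ID are standard Jenkins env vars.
aws dynamodb put-item \
    --table-name jenkins-builds \
    --item '{"JobName": {"S": "'"$JOB_NAME"'"}, "BuildId": {"S": "'"$BUILD_ID"'"}}'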