Discrepancy between AWS Glue and its Dev Endpoint

My understanding is that Dev Endpoints in AWS Glue can be used to develop code iteratively and then deploy it to a Glue job. I find this especially useful when developing Spark jobs, because every time you run a job it takes several minutes to launch a Hadoop cluster in the background. However, I am seeing a discrepancy when using Python shell in Glue instead of Spark. import pg doesn't work in a Dev Endpoint I created using a SageMaker JupyterLab Python notebook, but it works in AWS Glue when I create a job using Python shell. Shouldn't the same libraries exist in the dev endpoint that exist in Glue? What is the point of having a dev endpoint if you cannot run the same code in both places (the dev endpoint and the Glue job)?

Firstly, Python shell jobs do not launch a Hadoop cluster in the background, because they do not give you a Spark environment for your jobs.
Secondly, since PyGreSQL is not written in pure Python, it will not work in Glue's native Spark environment (Glue Spark jobs, dev endpoints, etc.).
Thirdly, Python shell jobs have additional support for certain packages built in.
Thus, I don't see the point of using a dev endpoint for Python shell jobs.
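For reference, a minimal sketch of the usage in question, i.e. the classic pg (PyGreSQL) interface inside a Glue Python shell job; the host, database, and credentials below are placeholders:

import pg  # PyGreSQL classic interface, available to Glue Python shell jobs

# placeholder connection details -- substitute your own endpoint and credentials
db = pg.DB(dbname='mydb', host='my-postgres.example.com', port=5432,
           user='admin', passwd='secret')
rows = db.query('SELECT 1').getresult()
print(rows)
db.close()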

Related

Connect Databricks cluster with local machine (AWS)

I want to connect to a Databricks cluster (AWS) from my local machine, but I want to execute the entire code on the cluster. With Databricks Connect, only the Spark code is executed on the cluster. I'm looking for an alternative solution, such as an SSH interpreter or something similar. I work with PyCharm (IDE).
I would go with an approach like this (but you need to write a small script for your IDE; see the sketch after this list):
you commit to some branch in git (like staging)
your IDE executes the Databricks CLI command "databricks repos update", which performs a pull
your IDE executes a Databricks CLI jobs command to run the notebook from the repo
The Databricks CLI functionality can be invoked as a REST API, from bash/cmd, or imported as an SDK into your programming language
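A minimal sketch of such an IDE-side script, assuming the databricks-cli is installed and configured; the branch name, repo id, and job id are placeholders, and the exact flags may differ between CLI versions:

# push the current work, sync the Databricks repo, then trigger the job
import subprocess

subprocess.run(["git", "push", "origin", "staging"], check=True)
subprocess.run(["databricks", "repos", "update",
                "--repo-id", "123456", "--branch", "staging"], check=True)
subprocess.run(["databricks", "jobs", "run-now", "--job-id", "987"], check=True)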

Set Spark version for Sagemaker on Glue Dev Endpoint

To create my Glue scripts, I use development endpoints with SageMaker notebooks that run the PySpark (Sparkmagic) kernel.
The latest version of Glue (version 1.0) supports Spark 2.4. However, my SageMaker notebook uses Spark version 2.2.1.
The function I want to test only exists as of Spark 2.3.
Is there a way to solve this mismatch between the dev endpoint and the Glue job? Can I somehow set the Spark version of the notebook?
I couldn't find anything in the documentation.
When you create a SageMaker notebook for a Glue dev endpoint, it launches a SageMaker notebook instance with a specific lifecycle configuration. This lifecycle configuration sets up the connection between the SageMaker notebook and the development endpoint. When you run cells from the PySpark kernel, the code is sent via REST APIs to the Livy server running on the development endpoint.
Thus, the PySpark version that you see and on which the SageMaker notebook runs depends on the development endpoint and is not configurable from the SageMaker point of view.
Since Glue is a managed service, root access to the development endpoint is restricted, so you cannot upgrade Spark to a later version yourself. Support for Spark 2.4 was only recently introduced in Glue, and it seems it has not yet been released for dev endpoints.
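As a quick sanity check, you can confirm which Spark version the endpoint actually provides from a notebook cell (assuming the default Sparkmagic session objects sc and spark are available):

# run in a PySpark (Sparkmagic) cell; sc/spark are the pre-created contexts
print(sc.version)      # Spark version of the dev endpoint, e.g. 2.2.1
print(spark.version)   # same information via the SparkSession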

How to test AWS Glue code without dev endpoint

I would like to avoid the AWS dev endpoint. Is there a way to test and debug my PySpark code in a local notebook/IDE, without using an AWS dev endpoint?
As others have said, it depends on which parts of Glue you are going to use. If your code is based on pure Spark, without DynamicFrames etc., then a local installation of Spark may suffice. If, however, you intend to use the Glue extensions, there is not really an option other than the dev endpoint at this stage.
I hope that this helps.
If you are going to deploy your PySpark code on the AWS Glue service, you may have to use GlueContext and other AWS Glue APIs (see the sketch below for the kind of code this means). So if you would like to test against the AWS Glue service using these AWS Glue APIs, then you have to have an AWS dev endpoint.
However, having an AWS Glue notebook is optional: you can set up Zeppelin or similar locally and establish an SSH tunnel to the AWS Glue dev endpoint for development/testing from your local environment. Make sure you delete the dev endpoint once your development/testing is done for the day.
Alternatively, if you are not keen on using AWS Glue APIs beyond GlueContext, then yes, you can set up Zeppelin in your local environment, test the code locally, then upload your code to S3 and create a Glue job for testing in the AWS Glue service.
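To illustrate the distinction, here is a minimal sketch of Glue-specific code that needs the Glue libraries (and hence a dev endpoint or the Glue service) rather than plain local Spark; the database and table names are placeholders:

from pyspark.context import SparkContext
from awsglue.context import GlueContext  # not available in a plain local Spark install

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# read a Glue Data Catalog table into a DynamicFrame
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")
dyf.printSchema()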
We have a setup here where we have PySpark installed locally, and we use VS Code to develop, unit test, and debug our PySpark code. We run the code against the local PySpark installation during development, then we deploy it to EMR to run against real datasets.
I'm not sure how much of this applies to what you're trying to do with Glue, since Glue sits a level higher in abstraction.
We use pytest to test the PySpark code. We keep the PySpark code in a separate file and call those functions from the Glue script. With this separation, we can unit test the PySpark code using pytest (a minimal sketch follows below).
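A sketch of that separation, assuming a local PySpark installation; the module, function, and column names are illustrative:

# transforms.py -- pure PySpark logic, no Glue imports
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def add_full_name(df: DataFrame) -> DataFrame:
    return df.withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))

# test_transforms.py -- run with pytest against a local SparkSession
from pyspark.sql import SparkSession
from transforms import add_full_name

def test_add_full_name():
    spark = SparkSession.builder.master("local[1]").appName("unit-test").getOrCreate()
    df = spark.createDataFrame([("Ada", "Lovelace")], ["first_name", "last_name"])
    assert add_full_name(df).collect()[0]["full_name"] == "Ada Lovelace"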
I was able to test without dev endpoints; please follow the instructions here:
https://support.wharton.upenn.edu/help/glue-debugging

How to run glue script from Glue Dev Endpoint

I have a Glue script (test.py) written, say, in an editor. I connected to the Glue dev endpoint and copied the script to the endpoint, or I can store it in an S3 bucket. A Glue endpoint is basically an EMR cluster, so how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I'm more interested in knowing whether I can run it from the Glue dev endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
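For context, a minimal sketch of how such Job-class scripts resolve those parameters; if your script looks like this, pass --JOB_NAME (and --TempDir, pointing at an S3 path you can write to) on the gluepython command line:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

# resolves values passed as --JOB_NAME ... --TempDir ...
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TempDir'])
job = Job(GlueContext(SparkContext.getOrCreate()))
job.init(args['JOB_NAME'], args)
# ... your ETL logic ...
job.commit()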
For development/testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue endpoint URL, so you have access to the Data Catalog, crawlers, etc., as well as the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well.
Please refer here and to setting up Zeppelin on Windows for help with setting up the local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you have set up the Zeppelin notebook, you can copy the script (test.py) into it and run it from Zeppelin.
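If you prefer to script that last step, creating the job can also be done with boto3; a rough sketch, where the job name, IAM role, script location, and TempDir are placeholders:

import boto3

glue = boto3.client('glue', region_name='us-east-1')
glue.create_job(
    Name='test-etl-job',
    Role='MyGlueServiceRole',                       # IAM role with access to the script and data
    Command={'Name': 'glueetl',                     # 'glueetl' = Spark ETL job
             'ScriptLocation': 's3://my-bucket/scripts/test.py',
             'PythonVersion': '3'},
    DefaultArguments={'--TempDir': 's3://my-bucket/tmp/'},
    GlueVersion='1.0',
)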
According to AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Do you have a specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives you more flexibility: you can use any third-party Python libraries and run directly on an EMR Spark cluster.
Regards

How to run a Spark jar file from AWS Console without Spark-Shell

I'm trying to run a Spark application on the AWS EMR console (Amazon Web Services). My Scala script, compiled into the jar, takes the SparkConf settings as parameters or just strings:
val sparkConf = new SparkConf()
.setAppName("WikipediaGraphXPageRank")
.setMaster(args(1))
.set("spark.executor.memory","1g")
.registerKryoClasses(Array(classOf[PRVertex], classOf[PRMessage]))
However, I don't know how to pass the master-URL parameter and other parameters to the jar when it's uploaded and I set up the cluster. To be clear, I'm aware that if I were running the Spark shell I would do this another way, but I'm a Windows user, and with the current set-up and the work I've done, it would be very useful to have some way to pass the master URL to the EMR cluster in the 'steps'.
I don't want to use the Spark shell; I have a tight deadline, I have everything set up this way, and it feels like this small issue of passing the master URL as a parameter should be possible, considering that AWS has a guide for running stand-alone Spark applications on EMR.
Help would be appreciated!
Here are instructions on using spark-submit via EMR Step: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/spark-submit-via-step.md
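If it helps, the same "spark-submit via step" approach can also be scripted with boto3. A rough sketch, assuming an existing cluster and a jar already uploaded to S3; the cluster id, bucket, class name, and application arguments are placeholders, and everything after the jar path is handed to your main method, so args(1) can carry the master URL your script expects:

import boto3

emr = boto3.client('emr', region_name='us-east-1')
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',                     # your running EMR cluster
    Steps=[{
        'Name': 'WikipediaGraphXPageRank',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',             # lets the step invoke spark-submit
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                '--class', 'WikipediaGraphXPageRank',
                's3://my-bucket/jars/pagerank.jar',
                's3://my-bucket/input/',             # args(0), for example
                'yarn',                              # args(1): master URL read by setMaster(args(1))
            ],
        },
    }],
)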